
Crawling Weibo Homepage Recommendations with Apache HttpClient

Step by step...

This article is based on the following environment:

  • IntelliJ IDEA
  • Maven

Creating the Project

First, create a new Maven project named http.

Add the org.apache.httpcomponents dependency:

pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>http</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- this is the dependency we need -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.6</version>
        </dependency>
    </dependencies>
</project>

Writing the Source Code

First, get the main class in place, just as you would when writing any ordinary program.

Step 1

Let's take Bing as an example. Type "必应" into the Bing search box and copy the address from the address bar: you'll notice that "必应" has turned into "%E5%BF%85%E5%BA%94". Nothing to worry about; that is simply the URL encoding of "必应". If you know the curl command and have ever run into mojibake while using it, this is the same root cause: Chinese characters being converted into a URL as GBK in one place and UTF-8 in another.
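As a quick aside, here is a minimal sketch of where that percent-encoded string comes from (the class name UrlEncodeDemo is just an illustrative placeholder, not part of the crawler):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // the UTF-8 bytes of "必应", percent-encoded: prints %E5%BF%85%E5%BA%94
        System.out.println(URLEncoder.encode("必应", "UTF-8"));
        // encoding the same characters as GBK yields different bytes,
        // which is exactly the UTF-8/GBK mismatch behind the curl mojibake mentioned above
        System.out.println(URLEncoder.encode("必应", "GBK"));
    }
}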

Back to the topic: we write a static method get that takes a url as its parameter.

Main.java
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        String url = "https://cn.bing.com/search?q=%E5%BF%85%E5%BA%94&qs=n&form=QBRE&sp=-1&pq=biy&sc=9-3&sk=&cvid=A076848611014A2986EBDF37E85A0AF4";
        System.out.println(get(url));
    }

    public static String get(String url) throws IOException {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = httpClient.execute(httpGet);
        return EntityUtils.toString(httpResponse.getEntity());
    }
}

At this point Len ran into an error:

不支持发行版本5 (release version 5 is not supported)

This problem is not serious; a search engine can explain and solve it for you (it usually comes down to the Maven compiler defaulting to an old Java language level; one possible fix is sketched right after this paragraph). Once it is fixed, the program prints a big block of response text starting with

<!DOCTYPE html>

Looks familiar, doesn't it? It is very similar to what the cURL command returns.
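A minimal sketch of that fix, assuming a JDK 1.8 install (adjust the version to whatever JDK you actually use); it goes inside the <project> element of pom.xml:

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>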

Step 2

Following the same idea as the previous step, we now turn to Weibo. This time we need to add headers to the get method.

This uses a few methods of the Map class; just make sure you understand what they do:

// httpClient is a static field of the class:
// static CloseableHttpClient httpClient = HttpClients.createDefault();
public static String getWithHeader(String url, Map<String, String> headers) throws IOException {
    HttpGet httpGet = new HttpGet(url);
    Header[] header = getHeaders(headers);
    httpGet.setHeaders(header);
    HttpResponse httpResponse = httpClient.execute(httpGet);
    return EntityUtils.toString(httpResponse.getEntity());
}

private static Header[] getHeaders(Map<String, String> headers) {
    Header[] header = null;
    if (headers != null) {
        header = new Header[headers.size()];
        int index = 0;
        // turn each map entry into a BasicHeader
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            header[index++] = new BasicHeader(entry.getKey(), entry.getValue());
        }
    }
    return header;
}

As for the main function, it is simply:

String url = "https://weibo.com/u//*your id*//home?wvr=5&lf=reg";
Map<String, String> map = new LinkedHashMap<>();
map.put("Cookie", "/*replace this with your own cookie*/");
/* if it fails, add all the request headers in the same way as above */

System.out.println(getWithHeader(url, map));
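If the cookie alone is not enough, you can add the remaining request headers to the same map. A small sketch (the values below are placeholders; copy the real ones from the request shown in your browser's F12 Network panel):

// hypothetical extra headers; take the actual values from your own browser
map.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
map.put("Referer", "https://weibo.com/");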

Step 3

In the previous step we obtained the page as HTML text. By searching through it, we find that the actual Weibo content is stored inside <script> tags: there is a global JS function FM.view() whose argument is a JSON-formatted string, and that object has a key named "html" whose value turns out to be an HTML string. Starting from there, we solve the problem in three small steps:

Import the HTML-parsing and JSON-parsing dependencies

<!-- HTML parsing -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>
<!-- JSON parsing -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.28</version>
</dependency>

Parse the main page HTML

We use a for loop to take the data of each <script> element as a string and check whether that string contains "pl.content.homefeed.index".

That string is the value of the "ns" key in the JSON carried by the script that holds the Weibo content.

Document document = Jsoup.parse(s);
Elements elements = document.getElementsByTag("script"); // get all <script> elements

for (Element script : elements) {
    String t = script.data();
    if (t.contains("pl.content.homefeed.index")) { // look for the target content
        // process the value we found here, i.e. insert the code from the next step
        break;
    }
}

Parse the extracted JSON

The t we obtained has the form

FM.view(/*json*/)

so we only need to strip the leading "FM.view(" and the final closing parenthesis to be left with a JSON string; the String class's built-in substring method is enough for that.
The structure of this JSON is as follows:

{
    "ns": "pl.content.homefeed.index",
    "domid": "v6_pl_content_homefeed",
    "css": ["style/css/module/tab/comb_WB_tab_profile.css?version=886df3a0bec79202", "style/css/module/list/comb_WB_feed.css?version=886df3a0bec79202"],
    "js": ["home/js/pl/content/homefeed/index.js?version=fcd2eaf6d3a89ceb"],
    "html": // the HTML string, i.e. the Weibo content we want to extract
}

So we need to take the value whose key is html and wrap it into a complete HTML document, otherwise the output will be garbled:
// 8 is the length of "FM.view("; lastIndexOf(")") drops the trailing parenthesis
JSONObject jsonObject = JSONObject.parseObject(t.substring(8, t.lastIndexOf(")")));

String html = "<!DOCTYPE html>\n" +
        "<html lang=\"en\">\n" +
        "<head>\n" +
        " <meta charset=\"UTF-8\">\n" +
        " <title>Title</title>\n" +
        "</head>\n" +
        "<body>\n" +
        jsonObject.get("html").toString() +
        "\n</body>\n" +
        "</html>";

You don't actually have to type that HTML skeleton by hand: create a new .html file somewhere in the project and IDEA generates it for you; then it's just Ctrl+C, Ctrl+V, and dropping jsonObject.get("html").toString() into the middle.

Parse the HTML and extract the information

Open Weibo and press F12 to bring up the developer tools,
then click the element picker in the top-left corner of the developer panel (its exact location varies by browser) to select and inspect elements on the page.

  • Chrome shortcut: Ctrl+Shift+C

Then click the body of a Weibo post to see its markup; what we mainly look at is the class of these tags.

For example, the block that wraps each Weibo post:

<div class="WB_feed_detail clearfix" node-type="feed_content" style="background-image:url(//vip.storage.weibo.com/feed_cover/star_1108_pc_x2.png?version=202004051333)">
    <!-- middle part omitted -->
</div>

It carries both the WB_feed_detail class and the clearfix class, so we can select it with a class selector and then pick out the information we want from inside it. We use WB_feed_detail as the selector, because a quick search shows that some parts of the page that are not Weibo content also carry the clearfix class.

document = Jsoup.parse(html); // parse the HTML; reuse the document variable to save memory
Elements WB_feed_details = document.getElementsByClass("WB_feed_detail");

In the same way we can find the class names of the tags that hold the Weibo nickname, the post text, and the post source, grab the information we want, and print it with an enhanced for loop:

for (Element WB_feed_detail : WB_feed_details) {
    Elements W_f14s = WB_feed_detail.getElementsByClass("W_f14");     // the W_f14 class holds the ID and the post text
    Elements WB_froms = WB_feed_detail.getElementsByClass("WB_from"); // post source (time, client, super-topic, etc.)
    for (Element W_f14 : W_f14s) {
        System.out.println(W_f14.text());
    }
    for (Element WB_from : WB_froms) {
        System.out.println(WB_from.text());
    }
    System.out.println();
}

Done!

Below is the complete source code, which writes the 100 collected Weibo posts to a file.

import com.alibaba.fastjson.JSONObject;
import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeader;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class Main {
    static CloseableHttpClient httpClient = HttpClients.createDefault();

    public static void main(String[] args) throws IOException {
        String url = "https://weibo.com/u//*your id*//home?wvr=5&lf=reg";
        Map<String, String> map = new LinkedHashMap<>();
        map.put("Cookie", "/*replace this with your own cookie*/");

        File file = new File("collection.txt");
        if (file.exists()) {
            file.delete();
        }

        FileOutputStream fout = new FileOutputStream(file);
        int times = 100;
        while (times > 0) {
            String s = getWithHeader(url, map);
            Document document = Jsoup.parse(s);
            Elements elements = document.getElementsByTag("script"); // get all <script> elements
            for (Element script : elements) {
                String t = script.data();
                if (t.contains("pl.content.homefeed.index")) { // look for the target content
                    JSONObject jsonObject = JSONObject.parseObject(t.substring(8, t.lastIndexOf(")"))); // the extracted substring is JSON

                    String html = "<!DOCTYPE html>\n" +
                            "<html lang=\"en\">\n" +
                            "<head>\n" +
                            " <meta charset=\"UTF-8\">\n" +
                            " <title>Title</title>\n" +
                            "</head>\n" +
                            "<body>\n" +
                            jsonObject.get("html").toString() +
                            "\n</body>\n" +
                            "</html>"; // build a complete HTML string
                    document = Jsoup.parse(html); // parse the HTML; reuse the document variable to save memory
                    Elements WB_feed_details = document.getElementsByClass("WB_feed_detail");
                    for (Element WB_feed_detail : WB_feed_details) {
                        Elements W_f14s = WB_feed_detail.getElementsByClass("W_f14");     // found via F12: the W_f14 class holds the ID and the post text
                        Elements WB_froms = WB_feed_detail.getElementsByClass("WB_from"); // post source (time, client, super-topic, etc.)
                        for (Element W_f14 : W_f14s) {
                            fout.write(W_f14.text().getBytes(StandardCharsets.UTF_8));
                            fout.write("\n".getBytes(StandardCharsets.UTF_8));
                        }
                        for (Element WB_from : WB_froms) {
                            fout.write(WB_from.text().getBytes(StandardCharsets.UTF_8));
                            fout.write("\n".getBytes(StandardCharsets.UTF_8));
                        }
                        fout.write("\n".getBytes(StandardCharsets.UTF_8));
                        fout.write("\n".getBytes(StandardCharsets.UTF_8));
                        times--;
                    }
                    break;
                }
            }
        }
        fout.close();
    }

    public static String getWithHeader(String url, Map<String, String> headers) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        Header[] header = getHeaders(headers);
        httpGet.setHeaders(header);
        HttpResponse httpResponse = httpClient.execute(httpGet);
        return EntityUtils.toString(httpResponse.getEntity());
    }

    private static Header[] getHeaders(Map<String, String> headers) {
        Header[] header = null;
        if (headers != null) {
            header = new Header[headers.size()];
            int index = 0;
            // turn each map entry into a BasicHeader
            for (Map.Entry<String, String> entry : headers.entrySet()) {
                header[index++] = new BasicHeader(entry.getKey(), entry.getValue());
            }
        }
        return header;
    }
}