Notes on Pitfalls When Using a Web Crawler
Author: 互联网
A while back I built a simple Taobao crawler for a project. Since it was my first time writing a crawler, a lot of it was unclear to me, and I started out by copying code straight from Baidu search results. In the end, combining HtmlUnit with Jsoup finally got me the data. The method looks roughly like this:
public static Document getTaobaoDetail(String url, String cookieStr)
        throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
    // Build a WebClient that emulates the Chrome browser
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    //WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER); // IE returns nothing
    //WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    //WebClient webClient = new WebClient(BrowserVersion.EDGE);
    // Silence HtmlUnit's logging output
    LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log",
            "org.apache.commons.logging.impl.NoOpLog");
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    // Enable JavaScript: most e-commerce pages are rendered dynamically with JS
    webClient.getOptions().setJavaScriptEnabled(true);
    // Set the home page to the page you want to crawl
    webClient.getOptions().setHomePage(url);
    webClient.getBrowserVersion().setUserAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
    // CSS is not needed for data extraction, so disable it
    webClient.getOptions().setCssEnabled(false);
    // Disable native ActiveX support
    webClient.getOptions().setActiveXNative(false);
    // Do not throw when a script on the page errors out
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // Do not throw on a failing (non-2xx) HTTP status code
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.addRequestHeader("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
    webClient.addRequestHeader("accept-encoding", "gzip, deflate, br");
    webClient.addRequestHeader("accept-language", "zh-CN,zh;q=0.9");
    webClient.addRequestHeader("Connection", "keep-alive");
    // webClient.addRequestHeader("referer", "copy from the browser if needed; not required here");
    // webClient.addRequestHeader("cookie", "cookie"); // setting cookies roughly works like this:
    String[] cookies = cookieStr.split(";");
    for (int i = 0; i < cookies.length; i++) {
        String str = cookies[i].trim();
        Cookie cookie = new Cookie("s.taobao.com", str.split("=")[0], str.split("=")[1]);
        webClient.getCookieManager().addCookie(cookie);
    }
    HtmlPage rootPage = webClient.getPage(url);
    int status = rootPage.getWebResponse().getStatusCode();
    System.out.println("HTTP status code: " + status);
    if (status == 302) { // 302 means a redirect to a new page; if we see it, we did not get the data
        Thread.sleep(2000);
        rootPage = webClient.getPage(url); // retry after a short pause
    }
    URL redirectUrl = rootPage.getUrl();
    System.out.println("Redirected URL: " + redirectUrl);
    // Give the page's background JavaScript time to finish running
    webClient.waitForBackgroundJavaScript(10000);
    try {
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    String html = rootPage.asXml();
    // Once the page is rendered, parse it with Jsoup
    Document document = Jsoup.parse(html);
    webClient.close();
    return document;
}
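One pitfall in the cookie loop above: `str.split("=")[1]` breaks when a cookie value itself contains an `=` (common with base64-encoded Taobao tokens, which often end in `==`), either throwing on a malformed pair or silently truncating the value. A safer sketch is to split each pair on the first `=` only. The `CookieParser` class and the sample cookie names below are my own illustration, not from the original post:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieParser {
    // Parse a raw "k1=v1; k2=v2" cookie header into name/value pairs.
    // Each pair is split on the FIRST '=' only, so values that themselves
    // contain '=' (e.g. base64 tokens) survive intact.
    public static Map<String, String> parse(String cookieStr) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : cookieStr.split(";")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) continue;
            int eq = trimmed.indexOf('=');
            if (eq < 0) continue; // skip malformed fragments with no '='
            cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return cookies;
    }

    public static void main(String[] args) {
        // "_tb_token_" here is just a placeholder cookie name for the demo
        Map<String, String> c = parse("thw=cn; _tb_token_=abc==; cna=x=1");
        System.out.println(c);
    }
}
```

Each resulting entry can then be fed to HtmlUnit's `Cookie(domain, name, value)` constructor as in the loop above, instead of indexing into `split("=")`.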
Tags: addRequestHeader, pitfalls, notes, url, crawler, cookie, webClient, getOptions, WebClient  Source: https://blog.csdn.net/plkiop911/article/details/86576483