Web Crawler Examples

2022-06-08 11:01:10 Author: Internet

1. The simplest example

The code is as follows:

import requests
url='http://www.nj29jt.net/ArticleShow.aspx?CateId=153&Id=2132'
res=requests.get(url)
res.encoding='utf-8'
print(res.text)  # res.text is the page content, i.e. what the browser shows under "View page source"

The requests module used above fetches information over the network; its get method retrieves the content of a web page.
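The call above assumes the request succeeds. As a rough sketch of a slightly more defensive version, the same fetch can be wrapped in a small helper that sets a timeout and raises on HTTP error codes. The name fetch_html, the 10-second timeout and the injectable get argument are illustrative assumptions, not part of the original example.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page at `url` as text, or raise on a network/HTTP error.

    `fetch_html`, `timeout=10` and the injectable `get` argument are
    illustrative choices, not part of the original example.
    """
    res = get(url, timeout=timeout)  # give up instead of waiting forever
    res.raise_for_status()           # raise requests.HTTPError on 4xx/5xx
    res.encoding = 'utf-8'           # same explicit encoding as in the example above
    return res.text


if __name__ == '__main__':
    url = 'http://www.nj29jt.net/ArticleShow.aspx?CateId=153&Id=2132'
    try:
        print(fetch_html(url)[:200])  # show only the first 200 characters
    except requests.RequestException as exc:
        print('request failed:', exc)
```

Accepting get as a parameter is only there so the helper can be exercised without touching the network; in normal use it can be left at its default.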

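2. Parsing the HTML

The res.text obtained above is the page's HTML as plain text. To actually extract information from the page, the HTML document has to be parsed.

The BeautifulSoup module can be used for this. It ships with Anaconda; if it needs to be installed, run pip install beautifulsoup4

Here is an example: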

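import requests
from bs4 import BeautifulSoup  # the BeautifulSoup module parses HTML pages
url='http://www.nj29jt.net/ArticleShow.aspx?CateId=153&Id=2132'
res=requests.get(url)
res.encoding='utf-8'

soup = BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly; otherwise bs4 guesses one and emits a warning
print(soup.title)  # the page title, including the html tag
print(soup.title.name)  # the name of the html tag
print(soup.title.string)  # the text inside the tag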
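To see what these three attributes return without depending on the remote page, here is a small sketch that runs the same calls on an inline HTML string. The sample markup and its title text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for res.text
sample_html = "<html><head><title>Teacher list</title></head><body></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")
title_tag = soup.title

print(title_tag)         # <title>Teacher list</title>
print(title_tag.name)    # title
print(title_tag.string)  # Teacher list
```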

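Next, let's see how to get the tables on the page (in the example page above, the teacher information is spread across many HTML tables). The code is as follows: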

import requests
from bs4 import BeautifulSoup  # the BeautifulSoup module parses HTML pages
url='http://www.nj29jt.net/ArticleShow.aspx?CateId=153&Id=2132'
res=requests.get(url)
res.encoding='utf-8'

soup = BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly; otherwise bs4 guesses one and emits a warning
alltables = soup.select('table')  # find every table on the page by tag name (table elements use the table tag)
print(type(alltables))  # alltables is a bs4.element.ResultSet object, i.e. a set of results

table = alltables[0]  # take the first element by index
print(type(table))  # bs4.element.Tag

tableStr = str(table)  # convert it to a string

for table in alltables:  # the tables can be iterated over
    pass
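The loop body above is left empty. As one possible way to fill it in, the sketch below walks each table's rows and collects the cell text into plain Python lists. The helper name table_to_rows and the sample markup are assumptions made for illustration; the real page's tables may be laid out differently.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with two small tables
sample_html = """
<table><tr><th>Name</th><th>Subject</th></tr>
<tr><td>Zhang</td><td>Math</td></tr></table>
<table><tr><th>Name</th><th>Subject</th></tr>
<tr><td>Li</td><td>Physics</td></tr></table>
"""


def table_to_rows(table):
    """Return the text of every cell in `table` as a list of rows (hypothetical helper)."""
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are both collected, in document order
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


soup = BeautifulSoup(sample_html, "html.parser")
all_rows = [table_to_rows(table) for table in soup.select("table")]
print(all_rows[0])  # [['Name', 'Subject'], ['Zhang', 'Math']]
```

Note that find_all searches recursively, so a table nested inside another table would have its rows counted in both; this sketch assumes flat tables.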

3. Using pandas to convert the tables on the page into DataFrames, which allows further processing, including export to Excel

Example code:

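import io
import requests   # the requests module fetches the page
import pandas as pd
url='http://www.nj29jt.net/ArticleShow.aspx?CateId=153&Id=2132'
res=requests.get(url)
res.encoding='utf-8'

# Pass the page content res.text to read_html; it parses every table in it and returns a list of DataFrame objects.
# Recent pandas versions deprecate passing a literal HTML string, so it is wrapped in StringIO here.
tables = pd.read_html(io.StringIO(res.text))

print(len(tables))
tables[0].to_excel('user.xlsx',index=False)  # write to Excel
tables[0]  # display the first table (in a notebook or interactive session)

df = pd.concat(tables)  # merge the tables into one
df.to_excel('alluser.xlsx',index=False)  # write to Excel

for table in tables:  # iterate over the tables
    pass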
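One detail of pd.concat worth knowing if the merged DataFrame is processed further: by default each table keeps its own row index, so index values repeat. A minimal sketch with two hand-built DataFrames (the names and subjects are made up) shows the difference ignore_index=True makes.

```python
import pandas as pd

# Two made-up tables standing in for the entries of `tables`
first = pd.DataFrame({'name': ['Zhang', 'Li'], 'subject': ['Math', 'Physics']})
second = pd.DataFrame({'name': ['Wang'], 'subject': ['Chemistry']})

stacked = pd.concat([first, second])
print(list(stacked.index))     # [0, 1, 0]  (each table keeps its own index)

renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]  (rows renumbered from 0)
```

Since the example above writes to Excel with index=False, the repeated index does not show up in the file; it only matters when rows are later selected by index.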

In fact, if all you need is the tables on a page, pandas supports that directly and the requests module is not required. The code is as follows:

import pandas as pd
url='http://www.nj29jt.net/ArticleShow.aspx?CateId=153&Id=2132'
tables = pd.read_html(url)  # pandas can fetch the page from the url itself and extract its tables
print(len(tables))
tables[0].to_excel('user.xlsx',index=False)  # write to Excel
tables[0]  # display the first table (in a notebook or interactive session)

df = pd.concat(tables)  # merge the tables into one
df.to_excel('alluser.xlsx',index=False)  # write to Excel

Source: https://www.cnblogs.com/51kata/p/16354848.html