Web Crawler Study Notes (1): A First Scrape of the Douban Movie Board
Author: Internet
I. Task
1. Scrape the details of the film ranked first on the Douban board
2. Scrape the details of every film on the Douban recent hot board
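II. Task description
1. URL: https://maoyan.com/board (this address is Maoyan's movie board, which is the page the source code below actually requests)
2. Request the page with the urlopen function from the request module of the urllib library. Once the page is retrieved, use the BeautifulSoup library to locate the HTML tags that hold the information needed (tag lookup is done with BeautifulSoup's find and findAll functions)
3. Handle exceptions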
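The request and exception-handling steps above can be sketched roughly as follows. This is only an illustration using the standard library: the helper name `fetch_html`, the `opener` parameter (added so the function can be exercised without a network), the shortened User-Agent value, and the 10-second timeout are all assumptions, not part of the original scripts.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Shortened, illustrative User-Agent; the scripts in this post use a full Chrome string.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, opener=urlopen, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    request = Request(url, headers=DEFAULT_HEADERS)
    try:
        with opener(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as error:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print(f"HTTP error: {error.code}")
    except URLError as error:
        # The server could not be reached (DNS failure, refused connection, ...).
        print(f"Could not reach the server: {error.reason}")
    return None
```

Calling `fetch_html("https://maoyan.com/board")` would return the raw page bytes on success, which can then be passed to BeautifulSoup.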
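III. Libraries and modules used
1. The request module of the urllib library
2. The find and findAll functions of the BeautifulSoup library (findAll is the legacy spelling of find_all)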
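To show how the two lookup functions differ, here is a small sketch that runs on an inline HTML string instead of a live page. The markup in `SAMPLE_HTML` is made up for illustration; it only imitates the tag and class names that the scripts in this post search for.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; it only mimics the tag/class names used in this post.
SAMPLE_HTML = """
<dl>
  <dd><p class="name"><a>Film A</a></p><p class="star">Starring: X</p></dd>
  <dd><p class="name"><a>Film B</a></p><p class="star">Starring: Y</p></dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns only the first matching tag (or None when nothing matches).
first_name = soup.find("p", {"class": "name"}).get_text(strip=True)

# find_all returns every match; findAll is the older spelling of the same method.
all_names = [p.get_text(strip=True) for p in soup.find_all("p", {"class": "name"})]

print(first_name)
print(all_names)
```

This is why the first script, which needs only the top film, can use find alone, while the second script first collects every `dd` entry and then searches inside each one.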
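IV. Results and notes
1. Note: the first script outputs the details of the top-ranked film, 《1950他们正年轻》
2. Note: the second script outputs the films on the recent hot board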
V. Source code
1.
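from urllib.request import Request, urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    url = "https://maoyan.com/board"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
    }
    # Request the page through urlopen with a browser-like User-Agent,
    # so that the urllib exceptions handled below are the ones that can be raised
    resp = urlopen(Request(url, headers=headers))
    soup = BeautifulSoup(resp.read(), 'html.parser')
    # .string returns the tag's text; strip() removes the surrounding whitespace
    # Film title (find returns the first match, i.e. the top-ranked film)
    name1 = soup.find('p', {'class': 'name'}).string.strip()
    # Starring
    stars = soup.find('p', {'class': 'star'}).string.strip()
    # Release date
    releasetime1 = soup.find('p', {'class': 'releasetime'}).string.strip()
    # Score: the integer and fractional parts sit in two separate <i> tags
    score = soup.find('i', {'class': 'integer'}).string + soup.find('i', {'class': 'fraction'}).string
    print("Film: " + name1)
    print("Starring: " + stars)
    print("Release date: " + releasetime1)
    print("Score: " + score)
except HTTPError as e:
    # The server responded with an error status
    print(e)
except URLError as e:
    # The server could not be reached at all
    print('The server could not be found')
else:
    print('It Worked!')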
2.
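from urllib.request import Request, urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    url = "https://maoyan.com/board"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
    }
    # Request the page through urlopen with a browser-like User-Agent,
    # so that the urllib exceptions handled below are the ones that can be raised
    resp = urlopen(Request(url, headers=headers))
    soup = BeautifulSoup(resp.read(), 'html.parser')
    # Each <dd> holds one film on the board (find_all is the current name of findAll)
    dds = soup.find_all('dd')
    # .string returns the tag's text; strip() removes the surrounding whitespace
    for dd in dds:
        # Film title
        name1 = dd.find('p', {'class': 'name'}).string.strip()
        print("Film: " + name1)
        # Starring
        stars = dd.find('p', {'class': 'star'}).string.strip()
        print(stars)
        # Release date
        releasetime1 = dd.find('p', {'class': 'releasetime'}).string.strip()
        print("Release date: " + releasetime1)
        # Score: the integer and fractional parts sit in two separate <i> tags
        score = dd.find('i', {'class': 'integer'}).string + dd.find('i', {'class': 'fraction'}).string
        print("Score: " + score)
except HTTPError as e:
    # The server responded with an error status
    print(e)
except URLError as e:
    # The server could not be reached at all
    print('The server could not be found')
else:
    print('It Worked!')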
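Both scripts read `.string` directly from the result of find, which raises AttributeError whenever a tag is missing from an entry. One possible way to make the loop more tolerant is sketched below. The helper names `text_or_default` and `parse_board`, and the choice to return a list of dicts instead of printing, are assumptions made for illustration and are not part of the original scripts.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, tag, css_class, default=""):
    """Return the stripped text of the first matching child, or default."""
    node = parent.find(tag, {"class": css_class})
    return node.get_text(strip=True) if node is not None else default


def parse_board(html):
    """Turn each <dd> entry into a dict holding the four fields printed above."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for dd in soup.find_all("dd"):
        movies.append({
            "name": text_or_default(dd, "p", "name"),
            "star": text_or_default(dd, "p", "star"),
            "releasetime": text_or_default(dd, "p", "releasetime"),
            # The score is split across two <i> tags, as in the scripts above.
            "score": text_or_default(dd, "i", "integer")
                     + text_or_default(dd, "i", "fraction"),
        })
    return movies
```

With this shape, an entry that lacks a score simply gets an empty string for that field, and the rest of the board is still collected.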
Source: https://blog.csdn.net/weixin_46490924/article/details/122514064