
Web Crawler Study Notes (1): A First Pass at Scraping the Douban Movie Chart

Author: Internet

I. Tasks

1. Scrape the details of the film ranked first on the movie chart

2. Scrape the details of every film on the recent hot chart

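II. Task description

1. url: https://maoyan.com/board (the Maoyan movie board; despite the title, this is the page the code requests)

2. Use the urlopen function from the request module of the urllib library to send the request and fetch the page, then use the BeautifulSoup library to locate the HTML tags that hold the information we need (tag lookup is done with BeautifulSoup's find and findAll functions)

3. Handle exceptions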
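The request and exception-handling steps above can be sketched roughly as follows. This is only a minimal illustration, not the code from this post: the function name `fetch` and its `opener` parameter are hypothetical, added so that the example can run offline with stand-in openers instead of a real network call. Since `HTTPError` is a subclass of `URLError`, the more specific `except` clause has to come first.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch(url, headers=None, opener=urlopen):
    """Return (body_bytes, error_message); exactly one of the two is None.

    `opener` is a hypothetical hook that defaults to urlopen and exists
    only so the function can be exercised without network access.
    """
    try:
        with opener(Request(url, headers=headers or {})) as resp:
            return resp.read(), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        return None, f"HTTP error {e.code}"
    except URLError as e:
        # The server could not be reached at all (DNS failure, refused, ...)
        return None, f"could not reach server: {e.reason}"


# Stand-in openers (assumptions for illustration, no real request is sent)
def fake_404(req):
    raise HTTPError(req.full_url, 404, "Not Found", {}, io.BytesIO())


def fake_unreachable(req):
    raise URLError("no route to host")


def fake_ok(req):
    return io.BytesIO(b"<html></html>")


not_found = fetch("https://example.com/board", opener=fake_404)
unreachable = fetch("https://example.com/board", opener=fake_unreachable)
ok = fetch("https://example.com/board", opener=fake_ok)
```

With the default `opener`, a call such as `fetch(url, headers={"User-Agent": "..."})` would send a real request.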

III. Libraries and modules used

1. The request module of the urllib library

2. The find and findAll functions of the BeautifulSoup library
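To show how the two lookup functions differ, here is a small sketch run on a made-up HTML snippet. The markup in `SAMPLE` is hypothetical and is not copied from the real page. `find` returns only the first matching tag (or `None` when nothing matches), while `find_all` returns every match; `findAll` is the older spelling of `find_all`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like a chart page, for illustration only
SAMPLE = """
<dl>
  <dd><p class="name"><a>Film A</a></p><p class="star">Starring: X</p></dd>
  <dd><p class="name"><a>Film B</a></p><p class="star">Starring: Y</p></dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# find: the first <p class="name"> in document order
first_name = soup.find("p", {"class": "name"}).get_text(strip=True)

# find_all: every <p class="name">, returned as a list-like result
all_names = [p.get_text(strip=True) for p in soup.find_all("p", {"class": "name"})]

# find returns None when no tag matches, so check before reading its text
missing = soup.find("p", {"class": "releasetime"})
```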

IV. Results and notes

1. Note: this scrapes the details of the first film on the chart, 《1950他们正年轻》

2. Note: this scrapes the films on the recent hot chart

V. Source code

1.

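from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

url = "https://maoyan.com/board"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
}

try:
    # Send the request with a browser-like User-Agent, using urlopen so that
    # the HTTPError/URLError handlers below can actually be triggered
    resp = urlopen(Request(url, headers=headers))
    soup = BeautifulSoup(resp.read(), 'html.parser')
    # find() returns the first matching tag, i.e. the first film on the page
    # get_text(strip=True) returns the tag's text with surrounding whitespace removed
    # Film title
    name1 = soup.find('p', {'class': 'name'}).get_text(strip=True)
    # Starring
    stars = soup.find('p', {'class': 'star'}).get_text(strip=True)
    # Release date
    releasetime1 = soup.find('p', {'class': 'releasetime'}).get_text(strip=True)
    # Score: the integer part and the fraction part sit in separate <i> tags
    score = soup.find('i', {'class': 'integer'}).get_text(strip=True) + soup.find('i', {'class': 'fraction'}).get_text(strip=True)
    print("Film: " + name1)
    print("Starring: " + stars)
    print("Release date: " + releasetime1)
    print("Score: " + score)
except HTTPError as e:
    # The server responded with an error status
    print(e)
except URLError as e:
    # The server could not be reached
    print('The server could not be found')
else:
    print('It Worked!')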

2.

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

url = "https://maoyan.com/board"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
}

try:
    # Send the request with a browser-like User-Agent, using urlopen so that
    # the HTTPError/URLError handlers below can actually be triggered
    resp = urlopen(Request(url, headers=headers))
    soup = BeautifulSoup(resp.read(), 'html.parser')
    # Collect every <dd> tag; each one is treated as one film entry
    dds = soup.find_all('dd')
    for dd in dds:
        # Film title
        name1 = dd.find('p', {'class': 'name'}).get_text(strip=True)
        print("Film: " + name1)
        # Starring
        stars = dd.find('p', {'class': 'star'}).get_text(strip=True)
        print(stars)
        # Release date
        releasetime1 = dd.find('p', {'class': 'releasetime'}).get_text(strip=True)
        print("Release date: " + releasetime1)
        # Score: the integer part and the fraction part sit in separate <i> tags
        score = dd.find('i', {'class': 'integer'}).get_text(strip=True) + dd.find('i', {'class': 'fraction'}).get_text(strip=True)
        print("Score: " + score)
except HTTPError as e:
    # The server responded with an error status
    print(e)
except URLError as e:
    # The server could not be reached
    print('The server could not be found')
else:
    print('It Worked!')
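One weakness of the loop above is that `find` returns `None` when a tag is missing, so a single incomplete entry stops the whole run with an AttributeError. A possible way around this, sketched under assumptions: the helper names `text_of` and `parse_board` and the markup in `SAMPLE` are hypothetical, and the records are collected into a list of dicts instead of being printed.

```python
from bs4 import BeautifulSoup


def text_of(parent, tag, css_class, default=""):
    """Return the stripped text of the first matching tag, or `default` if absent."""
    node = parent.find(tag, {"class": css_class})
    return node.get_text(strip=True) if node is not None else default


def parse_board(html):
    """Turn each <dd> entry into a dict; missing fields become empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    films = []
    for dd in soup.find_all("dd"):
        films.append({
            "name": text_of(dd, "p", "name"),
            "star": text_of(dd, "p", "star"),
            "releasetime": text_of(dd, "p", "releasetime"),
            # Integer and fraction parts are joined exactly as in the loop above
            "score": text_of(dd, "i", "integer") + text_of(dd, "i", "fraction"),
        })
    return films


# Hypothetical markup for illustration; the second entry is deliberately incomplete
SAMPLE = """
<dl>
  <dd>
    <p class="name"><a>Film A</a></p>
    <p class="star"> Starring: X </p>
    <p class="releasetime">2021-09-03</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
  </dd>
  <dd>
    <p class="name"><a>Film B</a></p>
  </dd>
</dl>
"""

films = parse_board(SAMPLE)
```

Returning data instead of printing it also makes it easier to save the results to a file later.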

Source: https://blog.csdn.net/weixin_46490924/article/details/122514064