Scraping Weibo Hot Search with Scrapy
Author: Internet
Installation
pip install Scrapy
Create the project
scrapy startproject weiboHotSearch
Create the spider
cd weiboHotSearch
scrapy genspider weibo s.weibo.com
Write the Item
Edit items.py in the weiboHotSearch package and declare the item fields:
import scrapy


class WeibohotsearchItem(scrapy.Item):
    # One item per hot-search entry
    keyword = scrapy.Field()  # text of the search term
    url = scrapy.Field()      # link to the search results for the term
    count = scrapy.Field()    # popularity figure shown next to the term
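Write the spider
- Edit start_urls; note that it must be a list.
- Parse the data with XPath. For the XPath syntax, see https://www.w3school.com.cn/xpath/xpath_syntax.asp
- While working out the expressions, you can run
  scrapy shell "https://s.weibo.com/top/summary"
  to debug the XPath interactively.
- Import the Item and return the data as Item objects.
- Run
  scrapy crawl weibo
  to start the spider.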
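To make the XPath step concrete, here is a minimal sketch that runs a similar query over a tiny hand-written table. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, rather than Scrapy's selectors. The sample markup, keywords, and the helper name extract_rows are invented for illustration, and the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the real page source.
SAMPLE = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr><td>1</td><td><a href="/weibo?q=topic-a">topic A</a><span>12345</span></td></tr>
      <tr><td>2</td><td><a href="/weibo?q=topic-b">topic B</a><span>678</span></td></tr>
    </tbody>
  </table>
</div>
"""


def extract_rows(markup):
    """Collect keyword, href and count from the second cell of each row."""
    root = ET.fromstring(markup.strip())
    rows = []
    # td[2] selects the second <td> child of each <tr>
    for cell in root.findall("./table/tbody/tr/td[2]"):
        link = cell.find("a")
        count = cell.find("span")
        rows.append({
            "keyword": link.text,
            "href": link.get("href"),
            # Some rows may have no <span>; keep None in that case
            "count": count.text if count is not None else None,
        })
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row["keyword"], row["count"], row["href"])
```

In the spider itself the same idea is expressed with response.xpath(...), and scrapy shell is the quickest way to try an expression against the live page.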
The complete code of weibo.py:
import scrapy

from weiboHotSearch.items import WeibohotsearchItem


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary']

    def parse(self, response):
        # Each hot-search entry sits in the second cell of a table row
        for i in response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]'):
            keyword = i.xpath('a/text()').extract_first()
            href = i.xpath('a/@href').extract_first()
            count = i.xpath('span/text()').extract_first()
            if keyword is None or href is None:
                # Skip rows that carry no link
                continue

            item = WeibohotsearchItem()
            item['keyword'] = keyword
            # Turn the relative href into an absolute URL
            item['url'] = response.urljoin(href)
            # count may be None when a row shows no figure
            item['count'] = count
            yield item
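The spider turns each relative href into an absolute URL on s.weibo.com. As a sketch of that step in isolation, the standard library's urllib.parse.urljoin does the joining and also copes with hrefs that are already absolute. The helper name absolute_url and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# The page the hrefs were found on
BASE = "https://s.weibo.com/top/summary"


def absolute_url(href, base=BASE):
    """Return an absolute URL for href, or None when the row has no link."""
    if href is None:
        return None
    return urljoin(base, href)


print(absolute_url("/weibo?q=example"))
print(absolute_url("https://example.com/page"))
print(absolute_url(None))
```

Returning None for a missing href avoids the TypeError that string concatenation would raise on such a row.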
Save the data
- Save with Feed exports
The following command saves the data to items.json:
scrapy crawl weibo -o items.json
cat items.json
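Once the export exists, it can be read back with the standard json module. This is a small sketch, assuming the export is a JSON array of objects carrying the three item fields. The sample records are invented stand-ins for the contents of items.json, and top_keywords is a hypothetical helper.

```python
import json

# Stand-in for the contents of items.json (invented values).
SAMPLE = """[
  {"keyword": "topic A", "url": "https://s.weibo.com/weibo?q=topic-a", "count": "678"},
  {"keyword": "topic B", "url": "https://s.weibo.com/weibo?q=topic-b", "count": "12345"},
  {"keyword": "pinned", "url": "https://s.weibo.com/weibo?q=pinned", "count": null}
]"""


def count_as_int(item):
    """Convert the exported count text to an int; missing or odd values become 0."""
    value = item.get("count")
    if value and value.strip().isdigit():
        return int(value.strip())
    return 0


def top_keywords(text, limit=10):
    """Return keywords ordered from the highest count to the lowest."""
    items = json.loads(text)
    ranked = sorted(items, key=count_as_int, reverse=True)
    return [item["keyword"] for item in ranked[:limit]]


print(top_keywords(SAMPLE))
```

To run it on the real export, read the file with open('items.json', encoding='utf-8') and pass its contents to top_keywords.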
- Save with an Item Pipeline
Write the pipeline
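Edit pipelines.py and add a pipeline class that saves each item:
import csv


class WeibohotsearchPipeline:
    def open_spider(self, spider):
        # newline='' lets the csv module control line endings
        self.f = open('items.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.f, lineterminator='\n')

    def close_spider(self, spider):
        # Close the file so all rows are flushed to disk
        self.f.close()

    def process_item(self, item, spider):
        # count may be None for some rows; write an empty field instead
        self.writer.writerow([item['keyword'], item['count'] or '', item['url']])
        return item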
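A keyword that itself contains a comma would spill into extra columns if the fields were simply joined with ','. The sketch below shows how the standard csv module quotes such a value and writes a missing count as an empty field. The helper to_csv_line and the sample values are made up for illustration.

```python
import csv
import io


def to_csv_line(keyword, count, url):
    """Format one record as a CSV line using the csv module's quoting rules."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # None (a row without a count) is written as an empty field
    writer.writerow([keyword, count, url])
    return buffer.getvalue()


line = to_csv_line("a,b", None, "https://s.weibo.com/weibo?q=x")
print(line, end="")

# Reading the line back recovers the three original fields
fields = next(csv.reader(io.StringIO(line)))
print(fields)
```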
Enable the item pipeline
Add the following to settings.py to enable the pipeline:
ITEM_PIPELINES = {
    'weiboHotSearch.pipelines.WeibohotsearchPipeline': 300,
}
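The number assigned to each pipeline sets its position: Scrapy passes items through the pipelines from the lowest value to the highest, and the values are conventionally chosen in the 0 to 1000 range. The sketch below illustrates that ordering with plain Python. DropEmptyPipeline is a hypothetical second pipeline, not part of this project, and run_order is an illustrative helper rather than anything Scrapy provides.

```python
ITEM_PIPELINES = {
    'weiboHotSearch.pipelines.WeibohotsearchPipeline': 300,
    # Hypothetical extra pipeline, listed only to show ordering
    'weiboHotSearch.pipelines.DropEmptyPipeline': 100,
}


def run_order(pipelines):
    """Return the pipeline paths sorted by ascending priority value."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _priority in ordered]


order = run_order(ITEM_PIPELINES)
for path in order:
    print(path)
```

With these values an item would reach the hypothetical DropEmptyPipeline first and the CSV-writing pipeline second.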
Run
scrapy crawl weibo
cat items.csv
Project repository
https://gitee.com/yu-se/scrapy-test
Reference documentation
https://scrapy-chs.readthedocs.io/zh_CN/latest/
Source: https://www.cnblogs.com/lzyuid/p/12403151.html