
Crawling Weibo hot searches with Scrapy


Installation

pip install Scrapy
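
To confirm the installation succeeded, you can check the Scrapy CLI version:

scrapy version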

Create the project

scrapy startproject weiboHotSearch

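The command generates the standard Scrapy project skeleton (layout as produced by recent Scrapy releases; minor details may vary by version):

weiboHotSearch/
├── scrapy.cfg            # deploy configuration
└── weiboHotSearch/       # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spiders live here
        └── __init__.py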

Create the spider

cd weiboHotSearch
scrapy genspider weibo s.weibo.com

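genspider creates spiders/weibo.py with a skeleton along these lines (the exact template text varies by Scrapy version):

import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['http://s.weibo.com/']

    def parse(self, response):
        pass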

Define the Item

Edit items.py in the weiboHotSearch module and add the item's fields:

import scrapy


class WeibohotsearchItem(scrapy.Item):
    keyword = scrapy.Field()  # hot search keyword
    url = scrapy.Field()      # link to the search results page
    count = scrapy.Field()    # heat (search count)

Write the spider

  1. Modify start_urls; note that it must be a list.

  2. Parse the data with XPath.

    For XPath syntax, see https://www.w3school.com.cn/xpath/xpath_syntax.asp

    While working out the expressions, you can run scrapy shell "https://s.weibo.com/top/summary" to test XPath interactively (a sample session appears after this list).


  3. Import the Item and yield the data as Item objects.

  4. Run scrapy crawl weibo to start the spider.

    When it runs, the log prints each keyword and count and shows the items being scraped.
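As mentioned in step 2, scrapy shell is handy for testing XPath expressions before putting them in the spider. A sample session (the returned values are illustrative; the hot search list changes constantly):

$ scrapy shell "https://s.weibo.com/top/summary"
>>> rows = response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]')
>>> rows[0].xpath('a/text()').extract_first()
'example hot search keyword'
>>> rows[0].xpath('span/text()').extract_first()
'1234567'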

    The complete code of weibo.py:

import scrapy

from weiboHotSearch.items import WeibohotsearchItem


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary']

    def parse(self, response):
        # td[2] of each table row holds the keyword link and the heat count
        for i in response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]'):
            keyword = i.xpath('a/text()').extract_first()
            url = 'https://s.weibo.com' + i.xpath('a/@href').extract_first()
            count = i.xpath('span/text()').extract_first()
            print(keyword)  # debug output
            print(count)    # debug output
            item = WeibohotsearchItem()
            item['keyword'] = keyword
            item['url'] = url
            item['count'] = count
            yield item
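
Every yielded item passes through the item pipelines enabled in settings.py. As a minimal sketch (this pipeline is my illustration, not part of the original project), pipelines.py could normalize the fields:

class WeibohotsearchPipeline:
    def process_item(self, item, spider):
        # Strip stray whitespace; pinned or ad rows may lack a count
        item['keyword'] = (item['keyword'] or '').strip()
        item['count'] = (item['count'] or '').strip()
        return item

Enable it in settings.py with ITEM_PIPELINES = {'weiboHotSearch.pipelines.WeibohotsearchPipeline': 300}.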

Save the data

Run the following command to export the scraped items to items.json:

scrapy crawl weibo -o items.json
cat items.json

The data looks like this:

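(Illustrative excerpt; the values are made up, since real entries change constantly. Note that by default Scrapy escapes non-ASCII characters in JSON output; set FEED_EXPORT_ENCODING = 'utf-8' in settings.py to keep Chinese text readable.)

[
{"keyword": "example hot search", "url": "https://s.weibo.com/weibo?q=%23...%23", "count": "1234567"},
{"keyword": "another hot search", "url": "https://s.weibo.com/weibo?q=%23...%23", "count": "987654"}
]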

Project repository

https://gitee.com/yu-se/scrapy-test

Reference documentation

https://scrapy-chs.readthedocs.io/zh_CN/latest/

Source: https://www.cnblogs.com/lzyuid/p/12403151.html