
Crawling Weibo hot searches with Scrapy


Installation

pip install Scrapy
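
To confirm the installation succeeded, you can check the Scrapy CLI version:

scrapy version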

Create the project

scrapy startproject weiboHotSearch

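The command generates the standard Scrapy project skeleton (layout as produced by recent Scrapy releases; minor details may vary by version):

weiboHotSearch/
├── scrapy.cfg            # deploy configuration
└── weiboHotSearch/       # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spiders live here
        └── __init__.py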

Create the spider

cd weiboHotSearch
scrapy genspider weibo s.weibo.com

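genspider creates spiders/weibo.py with a skeleton along these lines (the exact template text varies by Scrapy version):

import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['http://s.weibo.com/']

    def parse(self, response):
        pass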

Define the Item

Edit items.py in the weiboHotSearch module and add the item's fields:

import scrapy


class WeibohotsearchItem(scrapy.Item):
    keyword = scrapy.Field()  # hot search keyword
    url = scrapy.Field()      # link to the search results page
    count = scrapy.Field()    # heat (search count)

Write the spider

  1. Modify start_urls; note that it must be a list.

  2. Parse the data with XPath.

    For XPath syntax, see https://www.w3school.com.cn/xpath/xpath_syntax.asp

    While working out the expressions, you can run scrapy shell "https://s.weibo.com/top/summary" to test XPath interactively (a sample session appears after this list).


  3. Import the Item and yield the data as Item objects.

  4. Run scrapy crawl weibo to start the spider.

    When it runs, the log prints each keyword and count and shows the items being scraped.
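As mentioned in step 2, scrapy shell is handy for testing XPath expressions before putting them in the spider. A sample session (the returned values are illustrative; the hot search list changes constantly):

$ scrapy shell "https://s.weibo.com/top/summary"
>>> rows = response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]')
>>> rows[0].xpath('a/text()').extract_first()
'example hot search keyword'
>>> rows[0].xpath('span/text()').extract_first()
'1234567'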

    The complete code of weibo.py:

import scrapy

from weiboHotSearch.items import WeibohotsearchItem


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary']

    def parse(self, response):
        # td[2] of each table row holds the keyword link and the heat count
        for i in response.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]'):
            keyword = i.xpath('a/text()').extract_first()
            url = 'https://s.weibo.com' + i.xpath('a/@href').extract_first()
            count = i.xpath('span/text()').extract_first()
            print(keyword)  # debug output
            print(count)    # debug output
            item = WeibohotsearchItem()
            item['keyword'] = keyword
            item['url'] = url
            item['count'] = count
            yield item
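
Every yielded item passes through the item pipelines enabled in settings.py. As a minimal sketch (this pipeline is my illustration, not part of the original project), pipelines.py could normalize the fields:

class WeibohotsearchPipeline:
    def process_item(self, item, spider):
        # Strip stray whitespace; pinned or ad rows may lack a count
        item['keyword'] = (item['keyword'] or '').strip()
        item['count'] = (item['count'] or '').strip()
        return item

Enable it in settings.py with ITEM_PIPELINES = {'weiboHotSearch.pipelines.WeibohotsearchPipeline': 300}.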

Save the data

Run the following command to export the scraped items to items.json:

scrapy crawl weibo -o items.json
cat items.json

The data looks like this:

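(Illustrative excerpt; the values are made up, since real entries change constantly. Note that by default Scrapy escapes non-ASCII characters in JSON output; set FEED_EXPORT_ENCODING = 'utf-8' in settings.py to keep Chinese text readable.)

[
{"keyword": "example hot search", "url": "https://s.weibo.com/weibo?q=%23...%23", "count": "1234567"},
{"keyword": "another hot search", "url": "https://s.weibo.com/weibo?q=%23...%23", "count": "987654"}
]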

Project repository

https://gitee.com/yu-se/scrapy-test

Reference documentation

https://scrapy-chs.readthedocs.io/zh_CN/latest/

Source: https://www.cnblogs.com/lzyuid/p/12403151.html