
Scrapy crawler: Sina Weibo


1. Create the project

Open a cmd window in the target directory and run the command that creates a Scrapy project:

scrapy startproject scrapy_xinlangweibo
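
For reference, the command generates roughly the following layout (file names can vary slightly between Scrapy versions):

scrapy_xinlangweibo/
    scrapy.cfg
    scrapy_xinlangweibo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py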


2. Create the spider file

Change into the spiders directory and run the command that generates a spider:

scrapy genspider weibo www.weibo.com
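
For reference, the generated weibo.py starts out as a skeleton roughly like this (the exact template depends on the Scrapy version):

import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['www.weibo.com']
    start_urls = ['http://www.weibo.com/']

    def parse(self, response):
        pass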


3. Adjust the robots.txt setting

Modify this in settings.py.
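
The relevant switch is ROBOTSTXT_OBEY, which the generated settings.py sets to True; a one-line change stops Scrapy from honoring the site's robots.txt:

# settings.py
ROBOTSTXT_OBEY = False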


4. Test run

 

scrapy crawl weibo


The test passes, so let's keep writing code!

5. XPath parsing

Sometimes the XPath browser plugin can parse the page, yet the same expression returns nothing in Python.

Don't panic when this happens: it means the site being requested has anti-scraping measures. We need to set a User-Agent first, and if that is not enough, add a cookie as well. It is that kind of back-and-forth game.

Configure this in settings.py.

First set the User-Agent:

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}

Run it again and check the result.

 

Still no luck. Judging by the prompt in the response, a login is required, which means the crawler needs a cookie. Log in to Weibo, grab the cookie, and put it into the request headers.

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'cookie': 'SUB=_2A25PiFDGDeRhGeRH7lYR9C_LzTmIHXVtc3COrDV8PUJbkNAKLRH2kW1NTbHM8wP7UPnTjvTAmox62rVYXbj0cyiW; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WFFUpSiA2MzUQrY0aSxSJlG5JpX5oz75NHD95QE1K-XehBpS0qfWs4Dqcj1i--Xi-iFiKnpehnp9sMt; SINAGLOBAL=4669199551025.64.1653350613001; PC_TOKEN=87ef8d1632; _s_tentry=weibo.com; Apache=5712738189316.748.1658709963128; ULV=1658709963132:2:1:1:5712738189316.748.1658709963128:1653350613062',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}

Everyone's cookie is different, so replace it with your own and run again.

Barring surprises, the data should come through this time.

 

When you see that big wall of text in the output, you have taken another small step toward success. Now check the parsed test data.

 

The blogger's name is captured! Next comes extracting the rest of the fields, which is all XPath syntax. I won't go over that here; readers who are new to it can follow my earlier articles step by step. Here is the finished parsing code:


import scrapy
from datetime import datetime
from scrapy_xinlangweibo.items import ScrapyXinlangweiboItem


class WeiboSpider(scrapy.Spider):
    # spider name, used when running the crawler (command: scrapy crawl weibo)
    name = 'weibo'
    # domains the spider is allowed to visit
    allowed_domains = ['www.weibo.com']
    base_url = 'https://s.weibo.com/weibo?q={}'.format('农业')  # search keyword ("agriculture")
    # start URL, i.e. the first page to request: 'https://s.weibo.com/weibo?q=农业'
    start_urls = [base_url]
    page = 1

    def parse(self, response):
        print('~~~~~~~~~~ the spider starts crawling ~~~~~~~~~~')
        # UA anti-scraping check
        # print(response.request.headers['User-Agent'])
        # cookie anti-scraping check
        # print(response.request.headers['cookie'])
        # raw response check
        # print(response.text)

        # container element of each Weibo post
        weibo_list = response.xpath('//div[@action-type="feed_list_item" and contains(@mid,\'47951\')]')
        for wb in weibo_list:
            # blogger name
            name = wb.xpath('.//div[@class="card-feed"]/div[@class="content"]/p[@node-type="feed_list_content"]/@nick-name').extract_first()
            # publish time
            time = wb.xpath('.//div[@class="card-feed"]/div[@class="content"]/p[@class="from"]/a[@target="_blank"]/text()').extract_first()
            # source (client)
            source = wb.xpath('.//div[@class="card-feed"]/div[@class="content"]/p[@class="from"]/a[@rel="nofollow"]/text()').extract_first()
            # post text
            txtExtract = wb.xpath('.//div[@class="card-feed"]/div[@class="content"]/p[3]/text()').extract()
            txt = ''
            # concatenate the post text fragments
            for string in txtExtract:
                txt += string.replace('\n', '').strip()  # strip newlines and spaces
            # print('post text >>> ' + txt)
            # forwards; the placeholder text '转发' means zero forwards
            forward = wb.xpath('.//div[@class="card-act"]/ul/li/a[@action-type="feed_list_forward"]').extract_first().split('</span>')[1].replace('</a>', '').replace('转发', '0')
            print('forward>>>' + forward)

            # comments; the placeholder text '评论' means zero comments
            comment = wb.xpath('.//div[@class="card-act"]//a[@action-type="feed_list_comment"]').extract_first().split('</span>')[1].replace('</a>', '').replace('评论', '0')
            # likes; the placeholder text '赞' means zero likes
            fabulous = wb.xpath('.//div[@class="card-act"]//a[@action-type="feed_list_like"]/button/span[2]/text()').extract_first().replace('赞', '0')
            # crawl timestamp
            createTime = datetime.now()

            # hand the item over to the pipeline
            wb = ScrapyXinlangweiboItem(name=name, time=time, source=source, txt=txt, forward=forward,
                                        comment=comment, fabulous=fabulous, createTime=createTime)
            yield wb

        # the scraping logic is the same for every page, so just request the next
        # page and reuse parse as the callback
        if self.page < 20:
            self.page = self.page + 1
            url = self.base_url + '&page=' + str(self.page)
            print('next page url >>> ' + url)
            # scrapy.Request issues a GET request; parse handles the response again
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
 

The XPath expressions here take several rounds of trial and error to work out; that is where most of the time goes (the shell session sketched below helps).
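
A quick way to iterate on them is Scrapy's interactive shell, run from inside the project directory so the headers configured in settings.py are picked up. A minimal sketch, reusing the search URL and expressions from the spider above:

scrapy shell "https://s.weibo.com/weibo?q=农业"
>>> response.xpath('//div[@action-type="feed_list_item"]').extract_first()
>>> response.xpath('//div[@class="card-feed"]//p[@node-type="feed_list_content"]/@nick-name').extract()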

6. Enable the item pipeline

Uncomment this in settings.py:

ITEM_PIPELINES = {
   'scrapy_xinlangweibo.pipelines.ScrapyXinlangweiboPipeline': 300,
}
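
Since step 10 below adds a MysqlPipeline to pipelines.py, it has to be registered here as well or it will never run. A minimal sketch (the priority 301 is an arbitrary choice; lower values run first):

ITEM_PIPELINES = {
   'scrapy_xinlangweibo.pipelines.ScrapyXinlangweiboPipeline': 300,
   # MysqlPipeline from step 10
   'scrapy_xinlangweibo.pipelines.MysqlPipeline': 301,
}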

 

7. Define the item data structure for persistence

Define the data structure in items.py. The field names have to match the keyword arguments the spider uses when it builds the item:

# Sina Weibo item definition
class ScrapyXinlangweiboItem(scrapy.Item):
    # define the fields for your item here like:
    # blogger name
    name = scrapy.Field()
    # publish time
    time = scrapy.Field()
    # source (client)
    source = scrapy.Field()
    # post text
    txt = scrapy.Field()
    # forwards
    forward = scrapy.Field()
    # comments
    comment = scrapy.Field()
    # likes
    fabulous = scrapy.Field()
    # crawl timestamp
    createTime = scrapy.Field()

 

8. Submit the parsed data to the item structure via the pipeline

The item definition has to be imported in the spider:

from scrapy_xinlangweibo.items import ScrapyXinlangweiboItem

9. Configure the database

Add the database connection settings for persistence in settings.py:

# database settings; the setting names must be uppercase
DB_HOST = 'ip'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = 'password'
DB_NAME = 'database'
# write utf8 without the dash; 'utf-8' raises an error here
DB_CHARSET = 'utf8'

10. Write the database-insert logic in the pipeline

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyXinlangweiboPipeline:
    def process_item(self, item, spider):
        return item

# load the project settings
from scrapy.utils.project import get_project_settings
# import pymysql
import pymysql

# MySQL persistence pipeline
class MysqlPipeline:
    def open_spider(self, spider):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.database = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(
                            host=self.host,
                            port=self.port,
                            user=self.user,
                            password=self.password,
                            db=self.database,
                            charset=self.charset
        )

        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # build the insert statement
        sql = 'insert into xinlangweibo(name,time,source,txt,forward,comment,fabulous,createTime) values("{}","{}","{}","{}","{}","{}","{}","{}")'.format(
            item['name'], item['time'].strip(), item['source'], item['txt'], item['forward'], item['comment'], item['fabulous'], item['createTime'])
        # execute the SQL statement
        self.cursor.execute(sql)
        # commit
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
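
The pipeline assumes a table named xinlangweibo already exists. Below is a minimal one-off sketch for creating it with pymysql; the column names come from the INSERT statement above, while the column types and connection values are assumptions to adapt to your own setup:

# create_table.py -- one-off helper; column types and connection values are
# placeholders, replace them with your own settings
import pymysql

conn = pymysql.connect(host='ip', port=3306, user='root',
                       password='password', db='database', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS xinlangweibo (
                id INT AUTO_INCREMENT PRIMARY KEY,
                `name` VARCHAR(255),
                `time` VARCHAR(255),
                `source` VARCHAR(255),
                `txt` TEXT,
                `forward` VARCHAR(32),
                `comment` VARCHAR(32),
                `fabulous` VARCHAR(32),
                `createTime` DATETIME
            ) DEFAULT CHARSET = utf8
        """)
    conn.commit()
finally:
    conn.close()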

 

Source: https://www.cnblogs.com/ckfuture/p/16517956.html