
【云云怪】Project 6: Scraping Baidu News

2021-05-24 12:58:52 Author: Internet

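(A note up front: when I created this project, Baidu's robots.txt only disallowed taobao, so my scraper was within the rules. Baidu has since changed its robots.txt, so this post does not include the complete code.)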

【Project Preview】

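【Background】

After learning web scraping, I first put together a program that scrapes Toutiao (今日头条). Then my husband said that Toutiao is an upstart and too lowbrow, while Baidu is the old aristocracy of the Chinese internet, so I should write a scraper for Baidu news instead.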

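【Analysis】

1. Which page to scrape? Opening Baidu and searching for a keyword lands on the "网页" (Web) tab. Its results mix encyclopedia entries, news, ads, Tieba posts, music, and more, which makes it a poor target for a news scraper. So I chose to scrape the "资讯" (News) tab directly.

2. Freshness: a news search usually cares about recency; for example, I may only want news from the past day. Baidu News shows each item's publication time, so the age of an item can be computed with datetime.

3. Quality: in the Toutiao project I also built a comment-count filter that drops news with very few comments (which, to me, signals filler) so that only high-quality picks remain. Baidu News does not display comment counts in a usable way, so I had to drop this feature for now.

4. Removing duplicates: only after scraping once did I realize how many duplicates Baidu returns. News sites copy one another wholesale, and some do not even bother to change the title. So I set up a "title pool": each item's title is checked against the pool first, and the item is displayed and stored only if the title is not already there.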
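
The "title pool" idea from point 4 can be tried out on its own. The following is only a minimal sketch, not the project's actual code: the function name `dedupe_by_title` and the sample records are made up for illustration, and it keeps the pool in a `set` (an assumption chosen for fast membership checks) rather than a list.

```python
def dedupe_by_title(items):
    """Return the items whose title has not been seen before, in original order."""
    seen_titles = set()   # the "title pool"
    unique = []
    for item in items:
        title = item["title"].strip()
        if title in seen_titles:
            continue      # same headline already accepted, skip the copy
        seen_titles.add(title)
        unique.append(item)
    return unique


# Hypothetical sample data: the third record repeats the first headline.
sample = [
    {"title": "A", "source": "site1"},
    {"title": "B", "source": "site2"},
    {"title": "A", "source": "site3"},
]
print([item["source"] for item in dedupe_by_title(sample)])
```

Exact title matching only catches copies that reuse the headline verbatim; reposts with a reworded title would still get through.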

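【Takeaways】

My strongest impression is that Toutiao is not so lowbrow after all. Compared with Baidu, Toutiao's news is fewer but better curated, and it shows comment counts, which makes it easy to tell which stories are hot. Baidu search is broad and comprehensive, but the sheer volume of information makes it hard to find what you actually want to read.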

【Incomplete Code】 (fetch and parse statements omitted)

from openpyxl import Workbook,load_workbook
from bs4 import BeautifulSoup
import requests,datetime,time

def daydiff(day1, day2):
    # Absolute number of calendar days between two 'YYYY-MM-DD' strings.
    # Subtracting date objects avoids the off-by-one that dividing
    # timestamp differences can produce across a daylight-saving change.
    date1 = datetime.datetime.strptime(day1, "%Y-%m-%d").date()
    date2 = datetime.datetime.strptime(day2, "%Y-%m-%d").date()
    return abs((date2 - date1).days)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
url='https://www.baidu.com/s'

print('\nThis program scrapes Baidu News (the 资讯 tab, not the 网页 tab) and saves the news titles and links to a file.\n')

while True:
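    key=input('\nEnter search keywords (separate multiple keywords with spaces): ')
    days=int(input('\nEnter the maximum news age in days (e.g. 0 for today only): '))
    item=int(input('\nEnter the number of results to search (10-1000 supported): '))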

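    filename=key+'-百度资讯.xlsx'
    wb=Workbook()
    ws=wb.active
    # Header row: title, source site, publication time, link, summary
    ws.append(['Title','Source','Published','Link','Summary'])
    titlepoor=[]   # "title pool": titles already accepted, used to skip duplicates
    count=0        # items that passed the filters
    count0=0       # items examined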

    for x in range(0,item,10):
        page=x//10
        params={
            'rtt': 1,
            'bsst': 1,
            'cl': 2,
            'tn': 'news',
            'ie':'utf-8',
            'wd': key,
            'pn':x,
            }
        try:
            # Fetch and parse statements (omitted)
            res=xxxxxx
            soup=xxxxxx

            for i in range(1,11):
                count0+=1
                # Each result is looked up by its overall rank (x+1 .. x+10), used here as the element id
                contan=soup.find(id='content_left')
                idd=contan.find(id=str(i+x))
                titletag=idd.find('h3')
                href=titletag.find('a')['href']
                title=titletag.find('a').text
                source=idd.find(class_="news-source").find(class_="c-color-gray").text.strip()
                time0=idd.find(class_="news-source").find(class_="c-color-gray2").text.strip()
                zhaiyao=idd.find(class_="c-color-text").text.strip()
                
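                # Convert the displayed time label into an age in days
                if '分钟' in time0 or '小时' in time0:   # "N minutes ago" / "N hours ago"
                    day=0
                elif '昨天' in time0:                    # "yesterday"
                    day=1
                elif '前天' in time0:                    # "the day before yesterday"
                    day=2
                elif '天前' in time0:                    # "N days ago"
                    day=int(time0[:-2])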
                else:
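                    try:
                        today=str(datetime.datetime.now())[:10]
                        kuaizhao=idd.find('div',class_="c-span-last")
                        kuaizhaol=kuaizhao.find('a')['href']

                        # Fetch the Baidu snapshot (快照) page and read the publication time from it (omitted)
                        res1=xxxxxxxx
                        res1.encoding=xxxxxx
                        html1=xxxxxx
                        soup1=xxxxxxx

                        # Slice the date out of the snapshot banner text and normalise it to YYYY-MM-DD
                        span=soup1.find(id="bd_snap_txt").find_all('span')
                        timee=(span[1].text)[12:24]
                        timeee=timee[:4]+'-'+timee[5:7]+'-'+timee[8:10]

                        day=daydiff(today,timeee)
                    except Exception:
                        # No usable date: treat the item as 101 days old so it is normally filtered out
                        day=101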
                    
                if days>=day and title not in titlepoor:
                    titlepoor.append(title)
                    count+=1        
                    ws.append([title,source,time0,href,zhaiyao])

                    print('\n'+title)
                    print(time0)

        except Exception:
            print('\nReached page {}; there are no more results.'.format(page))
            break

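    if count>0:
        save=input('\nSearched {} news items and found {} that match. Save to disk? Press 1 to save, any other key to discard: '.format(count0,count))
        if save=='1':
            # Save first, so the confirmation is only printed after the file is written
            wb.save(filename)
            print('\nSaved.')
    else:
        print('\nOops, no matching news found. Please adjust the search criteria and try again.')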
        
    print('\n'+'-'*70)
    again=input('\nPress Enter to search again, or 0 to exit: ')
    if again=='0':
        input('\nThanks for using this program. Goodbye.')
        break
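
The chain of checks on `time0` in the listing above can also be pulled out into a small function, which makes the freshness rule easy to test without touching the network. This is a hedged sketch rather than the author's code: `age_in_days` is a hypothetical name, the label formats are assumed from the strings the listing checks for, and it returns `None` for anything else so the caller can fall back to the snapshot lookup.

```python
import re


def age_in_days(label):
    """Map a relative time label to an age in days, or None if it is not relative."""
    if "分钟" in label or "小时" in label:      # "N minutes ago" / "N hours ago"
        return 0
    if "昨天" in label:                         # "yesterday"
        return 1
    if "前天" in label:                         # "the day before yesterday"
        return 2
    match = re.search(r"(\d+)\s*天前", label)   # "N days ago"
    if match:
        return int(match.group(1))
    return None                                 # e.g. an absolute date


# Assumed example labels; the real page may format them differently.
for label in ["5分钟前", "3小时前", "昨天", "前天", "6天前", "2021年5月1日"]:
    print(label, age_in_days(label))
```

Using a regular expression for the "N days ago" case, instead of slicing off the last two characters, keeps the parse working if the label carries extra text around the number.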

Source: https://blog.csdn.net/weixin_57719910/article/details/117221157