
Scraping Nowcoder problems along with their problem IDs and related information

2020-05-30 21:54:44 Author: Internet

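This example is much like the previous one. Start at the problem list and look at the link: from one page to the next, only the page variable in the URL changes. It is 1 for the first page, 2 for the second, and so on, so we can fetch any page by changing the value after page. Looking at the content of each page, all of the problem information sits inside td tags. We can therefore find every td tag, take out the strings inside it, and discard the empty ones. Every five non-empty cells then correspond to one problem, so we process them in groups of five to get the information for each problem.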
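As a minimal sketch of the paging idea, the URL for a given page can be assembled from its query parameters instead of by string concatenation. The parameter names and values are copied from the list URL used in this article; the names `LIST_URL` and `build_page_url` are hypothetical helpers introduced only for illustration.

```python
from urllib.parse import urlencode

# Path of the problem list, taken from the URL used in this article.
LIST_URL = 'https://ac.nowcoder.com/acm/problem/list'


def build_page_url(page, page_size=50):
    """Return the list URL for one page (hypothetical helper)."""
    params = {
        'keyword': '',
        'tagId': 0,
        'platformTagId': 0,
        'sourceTagId': 0,
        'difficulty': 0,
        'status': 'all',
        'order': 'id',
        'asc': 'true',
        'pageSize': page_size,
        'page': page,  # the only value that changes between pages
    }
    return LIST_URL + '?' + urlencode(params)


print(build_page_url(1))
print(build_page_url(2))
```

Keeping the parameters in a dict makes it easy to see which one varies, and urlencode takes care of escaping if a keyword is ever filled in.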

The full code is as follows:

import requests
from bs4 import BeautifulSoup

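def getHTMLText(url):
    """Fetch a page and return its text, or '' if the request fails."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat HTTP error codes as failures
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException:
        return ''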

def parseHtml(html):
    """Group the text of the non-empty td cells into one list per problem."""
    soup = BeautifulSoup(html, 'html.parser')
    infoList = []
    info = []
    cnt = 0  # number of non-empty cells collected for the current problem
    for td in soup.find_all('td'):
        # Keep only the non-blank lines of text inside this cell.
        temp = [text.strip() for text in td.get_text().split('\n') if text.strip()]
        if not temp:
            continue
        cnt += 1
        info.extend(temp)
        # Every five non-empty cells make up one problem.
        if cnt == 5:
            cnt = 0
            infoList.append(info)
            info = []
    return infoList

def main():
    base_url = 'https://ac.nowcoder.com/acm/problem/list?keyword=&tagId=0&platformTagId=0&sourceTagId=0&difficulty=0&status=all&order=id&asc=true&pageSize=50&page='
    infoList = []
    # Fetch pages 1 to 50 and collect the problems found on each one.
    for i in range(1, 51):
        url = base_url + str(i)
        html = getHTMLText(url)
        infoList += parseHtml(html)
    for info in infoList:
        print(info)

# Run the scraper only when this file is executed directly.
if __name__ == '__main__':
    main()
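The grouping step can also be looked at on its own, without any network access. The sketch below is an alternative formulation rather than the code above: it assumes the cell texts have already been flattened into one list, and it splits that list into consecutive groups of five, dropping an incomplete tail just as the counter-based loop does. The function name `chunk` and the sample strings are made up for illustration.

```python
def chunk(items, size=5):
    """Split a flat list into consecutive groups of `size` items.

    A trailing group with fewer than `size` items is dropped.
    """
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]


# Made-up strings standing in for the text taken from the td tags.
cells = ['a1', 'a2', 'a3', 'a4', 'a5',
         'b1', 'b2', 'b3', 'b4', 'b5',
         'c1']

for row in chunk(cells):
    print(row)
```

Here the eleventh string is left out because it does not complete a group of five.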

Source: https://www.cnblogs.com/HighLights/p/12995175.html