How can I scrape more efficiently with urllib2?
Author: Internet
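Newbie here. I wrote a simple script with urllib2 that walks through Billboard.com and scrapes the top song and artist for every week from 1958 to 2013. The problem is that it is very slow: it could take several hours to finish.
I would like to know where the bottleneck is, whether there is a way to scrape more efficiently with urllib2, or whether I need a more sophisticated tool.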
import re
import urllib2

array = []
url = 'http://www.billboard.com/charts/1958-08-09/hot-100'
date = ""
# Follow the "Next" link week by week until the last chart date is reached.
while date != '2013-07-13':
    response = urllib2.urlopen(url)
    htmlText = response.read()
    date = re.findall(r'\d\d\d\d-\d\d-\d\d', url)[0]
    song = re.findall('<h1>.*</h1>', htmlText)[0]
    song = song[4:-5]  # strip the surrounding <h1> and </h1>
    artist = re.findall('/artist.*</a>', htmlText)[1]
    artist = re.findall('>.*<', artist)[0]
    artist = artist[1:-1]  # strip the surrounding > and <
    nextWeek = re.findall('href.*>Next', htmlText)[0]
    nextWeek = nextWeek[5:-5]
    array.append([date, song, artist])
    url = 'http://www.billboard.com' + nextWeek
print array
Solution:
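Here is a solution that uses Scrapy. Take a look at the overview and you will see that it is a tool designed for exactly this kind of task:
>Fast (built on Twisted)
>Easy to use and understand
>Built-in extraction mechanism based on XPath (although you can also use bs or lxml)
>Built-in support for pipelining extracted items to a database, XML, JSON and so on
>And many more features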
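To make the pipeline point above a little more concrete: as far as I know, a Scrapy item pipeline is just a class with a process_item(item, spider) method that receives each scraped item and returns it. The sketch below is only an illustration, not part of the original answer. The class name StripWhitespacePipeline is hypothetical, it is written for Python 3, and it would still have to be enabled through Scrapy's ITEM_PIPELINES setting.

```python
class StripWhitespacePipeline(object):
    """Hypothetical item pipeline: trims whitespace from every string field."""

    def process_item(self, item, spider):
        # Called once per scraped item; the returned item is passed on
        # to the next pipeline stage (or to the exporter).
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, str):
                item[key] = value.strip()
        return item
```

Because the class does not import Scrapy, it can be tried on a plain dict before wiring it into a project.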
Here is a working spider that extracts everything you asked about (it ran for 15 minutes on my rather old laptop):
import datetime

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self, *args, **kwargs):
        super(BillBoardSpider, self).__init__(*args, **kwargs)
        # Build one start URL per weekly chart, from 1958-08-09 until the year reaches 2013.
        date = datetime.date(year=1958, month=8, day=9)
        self.start_urls = []
        while date.year < 2013:
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]
        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # Skip chart entries that lack a song title or an artist link.
                continue
            yield item
Save it to billboard.py and run it with scrapy runspider billboard.py -o output.json. Then, in output.json, you will see:
...
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"}
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"}
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
...
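One detail worth noting: the spider's __init__ stops adding URLs once the date reaches the year 2013, while the question asks for charts up to 2013-07-13. The sketch below shows one way the weekly URLs could be generated for the full range. The helper name weekly_chart_urls is hypothetical, and the URL pattern is the same one the spider uses. Since every URL is known up front, none of the requests has to wait for a "Next" link from the previous page, which is what lets a crawler fetch them concurrently.

```python
import datetime

BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


def weekly_chart_urls(first, last):
    """Return one chart URL per week from `first` through `last`, inclusive."""
    urls = []
    day = first
    while day <= last:
        urls.append(BASE_URL % day.strftime("%Y-%m-%d"))
        day += datetime.timedelta(days=7)
    return urls


# Date range taken from the question: first chart URL through the final date.
START_URLS = weekly_chart_urls(datetime.date(1958, 8, 9),
                               datetime.date(2013, 7, 13))
```

In the spider, self.start_urls could then be set to this list instead of being built in the loop.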
Also, take a look at grequests as an alternative tool.
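The underlying idea is to issue many requests at once instead of waiting for each page in turn, since the time is spent mostly on network latency. The sketch below is not grequests; it uses a thread pool from the standard library to illustrate the same idea, and it assumes Python 3, where urllib2 became urllib.request. The names fetch_html and fetch_all are hypothetical, and the worker count of 10 is an arbitrary assumption, not a tuned value.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_html(url):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_html, max_workers=10):
    """Fetch every URL concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch in worker threads while preserving
        # the order of the input URLs in its results.
        return list(pool.map(fetch, urls))
```

This only works when the list of URLs is known before any page is downloaded. The fetch parameter is there so the function can be exercised with a stand-in instead of real network calls.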
Hope that helps.
Tags: web-scraping, html-parsing, html, python, regex Source: https://codeday.me/bug/20191123/2064432.html