Python - avoiding Bad Request errors caused by relative URLs
I am trying to scrape a site with Scrapy, and the URLs of the pages I want to scrape are all written with relative paths like this:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
Now, in my browser these links work, and you land on URLs like https://www.domain-name.com/en/item-to-scrap.html (even though the relative path climbs two levels up the hierarchy instead of one).
But my CrawlSpider does not manage to turn these links into "correct" URLs, and all I get are errors like this:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, which is fairly basic (based on item URLs matching "/en/item-*-scrap.html"):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product
Solution:
Basically, Scrapy uses urljoin (http://docs.python.org/2/library/urlparse.html#urlparse.urljoin) to obtain the next URL, joining the URL of the page it is currently scraping with the link's URL. If you join the URLs you gave as an example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>
the resulting URL is the same one mentioned in the Scrapy error. Try it in a Python shell:
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html", "../../en/item-to-scrap.html")
# 'https://www.domain-name.com/../en/item-to-scrap.html'  <- the URL from the error log
The urljoin behaviour seems to be valid. See: http://tools.ietf.org/html/rfc1808.html#section-5.2
If possible, could you share the site you are crawling?
With this understanding, the possible solutions are:
1) Manipulate the URLs (remove the two dots and the slash) that the crawl spider generates, basically by overriding parse or _requests_to_follow; see the sketch after the link below.
CrawlSpider source: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
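A rough sketch of option 1, assuming the Scrapy 0.16-era CrawlSpider whose internal _requests_to_follow method builds the follow-up requests; the regex that collapses the leftover "../" segments is my own addition and assumes urljoin has already resolved every other dot segment:

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('/en/item-[a-z0-9-]+-scrap\.html',)),
             callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('',)), follow=True),
    )

    def _requests_to_follow(self, response):
        # Let CrawlSpider build the follow-up requests as usual, then collapse
        # any '/../' left directly after the domain; urljoin only leaves these
        # when a link climbs above the site root, which this server rejects.
        for request in super(siteSpider, self)._requests_to_follow(response):
            fixed = re.sub(r'^(https?://[^/]+)(/\.\.)+/', r'\1/', request.url)
            if fixed != request.url:
                request = request.replace(url=fixed)
            yield request

    def parse_item(self, response):
        # ... same Product extraction as in the question ...
        pass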
2) Manipulate the URL in a downloader middleware, which may be cleaner: you can strip the ../ in the middleware's process_request; see the sketch after the link below.
Downloader middleware documentation: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
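A rough sketch of option 2. The middleware class name and the regex below are hypothetical; only the process_request hook and the DOWNLOADER_MIDDLEWARES setting come from the Scrapy documentation linked above:

import re


class FixDotSegmentsMiddleware(object):
    """Hypothetical downloader middleware: rewrites request URLs that still
    contain '/../' right after the domain, which this server answers with 400."""

    def process_request(self, request, spider):
        fixed = re.sub(r'^(https?://[^/]+)(/\.\.)+/', r'\1/', request.url)
        if fixed != request.url:
            # Returning a new Request reschedules the cleaned URL through the
            # middleware chain instead of downloading the malformed one.
            return request.replace(url=fixed)
        return None  # leave well-formed requests untouched

It would then be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.FixDotSegmentsMiddleware': 543}, where the module path and the priority number are placeholders for your project.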
3) Use a base spider and return the manipulated URL requests you want to crawl further yourself; see the sketch after the link below.
BaseSpider documentation: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
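A rough sketch of option 3, assuming the old BaseSpider / HtmlXPathSelector API from Scrapy 0.16; the spider name and the link-filtering regex are hypothetical, and the Product extraction is left as in the question:

import re
import urlparse

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


class siteBaseSpider(BaseSpider):
    name = "domain-name.com-base"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # Resolve the link ourselves, then collapse any '../' left above
            # the site root, the way a browser would.
            url = urlparse.urljoin(response.url, href)
            url = re.sub(r'^(https?://[^/]+)(/\.\.)+/', r'\1/', url)
            if re.search(r'/en/item-[a-z0-9-]+-scrap\.html$', url):
                yield Request(url, callback=self.parse_item)
            else:
                yield Request(url, callback=self.parse)

    def parse_item(self, response):
        # ... extract the Product exactly as in the original parse_item ...
        pass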
Please let me know if you have any questions.
Tags: scrapy, python, web-crawler  Source: https://codeday.me/bug/20191122/2060617.html