
python - Avoiding bad requests caused by relative URLs


I'm trying to scrape a website with Scrapy, and the URL of every page I want to scrape is written with a relative path of this kind:

<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>

Now, in my browser these links work and you land on a URL like https://www.domain-name.com/en/item-to-scrap.html (even though the relative path goes up two levels in the hierarchy instead of one).

However, my CrawlSpider fails to translate these URLs into "correct" ones, and all I get are errors of this kind:

2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request

Is there a way to fix this, or am I missing something?

Here is the code of my spider, which is fairly basic (item URLs match "/en/item-*-scrap.html"):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

Solution:

Basically, Scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin to obtain the next URL, by joining the URL currently being scraped with the link's URL. If you join the URLs you gave as an example,

<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>

the resulting URL is the same one mentioned in the Scrapy error. Try it in a Python shell:

import urlparse 
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
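# -> 'https://www.domain-name.com/../en/item-to-scrap.html'
# (the same malformed URL that appears in the error above)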

The urljoin behaviour seems to be valid. See: http://tools.ietf.org/html/rfc1808.html#section-5.2

If possible, could you share the site you are crawling?

With this understanding, the possible solutions are:

1) Manipulate the URLs (strip the extra dot segments and slash) as they are generated in the CrawlSpider, basically by overriding parse or _requests_to_follow (sketched below).

CrawlSpider source code: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
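
A minimal sketch of this approach, assuming the leftover "../" segments only need to be collapsed (urljoin has already resolved every ".." it could, so any that remain sit right after the domain). Note that _requests_to_follow is an internal CrawlSpider method, so its behaviour may change between Scrapy versions:

class siteSpider(CrawlSpider):
    # ... name, allowed_domains, start_urls and rules exactly as in the question ...

    def _requests_to_follow(self, response):
        # let CrawlSpider build the requests as usual, then fix their URLs
        for request in super(siteSpider, self)._requests_to_follow(response):
            url = request.url
            while '/../' in url:
                url = url.replace('/../', '/', 1)
            yield request.replace(url=url)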

2) Manipulate the URL in a downloader middleware, which may be cleaner: you can strip the ../ in the process_request method of a downloader middleware (sketched below).

Downloader middleware documentation: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
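
A minimal sketch of such a middleware; the class name is only illustrative, not an existing Scrapy component:

class FixDotSegmentsMiddleware(object):
    def process_request(self, request, spider):
        if '/../' not in request.url:
            return None  # well-formed request, let it pass through unchanged
        url = request.url
        while '/../' in url:
            url = url.replace('/../', '/', 1)
        # returning a new Request makes Scrapy schedule it with the fixed URL
        return request.replace(url=url)

Enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.FixDotSegmentsMiddleware': 543} (the module path and priority are placeholders).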

3) Use a BaseSpider instead, and yield the manipulated URLs yourself as the requests you want to crawl further (sketched below).

BaseSpider documentation: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
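
A minimal sketch using the 0.16-era BaseSpider API; the XPath and the "-scrap.html" filter are placeholders for whatever actually identifies your item links:

import urlparse

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class siteSpider(BaseSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            while '/../' in url:  # collapse the leftover dot segments
                url = url.replace('/../', '/', 1)
            if '-scrap.html' in url:
                yield Request(url, callback=self.parse_item)
            else:
                yield Request(url, callback=self.parse)

    def parse_item(self, response):
        pass  # extract the Product fields here, as in the original spider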

Let me know if you have any questions.

Tags: scrapy, python, web-crawler