python - Scrapy: crawling all links in a sitemap

2019-10-10 06:00:05 Author: Internet

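I want to crawl all the links listed in the sitemap.xml of a fixed site. I came across Scrapy's SitemapSpider. So far I have extracted all the URLs in the sitemap. Now I want to crawl each of those links. Any help would be much appreciated. My code so far: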

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        # Print the URL of each response this callback receives
        print(response.url)

Solution:

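You need to add sitemap_rules to process the data from the crawled URLs, and you can create as many rules as you like.
For example, say you have a page at http://www.xyz.nl//x/ and want to create a rule for it: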

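from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # List of (regex, callback) tuples; this example has a single rule.
    # The callback is given by name because parse_x is defined further down.
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        # Extract every <p> element from the page
        paragraphs = response.xpath('//p').getall()
        # Yield a dict item rather than returning a bare list of strings
        yield {'url': response.url, 'paragraphs': paragraphs}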
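
Since you can list several rules, it may help to see how a URL is paired with a callback. The sketch below is a simplified, Scrapy-free illustration of that behaviour, not Scrapy's own code: each sitemap URL is tested against the rule patterns in order, and the first match decides the callback. The names SITEMAP_RULES, pick_callback, parse_y and the example URLs are made up for this illustration.

```python
import re

# Hypothetical rules for illustration: (pattern, callback name) pairs,
# in the same shape as SitemapSpider.sitemap_rules.
SITEMAP_RULES = [
    (r"/x/", "parse_x"),
    (r"/y/", "parse_y"),
    (r"", "parse"),  # an empty pattern matches every URL (catch-all)
]


def pick_callback(url, rules=SITEMAP_RULES):
    """Return the callback name of the first rule whose regex matches url."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None  # no rule matched, so the URL would be skipped


if __name__ == "__main__":
    for url in [
        "http://www.xyz.nl/x/page1",
        "http://www.xyz.nl/y/page2",
        "http://www.xyz.nl/about",
    ]:
        print(url, "->", pick_callback(url))
```

Because the first match wins, put specific patterns before general ones. With only the single ('/x/', 'parse_x') rule from the answer, sitemap URLs that do not contain /x/ would not be sent to any callback.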

Tags: sitemap,python,scrapy,web-crawler
Source: https://codeday.me/bug/20191010/1884552.html