Can we fetch only a web page's headers and not the body? (mechanize)

Author: Internet

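What should I do if I only want to download a page when it has changed since the last download?
What is the best approach? Can I first get the page's size and compare it with the previous value to decide whether the page has changed, then download it if it has and skip it otherwise?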

I plan to use mechanize (Python).

Solution:

The request should be a HEAD, not a GET:

9.4 HEAD

The HEAD method is identical to GET
except that the server MUST NOT return
a message-body in the response. The
metainformation contained in the HTTP
headers in response to a HEAD request
SHOULD be identical to the information
sent in response to a GET request.
This method can be used for obtaining
metainformation about the entity
implied by the request without
transferring the entity-body itself.
This method is often used for testing
hypertext links for validity,
accessibility, and recent
modification.

The response to a HEAD request MAY be
cacheable in the sense that the
information contained in the response
MAY be used to update a previously
cached entity from that resource. If
the new field values indicate that the
cached entity differs from the current
entity (as would be indicated by a
change in Content-Length, Content-MD5,
ETag or Last-Modified), then the cache
MUST treat the cache entry as stale.
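
As a rough illustration of the check described in the quoted text, here is a minimal sketch that sends a HEAD request and compares the validator headers with a saved copy. It uses the standard library's urllib.request rather than mechanize, which is a substitution made for this example only. The names head_headers, has_changed and VALIDATORS are hypothetical, and the sketch assumes the server answers HEAD requests and sends at least one of these headers.

```python
import urllib.request

# Headers whose change marks a saved copy as stale, per the quoted RFC text.
VALIDATORS = ("ETag", "Last-Modified", "Content-MD5", "Content-Length")


def head_headers(url, timeout=10):
    """Send a HEAD request and return the validator headers that are present."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Header lookup is case-insensitive and yields None when missing.
        return {name: response.headers[name]
                for name in VALIDATORS
                if response.headers[name] is not None}


def has_changed(old, new):
    """Return True if the page should be downloaded again."""
    shared = [name for name in VALIDATORS if name in old and name in new]
    if not shared:
        # Nothing to compare, so assume the page changed and download it.
        return True
    return any(old[name] != new[name] for name in shared)
```

A caller would store the dictionary returned by head_headers after each download and pass it as old on the next run. Comparing Content-Length alone can miss an edit that leaves the size unchanged, so ETag or Last-Modified is the more reliable signal when the server provides one.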

See also: How can I perform a HEAD request with the mechanize library

Tags: mechanize, screen-scraping, python
Source: https://codeday.me/bug/20191024/1917946.html