
How to Scrape Web Page Source More Easily with Python: A Selenium Page Source Tutorial

2023-02-04 19:08:46 Author: Internet

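When reviewing a website, you often need to get a page's source code, and doing so is an important part of Python test automation. This article from icode9 explores how to get the page source with Selenium WebDriver and demonstrates how Selenium can fetch an XML page source when used with Python.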

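Retrieving the page source of a website under review is a routine task for most test automation engineers. Analyzing the page source helps eliminate bugs found during regular website testing, functional testing, or security testing drills. When testing a large, complex application, you can write an automated test script that, on detecting an error, automatically:
- Saves the source code of that particular page.
- Notifies the person responsible for the page URL.
- Extracts the HTML source of the specific element or code block and delegates it to the responsible party, if the error occurred in one particular, independent HTML web element or code block.
This is a simple way to track down and fix logical and syntactic errors in front-end code. In this article, we first go over the terminology involved and then explore how to get the page source in Selenium WebDriver using Python.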
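The first step in the list above (saving the failing page's source) can be sketched roughly as follows. This is only an illustration: the helper name save_source_on_failure and the failed_pages directory are assumptions, and the notification step is left out. The function relies only on the page_source and current_url attributes, which a Selenium WebDriver exposes.

```python
from pathlib import Path
from urllib.parse import urlparse


def save_source_on_failure(driver, out_dir="failed_pages"):
    """Write driver.page_source to <out_dir>/<host>.html and return the path.

    `driver` is any object with `page_source` and `current_url` attributes.
    The function name and directory layout are hypothetical.
    """
    # Use the host part of the URL as the file name; ":" (from a port) is
    # replaced because it is not allowed in file names on some platforms.
    host = urlparse(driver.current_url).netloc.replace(":", "_") or "unknown_host"
    target = Path(out_dir) / f"{host}.html"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(driver.page_source, encoding="utf-8")
    return target
```

In a test script, a call such as save_source_on_failure(driver) would typically sit in the except branch that handles a failed assertion.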
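What is HTML page source?
In non-technical terms, it is a set of instructions that tells the browser how to display information on screen in a visually pleasing way. Each browser interprets these instructions in its own way to render the screen for the client. They are usually written in HyperText Markup Language (HTML), Cascading Style Sheets (CSS), and JavaScript.
The complete set of HTML instructions that make up a web page is called the page source, the HTML source, or simply the source code. A website's source code is the collection of the source code of its individual web pages.
Below is an example of the source code for a basic page with a heading, a form, an image, and a submit button.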

    <!DOCTYPE html>
    <html>
    <head>
        <title>Page Source Example - LambdaTest</title>
    </head>
    <body>
        <h2>Debug selenium testing results : LambdaTest</h2>
        <img loading="lazy" src="https://cdn.lambdatest.com/assetsnew/images/debug-selenium-testing-results.jpg" alt="debug selenium testing" width="550" height="500"><br><br>
        <form action="/">
            <label for="debug">Do you debug test results using LambdaTest?</label><br>
            <input type="text" id="debug" name="debug" value="Of-course!"><br>
            <br>
            <input type="submit" value="Submit">
        </form>
        <br><br>
        <button type="button" onclick="alert('Page Source Example : LambdaTest!')">Click Me!</button>
    </body>
    </html>

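What is an HTML web element?
The simplest way to describe an HTML web element is: "any HTML tag that makes up the HTML page source is a web element." It can be a block of HTML code, a standalone HTML tag such as <br>, a media object on the page (image, audio, video), a JS function, or a JSON object wrapped in <script> </script> tags.
In the example above, <title> is an HTML web element, and so are the children of the body tag, such as <h2>, <img>, <form>, and <button>.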
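To make the idea of web elements concrete, here is a small sketch that lists every element in a page source using only Python's standard html.parser module. The SAMPLE string is a shortened stand-in for the example page (its image path is a placeholder), and the names TagCollector and list_elements are our own.

```python
from html.parser import HTMLParser

# Shortened stand-in for the example page; the image path is a placeholder.
SAMPLE = """<!DOCTYPE html>
<html>
<head><title>Page Source Example - LambdaTest</title></head>
<body>
<h2>Debug selenium testing results : LambdaTest</h2>
<img src="debug.jpg" alt="debug selenium testing">
<form action="/"><input type="submit" value="Submit"></form>
<button type="button">Click Me!</button>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def list_elements(source):
    """Return the tag names of all elements found in `source`."""
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags


print(list_elements(SAMPLE))
# ['html', 'head', 'title', 'body', 'h2', 'img', 'form', 'input', 'button']
```

Each name in the printed list corresponds to one web element of the page.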
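How to get the page source in Selenium WebDriver using Python
Selenium WebDriver is a powerful test automation tool that gives automation engineers a diverse set of ready-to-use APIs. To get the page source, the Selenium Python bindings provide the driver attribute page_source, which returns the HTML source of the URL currently active in the browser.
Alternatively, we can load the page source with the get function of Python's requests library. Another approach is to execute JavaScript through the driver method execute_script. A less recommended way to get the page source is to combine XPath with the "view-source:" URL. Let's look at examples of these four ways to get the page source in Selenium WebDriver using Python.
For all four examples we use a small sample web page hosted on GitHub. The page was created to demonstrate drag-and-drop testing in Selenium Python using LambdaTest.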
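Getting the HTML page source with driver.page_source
We fetch pynishant.github.io with ChromeDriver and save its content to a file named page_source.html. You can choose any file name. Next, we read the file's content and print it on the terminal before closing the driver:

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://pynishant.github.io/")
    pageSource = driver.page_source

    # Write the source to disk; UTF-8 avoids platform-dependent encoding errors.
    with open("page_source.html", "w", encoding="utf-8") as fileToWrite:
        fileToWrite.write(pageSource)

    # Read the file back and print it before closing the driver.
    with open("page_source.html", "r", encoding="utf-8") as fileToRead:
        print(fileToRead.read())

    driver.quit()

After the script runs successfully, your terminal prints the page source.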
Getting the HTML page source with driver.execute_script
In the previous example, comment out (or replace) the driver.page_source line and add the line below. driver.execute_script is the Selenium Python WebDriver API for executing JS in a Selenium environment. Here, we execute a JS script that returns the HTML of the body element.

    # pageSource = driver.page_source
    pageSource = driver.execute_script("return document.body.innerHTML;")

This returns only the innerHTML of the body element. Unlike the previous output, we do not get the whole page source. To get the whole document, we execute document.documentElement.outerHTML. The execute_script line now looks like this:

    pageSource = driver.execute_script("return document.documentElement.outerHTML;")

This gives us essentially the same output we got with driver.page_source.
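The two JavaScript snippets differ only in how much of the document they return. One way to keep that choice explicit is a small wrapper like the sketch below. The get_source name and the scope values are our own invention; the only driver method used is execute_script.

```python
# Map a human-readable scope to the JavaScript that returns it.
SCRIPTS = {
    "body": "return document.body.innerHTML;",
    "document": "return document.documentElement.outerHTML;",
}


def get_source(driver, scope="document"):
    """Return the HTML for `scope` by running JavaScript through the driver.

    `driver` is any object with an execute_script(script) method,
    such as a Selenium WebDriver. The helper itself is hypothetical.
    """
    if scope not in SCRIPTS:
        raise ValueError(f"unknown scope: {scope!r}")
    return driver.execute_script(SCRIPTS[scope])
```

With a live driver, get_source(driver, "body") returns only the body's content, while get_source(driver) returns the whole <html> element.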
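Getting the page source with Python's requests library
This method has nothing to do with Selenium (for background, see the "What Is Selenium?" article); it is a purely Pythonic way to get a web page's source. Here, we use Python's requests library to send a GET request to the URL, save the response (that is, the page source) to an HTML file, and print it on the terminal.
Here is the script:

    import requests

    url = 'https://pynishant.github.io/'
    pythonResponse = requests.get(url)

    # Save the response body (the page source) to a file.
    with open("py_source.html", "w", encoding="utf-8") as fileToWrite:
        fileToWrite.write(pythonResponse.text)

    # Read the file back and print it.
    with open("py_source.html", "r", encoding="utf-8") as fileToRead:
        print(fileToRead.read())

This method is useful for quickly storing a page's source without loading the page in a Selenium-controlled browser. Because no browser is involved, content that JavaScript renders after the page loads does not appear in the response. Similarly, we can use Python's urllib library to get the HTML page source.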
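For the urllib variant mentioned above, a minimal sketch using only the standard library could look like this. The function name fetch_source and the 10-second timeout are assumptions for illustration.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10):
    """Return the body of `url` as text, decoded with the declared charset."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (requires network access):
# html = fetch_source("https://pynishant.github.io/")
```

As with requests, this returns the HTML as the server sent it, before any JavaScript runs.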
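Getting the HTML page source with a "view-source" URL
This is rarely needed, but during manual testing you can prefix the target URL with view-source: and load it in a browser window to view the source and save it.
Programmatically, to take a screenshot of the source code in Python Selenium (if required), you can load the page with: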
driver.get("view-source:https://pynishant.github.io/")

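Getting the HTML page source in Selenium Python WebDriver with XPath
The fourth way to make Selenium WebDriver get the page source is to use XPath. Here, instead of using page_source or executing JavaScript, we locate the source element, <html>, and extract it. Comment out the earlier page-source logic and replace it with the following:

    from selenium.webdriver.common.by import By

    # pageSource = driver.page_source
    pageSource = driver.find_element(By.XPATH, "//*").get_attribute("outerHTML")

In the script above, we use the driver method find_element with By.XPATH to locate the page's HTML element (Selenium 4.3 removed the older find_element_by_xpath helper). The "//*" expression matches every element, and find_element returns the first match, which is the root <html> node. We then get its "outer HTML", which is the document itself. The output looks the same as what we got earlier with driver.page_source.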
How to retrieve a WebElement's HTML source in Selenium
To get the HTML source of a WebElement in Selenium WebDriver, we can use the get_attribute() method of the Selenium Python WebElement. First, we grab the HTML WebElement with a driver locator call such as find_element(By.XPATH, ...) or find_element(By.CSS_SELECTOR, ...). Next, we apply get_attribute() to the grabbed element to get its HTML source.
Suppose we want to get and print the source of the div with ID "div1" from pynishant.github.io. The code looks like this:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://pynishant.github.io/")

    # outerHTML includes the div's own tag as well as its children.
    elementSource = driver.find_element(By.ID, "div1").get_attribute("outerHTML")
    print(elementSource)

    driver.quit()

The script prints the outer HTML of that div.
Similarly, to get the children, or innerHTML, of a WebElement:

    driver.find_element(By.ID, "some_id_or_selector").get_attribute("innerHTML")

There is another way to do this and get the same result:

    elementSource = driver.find_element(By.ID, "id_selector_as_per_requirement")
    driver.execute_script("return arguments[0].innerHTML;", elementSource)
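The difference between outerHTML and innerHTML can be illustrated without a browser. The sketch below uses Python's xml.etree.ElementTree on a small, well-formed fragment. The fragment's content is made up for the example, and the helpers only approximate what a browser returns for real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real div1 on the sample page may differ.
FRAGMENT = '<div id="div1"><p>Drag me</p></div>'


def outer_html(element):
    """Serialize the element itself together with its children."""
    return ET.tostring(element, encoding="unicode")


def inner_html(element):
    """Serialize only the element's content: leading text plus children."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in element)
    return (element.text or "") + children


div = ET.fromstring(FRAGMENT)
print(outer_html(div))  # <div id="div1"><p>Drag me</p></div>
print(inner_html(div))  # <p>Drag me</p>
```

The outer form keeps the div tag and its attributes, while the inner form drops them and keeps only what is nested inside.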

How to retrieve JSON data from the HTML page source in Python Selenium WebDriver
Modern applications are built on multiple APIs. These APIs often change the content of HTML elements dynamically. JSON objects have emerged as an alternative to XML response types. A professional Selenium Python tester therefore has to handle JSON objects, especially those embedded in <script> HTML tags. Python ships with a built-in json library for working with JSON objects.
To illustrate, we load "https://www.cntraveller.in/" in the Selenium driver and look for the SEO schema it contains in <script type="application/ld+json"> </script>, to verify that the logo URL is included in the JSON schema. In case you are wondering, this "SEO schema" helps web pages rank on Google. It has nothing to do with code logic or testing; we use it only for demonstration.
We use LambdaTest in this demo:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import json
    import re

    # Replace these placeholders with your own LambdaTest credentials.
    username = "your_username_on_lambdaTest"
    accessToken = "your lambdaTest access token"
    gridUrl = "hub.lambdatest.com/wd/hub"

    desired_cap = {
        'platform': "win10",
        'browserName': "chrome",
        'version': "71.0",
        "resolution": "1024x768",
        "name": "LambdaTest json object test ",
        "build": "LambdaTest json object test",
        "network": True,
        "video": True,
        "visual": True,
        "console": True,
    }

    url = "https://" + username + ":" + accessToken + "@" + gridUrl
    print("Initiating remote driver on platform: " + desired_cap["platform"]
          + " browser: " + desired_cap["browserName"]
          + " version: " + desired_cap["version"])

    # desired_capabilities is accepted by Selenium 3 and early Selenium 4 releases;
    # Selenium 4.10+ expects an options object passed as options= instead.
    driver = webdriver.Remote(
        desired_capabilities=desired_cap,
        command_executor=url
    )
    # driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://www.cntraveller.in/")

    # Locate the JSON-LD script, strip semicolons, and parse it.
    jsonSource = driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('text')
    jsonSource = re.sub(";", "", jsonSource)
    jsonSource = json.loads(jsonSource)

    if "logo" in jsonSource:
        print("\n logoURL : " + str(jsonSource["logo"]))
    else:
        print("JSON Schema has no logo url.")

    try:
        if "telephone" in jsonSource:
            print(jsonSource["telephone"])
        else:
            print("No Telephone - here is the source code :\n")
            print(driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('outerHTML'))
    except Exception as e:
        print(e)

    driver.quit()

The output contains the logoURL and the WebElement source.
Code breakdown
The following lines import the libraries we need: Selenium WebDriver and its By locator class, plus Python's json and re libraries for handling JSON objects and using regular expressions:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import json
    import re

Next, we configure the script to run on LambdaTest's cloud. I got started in under thirty seconds (perhaps because I had prior experience with the platform), but even if you are a beginner it takes less than a minute. Register on the official LambdaTest website, sign in with Google, and click "Profile" to copy your username and access token:

    username = "your_username_on_lambdaTest"
    accessToken = "your lambdaTest access token"
    gridUrl = "hub.lambdatest.com/wd/hub"

    desired_cap = {
        'platform': "win10",
        'browserName': "chrome",
        'version': "71.0",
        "resolution": "1024x768",
        "name": "LambdaTest json object test ",
        "build": "LambdaTest json object test",
        "network": True,
        "video": True,
        "visual": True,
        "console": True,
    }

    url = "https://" + username + ":" + accessToken + "@" + gridUrl

We start the driver with a maximized window and load the cntraveller home page with the following lines:

    # desired_capabilities is accepted by Selenium 3 and early Selenium 4 releases;
    # Selenium 4.10+ expects an options object passed as options= instead.
    driver = webdriver.Remote(
        desired_capabilities=desired_cap,
        command_executor=url
    )
    # driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://www.cntraveller.in/")

Now we use an XPath locator to find the script that contains the JSON object, and remove the unneeded semicolons so that the string loads as valid JSON:

    jsonSource = driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('text')
    jsonSource = re.sub(";", "", jsonSource)
    jsonSource = json.loads(jsonSource)

Then we check whether the logo URL is present. If it is, we print it:

    if "logo" in jsonSource:
        print("\n logoURL : " + str(jsonSource["logo"]))
    else:
        print("JSON Schema has no logo url.")

We also check whether the telephone detail is present. If it is not, we print the WebElement's source code:

    try:
        if "telephone" in jsonSource:
            print(jsonSource["telephone"])
        else:
            print("No Telephone - here is the source code :\n")
            print(driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('outerHTML'))
    except Exception as e:
        print(e)

Finally, we quit the driver:

    driver.quit()
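If the page source is already available as a string (for example, from driver.page_source), the same JSON-LD extraction can be sketched with the standard library alone. This is an alternative to the XPath approach above, not a replacement for it. The names JsonLdCollector and extract_json_ld, and the sample markup with its example.com URL, are made up for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical page source with one JSON-LD block ending in a stray semicolon.
SAMPLE = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Organization", "logo": "https://example.com/logo.png"};'
    '</script></head><body></body></html>'
)


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._chunks))
            self._inside = False


def extract_json_ld(page_source):
    """Return a list of parsed JSON-LD objects found in `page_source`."""
    collector = JsonLdCollector()
    collector.feed(page_source)
    collector.close()
    parsed = []
    for block in collector.blocks:
        # Strip only a trailing semicolon so semicolons inside strings survive.
        text = block.strip().rstrip(";").strip()
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
    return parsed


schema = extract_json_ld(SAMPLE)[0]
print(schema["logo"])  # https://example.com/logo.png
```

Stripping only a trailing semicolon, rather than every semicolon, is a deliberate choice here: it leaves any semicolons inside JSON string values untouched.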
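How to get the page source as XML in Selenium WebDriver
If you are loading a site rendered as XML, you may want to save the XML response. Here is a working solution for making Selenium get the XML page source:

    driver.execute_script('return document.getElementById("webkit-xml-viewer-source-xml").innerHTML')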
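Once the XML has been retrieved as a string, it can be parsed with Python's built-in xml.etree.ElementTree. The sketch below assumes the returned string is well-formed XML. The sample document and the list_locations helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, standing in for the string returned by execute_script.
SAMPLE_XML = """<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def list_locations(xml_source):
    """Return the text of every <loc> element in the XML string."""
    root = ET.fromstring(xml_source)
    return [loc.text for loc in root.iter("loc")]


print(list_locations(SAMPLE_XML))
# ['https://example.com/', 'https://example.com/about']
```

Real documents such as sitemaps often declare an XML namespace, in which case the tag name passed to iter() needs the namespace prefix in braces.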

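Conclusion
You can use any of the methods above and take advantage of the agility and scalability of the LambdaTest Selenium Grid cloud to automate your testing process. It lets you run test cases on more than 3,000 browsers, operating systems, and their versions. You can also integrate your automated testing pipeline with modern CI/CD tools and follow continuous testing best practices.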

Tags: Python web scraping, selenium, getting web page source, test automation