Parsing HTML with Python and Beautiful Soup

Author: Internet

<div class="profile-row clearfix"><div class="profile-row-header">Member Since</div><div class="profile-information">January 2010</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">AIGA Chapter</div><div class="profile-information">Alaska</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Title</div><div class="profile-information">Owner</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Company</div><div class="profile-information">Mad Dog Graphx</div></div>

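I am using Beautiful Soup and have reached this point in the HTML. I now want to search the markup and extract values such as January 2010, Alaska, Owner, and Mad Dog Graphx. All of these values share the same class, but each is preceded by a different label, such as "Member Since", "AIGA Chapter", and so on. How can I search for "Member Since" and then get January 2010? And how do I do the same for the other three fields?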

Solution:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''<div class="profile-row clearfix"><div class="profile-row-header">Member Since</div><div class="profile-information">January 2010</div></div>
... <div class="profile-row clearfix"><div class="profile-row-header">AIGA Chapter</div><div class="profile-information">Alaska</div></div>
... <div class="profile-row clearfix"><div class="profile-row-header">Title</div><div class="profile-information">Owner</div></div>
... <div class="profile-row clearfix"><div class="profile-row-header">Company</div><div class="profile-information">Mad Dog Graphx</div></div>
... ''', 'html.parser')
>>> for row in soup.find_all('div', {'class': 'profile-row clearfix'}):
...     # each row holds exactly two text nodes: the header label and its value
...     field, value = row.find_all(string=True)
...     print(field, value)
... 
Member Since January 2010
AIGA Chapter Alaska
Title Owner
Company Mad Dog Graphx

Of course, you can do whatever you like with field and value, such as building a dictionary from them or storing them in a database.
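
As a minimal sketch of the dictionary idea, assuming Beautiful Soup 4 (the `bs4` package) on Python 3; the names `html` and `profile` are illustrative and not part of the original answer:

```python
from bs4 import BeautifulSoup

# Same markup as in the question
html = '''<div class="profile-row clearfix"><div class="profile-row-header">Member Since</div><div class="profile-information">January 2010</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">AIGA Chapter</div><div class="profile-information">Alaska</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Title</div><div class="profile-information">Owner</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Company</div><div class="profile-information">Mad Dog Graphx</div></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Map each header label to its value
profile = {}
for row in soup.find_all('div', class_='profile-row'):
    # Works only while every row contains exactly two text nodes
    field, value = row.find_all(string=True)
    profile[str(field)] = str(value)

print(profile['Member Since'])  # January 2010
```

With the dictionary in hand, a lookup such as `profile['Company']` answers the original question directly.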

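If the "profile-row clearfix" div contains other divs or additional text nodes, the two-value unpacking above will fail, because it expects exactly two text nodes per row. In that case you need something like this instead: field = row.find('div', {'class': 'profile-row-header'}).findAll(text=True) (spelled find_all(string=True) in Beautiful Soup 4), and so on.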
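
One way this more defensive lookup might look, again assuming Beautiful Soup 4 on Python 3. The helper name `get_profile_field` is hypothetical, and the sample markup is the question's data re-indented so that each row contains extra whitespace text nodes:

```python
from bs4 import BeautifulSoup

# The question's rows, re-indented: the added whitespace creates extra
# text nodes, so unpacking "field, value" from all text nodes would fail.
html = '''
<div class="profile-row clearfix">
  <div class="profile-row-header">Member Since</div>
  <div class="profile-information">January 2010</div>
</div>
<div class="profile-row clearfix">
  <div class="profile-row-header">Company</div>
  <div class="profile-information">Mad Dog Graphx</div>
</div>
'''


def get_profile_field(soup, label):
    """Return the value paired with the given header label, or None if absent."""
    for row in soup.find_all('div', class_='profile-row'):
        # Address the two inner divs by class instead of by position
        header = row.find('div', class_='profile-row-header')
        info = row.find('div', class_='profile-information')
        if header is None or info is None:
            continue
        if header.get_text(strip=True) == label:
            return info.get_text(strip=True)
    return None


soup = BeautifulSoup(html, 'html.parser')
print(get_profile_field(soup, 'Member Since'))  # January 2010
print(get_profile_field(soup, 'Company'))       # Mad Dog Graphx
```

Selecting the header and value divs by class keeps the lookup independent of how many other nodes the row happens to contain.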

Tags: beautifulsoup, web-scraping, find, html, python
Source: https://codeday.me/bug/20191208/2089781.html