Easy Web Scraping: Scrapy Framework Tricks 8, Monkey Steals the Peach (4)

Author: 互联网 (Internet)

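In this lesson we look at the commonly used methods of the bs4 library, again using the data below as the example.

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# prettify() returns a new string; it does not modify soup in place,
# so print its return value to see the structured HTML.
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
# (rest of the output omitted here)
```

## Filters

`find_all()` searches for all matching tags and returns them in a list. The argument you pass is the filter, and it can take several forms.

### String

A string is matched against tag names.

```python
print(soup.find_all('b'))
# [<b>The Dormouse's story</b>]
```

### Regular expression

We will cover regular expressions themselves another time. For now it is enough to know that you can write the filter like this; the pattern `^b` matches every tag whose name starts with `b`.

```python
import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
```

### List

If you pass a list, Beautiful Soup returns everything that matches any item in the list.

```python
print(soup.find_all(["a", "b"]))
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

### True

`True` matches every tag. The code below finds all the tags in the document, but it does not return the text string nodes.

```python
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
```

### Function

If none of the filters above fits, you can define a function instead. It receives a tag as its only argument and returns `True` when the tag should be kept.

```python
def has_class_but_no_id(tag):
    # Keep tags that have a class attribute but no id attribute.
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>,
#  <p class="story">...</p>]
```

The three `<a>` tags are not matched on their own because each of them has an `id`; they only appear here as children of the second `<p>`.

`find_all()` covers a lot of ground, so take some time to digest this part first. We will continue in the next lesson.

Writing takes effort, so feel free to leave a comment or bookmark this post, or join the [group chat](https://jq.qq.com/?_wv=1027&k=vH00muGu) to learn and improve together.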
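To compare the five filter types side by side, here is a minimal sketch that runs each of them against one small document. The markup in `sample_html` and the helper `names()` are made up for illustration and are not part of the lesson's data; only `find_all()`, `has_attr()` and `get_text()` are real bs4 calls.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
sample_html = """
<div id="menu">
  <a class="item" href="/apple">Apple</a>
  <a class="item" href="/banana" id="featured">Banana</a>
  <b>Sale</b>
  <span>Today only</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")


def names(tags):
    """Hypothetical helper: tag names of a find_all() result, in document order."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    """Function filter: keep tags that have a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(sample_soup.find_all("b"))                  # match one tag name
by_regex = names(sample_soup.find_all(re.compile("^[bs]")))   # names starting with b or s
by_list = names(sample_soup.find_all(["a", "b"]))             # any name in the list
by_true = names(sample_soup.find_all(True))                   # every tag, no text nodes
by_function = [t.get_text() for t in sample_soup.find_all(has_class_but_no_id)]

print(by_string)    # ['b']
print(by_regex)     # ['b', 'span']
print(by_list)      # ['a', 'a', 'b']
print(by_true)      # ['div', 'a', 'a', 'b', 'span']
print(by_function)  # ['Apple']
```

Only the first link survives the function filter: the `<div>` has an `id` but no `class`, and the second link has both.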

Source: https://blog.51cto.com/u_15241290/3019064