Python BeautifulSoup: getting the text of first-level tags

Author: Internet

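I need to get the text of the a tags only for the first level of li tags, using BeautifulSoup in Python.

The problem is that these li tags contain other li tags, which in turn contain other a tags.

Sample HTML: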

<li>
   <a href="http://lol.lol">Text1</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text1</a><-- DON'T GET THIS
   </li>
</li>
<li>
   <a href="http://lol.lol">Text2</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text2-2</a><-- DON'T GET THIS
   </li>
</li>

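Edit:

I have been testing, and I am not getting only the first a tag of each item.

This is the original markup I am trying to extract from: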

<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">
<span class="icon-box fa fa-bars"></span>
RELOJES
</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/125-miss-sixty" title="Miss Sixty">
Miss Sixty
<span id="leo-cat-125" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/49-converse" title="Converse">
Converse
<span id="leo-cat-49" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/61-armand-basi" title="Armand Basi">
Armand Basi
<span id="leo-cat-61" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/79-marea" title="Marea">
Marea
<span id="leo-cat-79" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/86-marc-ecko" title="Marc Ecko">
Marc Ecko
<span id="leo-cat-86" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/107-festina" title="Festina">
Festina
<span id="leo-cat-107" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/135-seiko" title="Seiko">
Seiko
<span id="leo-cat-135" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/221-relojes-swatch-liquidar" title="Relojes Swatch liquidar">
Relojes Swatch liquidar
<span id="leo-cat-221" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/184-lotus" title="Lotus">
Lotus
<span id="leo-cat-184" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/195-lotus-hombre" title="Lotus Hombre">
Lotus Hombre
<span id="leo-cat-195" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/196-lotus-mujer" title="Lotus Mujer">
Lotus Mujer
<span id="leo-cat-196" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/236-lotus-infantil" title="Lotus Infantil">
Lotus Infantil
<span id="leo-cat-236" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/218-daniel-wellington" title="Daniel Wellington">
Daniel Wellington
<span id="leo-cat-218" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/197-viceroy" title="Viceroy">
Viceroy
<span id="leo-cat-197" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/198-viceroy-hombre" title="Viceroy Hombre">
Viceroy Hombre
<span id="leo-cat-198" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/199-viceroy-mujer" title="Viceroy Mujer">
Viceroy Mujer
<span id="leo-cat-199" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/235-viceroy-infantil" title="Viceroy Infantil">
Viceroy Infantil
<span id="leo-cat-235" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/51-ice-watch" title="Ice watch">
Ice watch
<span id="leo-cat-51" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/64-relojes-swatch" title="Relojes Swatch">
Relojes Swatch
<span id="leo-cat-64" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/80-mark-maddox" title="Mark Maddox">
Mark Maddox
<span id="leo-cat-80" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/81-ferrari" title="Ferrari">
Ferrari
<span id="leo-cat-81" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/173-relojes-cadete" title="Relojes Cadete">
Relojes Cadete
<span id="leo-cat-173" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/200-tous" title="Tous">
Tous
<span id="leo-cat-200" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/201-tous-kids" title="Tous Kids">
Tous Kids
<span id="leo-cat-201" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/203-tous-mujer" title="Tous Mujer">
Tous Mujer
<span id="leo-cat-203" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/204-tous-hombre" title="Tous Hombre">
Tous Hombre
<span id="leo-cat-204" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>

This is the code I tried:

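import requests
from bs4 import BeautifulSoup

req2 = requests.get(url2)  # url2 is not shown in the question
html2 = BeautifulSoup(req2.text, 'html.parser')
catmenu = html2.find('div', {'id': 'categories_block_left'})
# recursive=False only looks at direct children of the div, and none of them is an <li>
categorys = catmenu.find_all('li', recursive=False)
for cat in categorys:
    categor = cat.find('a').getText()
    print("   SubCategor:%s" % categor)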

But it returns nothing. I only need the first a tag of each top-level li.
Example:

OUTLET,
Lotus,
Daniel Wellington,
Viceroy,
Ice watch,
Relojes Swatch,
Mark Maddox,
Ferrari,
Relojes Cadete,
Tous,
Certina

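Solution:

You can pass recursive=False to find_all so that it only considers direct children of the object it is called on. For the sample HTML, that returns only the top-level li tags: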

In [62]: soup.find_all('li', recursive=False)
Out[62]: 
[<li>
 <a href="http://lol.lol">Text1</a>
 <li>
 <a href="http://lol.lol">Text1</a>
 </li>
 </li>, <li>
 <a href="http://lol.lol">Text2</a>
 <li>
 <a href="http://lol.lol">Text2-2</a>
 </li></li>]

Then you can retrieve the text of the first a tag in each li:

In [63]: [li.find('a').text for li in soup.find_all('li', recursive=False)]
Out[63]: ['Text1', 'Text2']
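
The same idea can probably be applied to the real markup from the question. There the li tags are not direct children of the div, so the call has to be made on the outer ul instead. The following is only a sketch under some assumptions: the HTML string is a trimmed excerpt of the original menu, the names `HTML` and `top_level_categories` are made up for illustration, and the built-in `html.parser` is used.

```python
from bs4 import BeautifulSoup

# Trimmed excerpt of the menu markup from the question (illustrative only).
HTML = """
<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">RELOJES</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/218-daniel-wellington" title="Daniel Wellington">
Daniel Wellington
<span id="leo-cat-218" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>
"""


def top_level_categories(html):
    """Return the link text of each first-level <li> in the category menu."""
    soup = BeautifulSoup(html, "html.parser")
    menu = soup.find("div", id="categories_block_left")
    # find() returns the first <ul> in document order, which is the outer list.
    outer_list = menu.find("ul")
    # recursive=False keeps only the <li> tags that are direct children of that <ul>.
    items = outer_list.find_all("li", recursive=False)
    # li.find("a") returns the first <a> inside each item, i.e. the top-level link.
    return [li.find("a").get_text(strip=True) for li in items]


categories = top_level_categories(HTML)
print(categories)
```

Calling find_all on the outer ul rather than on the div is what keeps the nested brand entries (Adidas and so on) out of the result, since they sit one ul deeper.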

Tags: beautifulsoup, screen-scraping, python
Source: https://codeday.me/bug/20191027/1945424.html