Scraping a table with BeautifulSoup

Author: Internet

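I have a problem that I think is fairly simple. I have pages of the following type, and I want to collect the information in the last table on the page (the one in the "PROCEDURE" box, if you scroll all the way down):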

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-2&language=EN

The HTML of the table I want to scrape looks like this:

<tbody><tr class="doc_title">
<td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="left" valign="top"><img src="/img/struct/functional/arrow_title_doc.gif" alt="" align="absmiddle" border="0" height="14" width="8"> <span style="font-weight: bold;">PROCEDURE</span></td><td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="right" valign="top">
<table border="0" cellpadding="3" cellspacing="0" width="50">
<tbody><tr><td align="center"><a href="#top"><img src="/img/struct/functional/top_doc.gif" alt="" border="0" height="16" width="16"></a></td><td align="center"><img src="/img/struct/navigation/spacer.gif" alt="" border="0" height="10" width="15"></td><td align="center"><a href="#title2"><img src="/img/struct/functional/sort_up.gif" alt="" border="0" height="10" width="15"></a></td></tr></tbody></table></td></tr>

<tr class="contents" valign="top"><td colspan="2">
<p></p><table style="border-collapse: collapse; width: 481.85pt;" align="center" cellspacing="0">
<tbody><tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Title</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Mutual assistance for the recovery of claims relating to taxes, duties and other measures</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">References</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style=""><a href="http://ec.europa.eu/prelex/liste_resultats.cfm?CL=en&amp;ReqId=0&amp;DocType=COM&amp;DocYear=2009&amp;DocNum=0028">COM(2009)0028</a> – C6-0061/2009 – <a href="/oeil/FindByProcnum.do?lang=en&amp;procnum=CNS/2009/0007">2009/0007(CNS)</a></p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date of consulting Parliament</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">16.2.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee responsible</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">ECON</p>

<p style="">19.10.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee(s) asked for opinion(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Not delivering opinions</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date of decision</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">1.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">5.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Rapporteur(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date appointed</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="3">
<p style="">Theodor Dumitru Stolojan</p>

<p style="">21.7.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Discussed in committee</span></p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">10.11.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">1.12.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">21.1.2010</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date adopted</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">27.1.2010</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Result of final vote</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 12.94%;" rowspan="1" colspan="1">
<p style="">+:</p>

<p style="">–:</p>

<p style="">0:</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 48.82%;" rowspan="1" colspan="6">
<p style="">39</p>

<p style="">0</p>

<p style="">1</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Members present for the final vote</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Burkhard Balz, Sharon Bowles, Udo Bullmann, Pascal Canfin, Nikolaos Chountis, George Sabin Cutaş, Leonardo Domenici, Derk Jan Eppink, Markus Ferber, Elisa Ferreira, Vicky Ford, José Manuel García-Margallo y Marfil, Jean-Paul Gauzès, Sylvie Goulard, Enikő Győri, Liem Hoang Ngoc, Eva Joly, Othmar Karas, Wolf Klinz, Jürgen Klute, Werner Langen, Astrid Lulling, Arlene McCarthy, Ivari Padar, Alfredo Pallone, Anni Podimata, Antolín Sánchez Presedo, Olle Schmidt, Edward Scicluna, Peter Simon, Peter Skinner, Theodor Dumitru Stolojan, Ivo Strejček, Kay Swinburne, Marianne Thyssen, Ramon Tremosa i Balcells</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Substitute(s) present for the final vote</span></p>
</td>
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Marta Andreasen, Sophie Briard Auconie, David Casa, Danuta Jazłowiecka, Arturs Krišjānis Kariņš, Philippe Lamberts, Andreas Schwab</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 38.24%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 12.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 2.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 4.71%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10.58%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 5.29%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 15.3%;" rowspan="1" colspan="1"></td>
<td style="" rowspan="1" colspan="1"></td></tr>
</tbody></table>
</td></tr>
</tbody>

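The problem I face is that the table's tags have no identifiers (as far as I can tell), so I don't know how to select this table and extract the information from it. So far I have been using BeautifulSoup to get other information from the site, but I am at a loss as to how to scrape this table.

If anyone could tell me how to proceed, I would greatly appreciate it!

Kind regards,

Thomas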

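Solution:

If you are a bit clever about it, you can find elements by their other attributes. I took a shot at scraping your data; it may not be the best approach, but it should get you close.

The first thing I noticed is that you only want the data that comes after the second occurrence of the word "PROCEDURE" (the first is a link, the second is the heading). So I split the page on that: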

# html holds the page source as a string; keep everything after the second "PROCEDURE"
data = html.split("PROCEDURE", 2)[2]
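To make the split step concrete, here is a minimal sketch of how `str.split` with a maximum split count isolates everything after the n-th occurrence of a marker. The helper name `after_nth` and the sample `page` string are hypothetical, made up for illustration; unlike the one-liner above, the helper returns `None` instead of raising `IndexError` when the marker occurs fewer than n times.

```python
def after_nth(text, marker, n):
    """Return the part of text after the n-th occurrence of marker, or None."""
    parts = text.split(marker, n)  # at most n splits, so at most n + 1 pieces
    if len(parts) <= n:
        return None  # marker occurs fewer than n times
    return parts[n]


# Hypothetical stand-in for the real page source.
page = '<a href="#p">PROCEDURE</a> ... <span>PROCEDURE</span><td rowspan="1">Title</td>'

data = after_nth(page, "PROCEDURE", 2)
print(data)
```

Checking for `None` before parsing is one way to notice early that a page does not have the expected layout.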

Then I looked for the <td> tags that have rowspan=1:

import BeautifulSoup  # BeautifulSoup 3 module (Python 2), as used in this answer
bs = BeautifulSoup.BeautifulSoup(data)
tds = bs.findAll("td", { "rowspan": 1 })

Getting closer…

>>> tds[0].text
u'Title'
>>> tds[1].text
u'Mutual assistance for the recovery of claims relating to taxes, duties and other measures'
>>> tds[3].text
u'References'
>>> tds[4].text
u'COM(2009)00282009/0007(CNS)2009 a>'

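Note that I skipped index 2 in tds, because they use that cell as a spacer or something (it is empty). Anyway, it's a start. The real trick I have found with BeautifulSoup is to feed it only the region of the data you know you want to search, because then there is less to wade through. It also prides itself on accepting bad input, so don't be afraid to feed it garbage.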
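Skipping the empty spacer cells by hand can be automated. The sketch below is not the code from the answer: it assumes the newer `bs4` package (BeautifulSoup 4) with Python's built-in `html.parser`, and the `fragment` string is a hypothetical, trimmed-down stand-in for the markup after the heading.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical, trimmed-down stand-in for the fragment after the heading.
fragment = """
<table>
<tr><td rowspan="1"><p><span>Title</span></p></td>
    <td rowspan="1" colspan="7"><p>Mutual assistance</p></td>
    <td rowspan="1"></td></tr>
<tr><td rowspan="1"><p><span>Date adopted</span></p></td>
    <td rowspan="1" colspan="2"><p>27.1.2010</p></td>
    <td rowspan="1" colspan="2"><p>&nbsp;</p></td>
    <td rowspan="1"></td></tr>
</table>
"""

soup = BeautifulSoup(fragment, "html.parser")

# Collect the text of every <td rowspan="1">; strip=True also removes the
# non-breaking spaces that pad the filler cells.
cells = [td.get_text(" ", strip=True)
         for td in soup.find_all("td", attrs={"rowspan": "1"})]

# Drop spacer cells, which are empty once stripped.
cells = [c for c in cells if c]
print(cells)
```

The result is a flat list, so it still does not say which value belongs to which label when a row has several value cells.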

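I went further down the list of elements, and it is not perfect. You will need to refine the search, because they have <td> elements with values nested inside other <td>s.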
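One possible way to refine the search, offered as a sketch and not as the answerer's method: locate the "PROCEDURE" heading, step to the following `contents` row, and read the table row by row with `recursive=False` so that each row yields only its own cells. It assumes `bs4` with `html.parser`; the `html` sample is a hypothetical reduction of the markup in the question, and on the real page the first `<span>` containing exactly "PROCEDURE" may not be the heading, so the lookup would need checking.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical reduction of the markup shown in the question.
html = """
<table>
<tr class="doc_title"><td><span>PROCEDURE</span></td>
    <td><table><tr><td><a href="#top">top</a></td></tr></table></td></tr>
<tr class="contents"><td colspan="2">
  <table>
    <tr><td><p><span>Title</span></p></td><td><p>Mutual assistance</p></td><td></td></tr>
    <tr><td><p><span>Committee responsible</span></p><p>Date announced in plenary</p></td>
        <td><p>ECON</p><p>19.10.2009</p></td><td></td></tr>
  </table>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the heading <span>, the row that holds it, and the "contents" row after it.
heading = soup.find("span", string="PROCEDURE")
title_row = heading.find_parent("tr", class_="doc_title")
contents_row = title_row.find_next_sibling("tr", class_="contents")
table = contents_row.find("table")

rows = {}
for tr in table.find_all("tr"):
    # recursive=False returns only this row's direct <td> children.
    cells = tr.find_all("td", recursive=False)
    # Each cell becomes a list of its paragraph texts.
    texts = [[p.get_text(strip=True) for p in td.find_all("p")] for td in cells]
    # Drop spacer cells (no paragraphs, or only blank ones).
    texts = [t for t in texts if any(t)]
    if len(texts) >= 2:
        rows[texts[0][0]] = texts[1:]  # label -> list of value cells

print(rows)
```

Keying each row by the first paragraph of its first cell keeps multi-line values such as a committee name and its date together in one cell list.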

Tags: beautifulsoup, screen-scraping, python
Source: https://codeday.me/bug/20191210/2098070.html