
python – Parsing data-bind markup in HTML with Beautiful Soup

Author: Internet

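I'm having trouble selecting this 'div' object with Beautiful Soup and then parsing the data inside it.

First, I have to decode the HTML entities, as the tool on this site does (https://mothereff.in/html-entities).

What steps would I take to programmatically select, for example,

(extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max')

from the code below?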

<div data-bind="component: { name: &#39;product-detail&#39;, params: {hasVariants:true,name:&#39;BROOKS LOUNGE CHAIR&#39;,hasCategory:true,superCategoryName:&#39;Furniture&#39;,categoryDisplayName:&#39;Living Room&#39;,categorySlug:&#39;living-room&#39;,subcategoryDisplayName:&#39;Chairs&#39;,subcategorySlug:&#39;chairs&#39;,collection:{id:1529,name:&#39;Irondale&#39;,description:&#39;Each piece is a striking conversation-starter. Tables are made from reclaimed doors paired with salvaged architecture or old machine parts. Storage solutions are inspired by libraries of the 1940’s. Cast iron beds with linen panels as well as seating in linen, lush velvet and top-grain leather offer a distinctive found feel.&#39;,isFeatured:true,isNew:false,image:&#39;/FourHandsMarketplace/media/General/Featured%20Collections/IRONDALE.jpg?width=500&#39;,shortDescription:&#39;Moving from Parisian flea market to modern to industrial, understated elegance is a common theme. Waxed leathers and distressed irons mix with fabrics for an intriguing style blend.\r\n&#39;,uri:&#39;/collections/irondale&#39;},attributes:[{id:384,name:&#39;COVER&#39;,displayOrder:30,swatches:true,values:[{id:12710,name:&#39;EBONY&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-G6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12711,name:&#39;STONEWASH DARK GREEN&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]},{id:385,name:&#39;FINISH&#39;,displayOrder:40,swatches:true,values:[{id:12712,name:&#39;BLACK WASH WEATHERED&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-K5_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12713,name:&#39;DISTRESSED WASHED OLD OAK&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-K6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]}],products:[{attributeValueIds:[12710,12712],description:&#39;Our take on the classic Adirondack emphasizes comfort with thick, top-grain 
leather cushioning. Wire-brushed oak is finished in black and hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.75&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >88&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Black Washed Weathered&#39;,&#39;Ebony&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#
39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_3.
jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_3.jpg&#39;},{order:11,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_4.jpg&#39;}],priceHtml:&#39;$520.00&#39;,itemNumber:&#39;CIRD-72K5-G6H6&#39;,name:&#39;Brooks Lounge Chair-Ebony, Blk Wsh Weath&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false},{attributeValueIds:[12711,12713],description:&#39;Our take on the classic Adirondack emphasizes comfort with green, stonewashed cotton canvas cushioning. 
Wire-brushed oak is hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.5&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >147&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Distressed Washed Old Oak&#39;,&#39;Stonewash Dark Green&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/
CIRD-72K6-H9_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_3.jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=200&amp;height=200&amp;
mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_3.jpg&#39;}],priceHtml:&#39;$290.00&#39;,itemNumber:&#39;CIRD-72K6-H9&#39;,name:&#39;Brooks Lounge Chair-Stonewsh Drk Green&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false}],activeItemNumber:&#39;CIRD-72K5-G6H6&#39;,priceDescription:&#39;Wholesale Price&#39;} }"></div>

Solution:

It's not entirely clear where this HTML string comes from or what you want to extract from it, but for the Beautiful Soup part all you need is:

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
text = soup.div['data-bind']

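where s is the string from your question. We first get the 'div' tag and then read its 'data-bind' attribute. Beautiful Soup decodes the HTML entities in attribute values while parsing, so text already contains plain quotes and ampersands and no separate decoding step is needed.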
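If bs4 is not installed, roughly the same attribute lookup can be sketched with the standard library alone. This swaps in `html.parser.HTMLParser` for Beautiful Soup; the class name `DataBindCollector` and the shortened `sample` string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class DataBindCollector(HTMLParser):
    """Collects the data-bind attribute of every <div> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; HTMLParser has already
        # replaced character references such as &#39; in the values.
        if tag == "div":
            for name, value in attrs:
                if name == "data-bind" and value is not None:
                    self.values.append(value)


# Hypothetical, heavily shortened stand-in for the div in the question.
sample = (
    '<div data-bind="component: { name: &#39;product-detail&#39;, '
    'params: {swatchUrl:&#39;/s3/a.jpg?width=200&amp;height=200&#39;} }"></div>'
)

collector = DataBindCollector()
collector.feed(sample)
collector.close()
text = collector.values[0]
print(text)
```

As with `soup.div['data-bind']`, the resulting `text` holds literal quotes and ampersands rather than `&#39;` and `&amp;`.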

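The format confuses me: it looks like JSON, and it looks like a Python dictionary, but neither of those parsers accepts the input. I guess it's JavaScript? I wrote a quick-and-dirty brace-counting loop, inspired by another question: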

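nest_lvl = 0
lvl_string = []          # lvl_string[n] collects the text seen at nesting depth n
for char in text:
    if char == '{':
        nest_lvl += 1    # the opening brace belongs to the new, deeper level

    try:
        lvl_string[nest_lvl] += char
    except IndexError:   # first time this depth is reached
        lvl_string.append(char)

    if char == '}':
        # a block just closed: print it, reset its buffer, then step back out
        print(nest_lvl, lvl_string[nest_lvl])
        lvl_string[nest_lvl] = ''
        nest_lvl -= 1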
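The remark that neither the JSON parser nor the Python literal parser accepts this format can be checked on a small piece of it. The sketch below is only an illustration: `fragment` is a made-up snippet in the same style as the attribute value, and `try_parse` is a hypothetical helper.

```python
import ast
import json

# Hypothetical fragment in the same style as the decoded attribute value.
fragment = "{hasVariants:true,name:'BROOKS LOUNGE CHAIR'}"


def try_parse(parser, source):
    """Return the parsed value, or the exception the parser raised."""
    try:
        return parser(source)
    except (ValueError, SyntaxError) as exc:
        return exc


# json.loads rejects unquoted keys and single-quoted strings.
json_result = try_parse(json.loads, fragment)
# ast.literal_eval rejects bare names such as hasVariants and true.
literal_result = try_parse(ast.literal_eval, fragment)

print(type(json_result).__name__)
print(type(literal_result).__name__)
```

Both calls end in an exception rather than a dictionary, which is why some hand-written scanning of the string is needed.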

Hopefully this gets you started. Again, the parsing part really depends on how general the parser needs to be and on what you want to extract.
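If only one kind of value is needed, such as the `extraLarge` URL mentioned in the question, a regular expression may be enough. This is a different technique from the brace counter, shown as a minimal sketch: `raw` is a shortened stand-in for the attribute value, and the pattern assumes the URLs never contain a single quote.

```python
import html
import re

# Hypothetical, shortened stand-in for the raw attribute value in the question.
raw = (
    "images:[{order:8,"
    "thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,"
    "large:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,"
    "extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;}]"
)

# Only needed when starting from raw markup; a value read through
# Beautiful Soup (soup.div['data-bind']) is already decoded.
text = html.unescape(raw)

# Match  extraLarge:'<anything up to the closing quote>'  and capture the URL.
pattern = re.compile(r"extraLarge:'([^']*)'")
urls = pattern.findall(text)

# Keep only the image the question asks about.
wanted = [u for u in urls if "CIRD-72K6-H9_SID_1" in u]
print(wanted)
```

On the full attribute value, `urls` would list every `extraLarge` entry across all products, so the filter on the file name is what narrows it down to the single image.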

Tags: json, python, html-entities, beautifulsoup
Source: https://codeday.me/bug/20190711/1430074.html