python - Extracting specific data with BeautifulSoup

Question

I want to extract some data from this snippet:

<div id="information_content">
    <b>Name:</b> file.rar <br>
    <b>Date Modified:</b> 2 days ago <br>
    <b>Size:</b> 212.19 MB <br>
    <b>Type:</b> Archive <br>
    <b>Permissions:</b> Public </div>
</div>

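I only want to extract 212.19 MB.

I have already pulled out the snippet with soup.find('div', attrs={'id': 'information_content'}), but I don't know how to drill down further to get what I need.

Can anyone help?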

score 0 · Accepted Answer

As already mentioned, if these divs always have the same structure, the size will be in the third string after splitting on <br>.

>>> x = '<div id="information_content"> <b>Name:</b> file.rar <br> <b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> <b>Type:</b> Archive <br> <b>Permissions:</b> Public </div> </div>'
>>> x.split('<br>')[2]
' <b>Size:</b> 212.19 MB '

From there you can use a regular expression to get the part you need. For example, this pattern matches all values in this format:

\d+\.\d\d\s.B

It matches values such as 10.00 kB and 1000.34 TB.
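As a rough sketch of the two steps above put together (split on <br>, then apply the pattern with the decimal point escaped), assuming the markup always has the layout shown in the question. The variable names are only illustrative:

```python
import re

# The snippet from the question as a plain string (layout assumed to be fixed).
html = (
    '<div id="information_content"> <b>Name:</b> file.rar <br> '
    '<b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> '
    '<b>Type:</b> Archive <br> <b>Permissions:</b> Public </div>'
)

# The third chunk after splitting on <br> holds the size line.
size_chunk = html.split('<br>')[2]

# Digits, a literal dot, two decimals, whitespace, one unit-prefix character, then "B".
size_pattern = re.compile(r'\d+\.\d\d\s.B')

match = size_pattern.search(size_chunk)
size = match.group(0) if match else None
print(size)  # 212.19 MB
```

Note that the pattern requires exactly two decimal places, so a value like "212 MB" or "1.5 GB" would not match; it would need loosening if the site ever prints sizes that way.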

score 0 · Accepted Answer

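If the DIV always has the same structure, you can do this with BeautifulSoup alone. Once you have extracted the DIV, build a new list from its text, split on "\n". Then simply pick the right element from the list.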
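The approach described above might look something like this. It is a minimal sketch that assumes the markup from the question (one field per line) and uses the built-in "html.parser" backend; the variable names are illustrative. The last two lines show an alternative that is not part of this answer: looking up the <b>Size:</b> label directly and reading the text that follows it, which does not depend on line position.

```python
from bs4 import BeautifulSoup

# The snippet from the question; each field is assumed to sit on its own line.
html = """<div id="information_content">
    <b>Name:</b> file.rar <br>
    <b>Date Modified:</b> 2 days ago <br>
    <b>Size:</b> 212.19 MB <br>
    <b>Type:</b> Archive <br>
    <b>Permissions:</b> Public </div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', attrs={'id': 'information_content'})

# Split the div's text on newlines and drop empty lines.
lines = [line.strip() for line in div.get_text().split('\n') if line.strip()]

# The size is the third entry; strip the "Size:" label from it.
size = lines[2].split(':', 1)[1].strip()
print(size)  # 212.19 MB

# Alternative (not from the answer): find the label and take the text right after it.
size_label = soup.find('b', string='Size:')
size_via_sibling = size_label.next_sibling.strip()
```

The list-based version breaks if a field is added or removed above "Size:", whereas the label lookup only depends on the label text staying the same.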

I did something similar, and I explain everything I did here: Python and BeautifulSoup: extracting prizes from the Quiniela - http://www.manejandodatos.es/2014/2/python-beautifulsoup-extracting-prizes-quiniela

I hope it helps!

score 0 · Accepted Answer


Since BeautifulSoup does not support XPath, the best approach is to use lxml.
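The answer does not give an expression, so the following is only one possible sketch of the lxml route. It assumes the markup from the question; the XPath and variable names are my own illustration, not something the answer specifies.

```python
from lxml import html as lxml_html

# The snippet from the question.
snippet = """<div id="information_content">
    <b>Name:</b> file.rar <br>
    <b>Date Modified:</b> 2 days ago <br>
    <b>Size:</b> 212.19 MB <br>
    <b>Type:</b> Archive <br>
    <b>Permissions:</b> Public </div>"""

root = lxml_html.fromstring(snippet)

# Select the first text node that follows the <b>Size:</b> label inside the div.
result = root.xpath(
    '//div[@id="information_content"]'
    '/b[text()="Size:"]/following-sibling::text()[1]'
)

size = result[0].strip() if result else None
print(size)  # 212.19 MB
```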

answered 2014-02-13T11:15:35.647
