python - Extracting multi-line data from an HTML site with Python

Question

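I've had great success extracting data as long as what I'm matching fits on a single line. As soon as it spans more than one line, I run into trouble. Here is a snippet of the HTML I'm getting: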

<tr>
<td width=20%>3 month
<td width=1% class=bar>
&nbsp;
<td width=1% nowrap class="value chg">+10.03%
<td width=54% class=bar>
<table width=100% cellpadding=0 cellspacing=0 class=barChart>
<tr>

I'm interested in the "+10.03%" figure, and

<td width=20%>3 month

is the pattern that tells me "+10.03%" is the value I want.

So far I have this in Python:

percent = re.search('<td width=20%>3 month\r\n<td width=1% class=bar>\r\n&nbsp;\r\n<td width=1% nowrap class="value chg">(.*?)', content)

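where the variable content holds all the HTML I'm searching. This doesn't seem to work for me. Any suggestions would be much appreciated! I've read several other posts about re.compile() and re.multiline(), but I haven't had any luck with them, mostly because I don't understand how they work.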

score 0 · Accepted Answer

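Thanks for all the help! You pointed me in the right direction, and this is how I got my code working with BeautifulSoup. I noticed that all the data I wanted was under a class called "value chg", and that my data was always the 3rd and 5th elements in that search, so this is what I did: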

from BeautifulSoup import BeautifulSoup
import urllib

content = urllib.urlopen(url).read()
soup = BeautifulSoup(''.join(content))

td_list = soup.findAll('td', {'class':'value chg'} )

mon3 = td_list[2].text.encode('ascii','ignore')
yr1 = td_list[4].text.encode('ascii','ignore')

Again, "content" is the HTML I downloaded.
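The answer above targets Python 2 and the old BeautifulSoup 3 package. As a rough, dependency-free sketch of the same idea (collect every td whose class is "value chg", then index into the list), the standard library's html.parser can be used on Python 3. The class name ValueChgCollector and the helper extract_value_chg are hypothetical names introduced for illustration, and the sample string is only the fragment from the question, so it contains a single matching cell rather than the five the answer indexes into.

```python
from html.parser import HTMLParser


class ValueChgCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute is exactly 'value chg'."""

    def __init__(self):
        super().__init__()
        self.values = []          # one string per matching cell
        self._capturing = False   # True while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        # The page omits </td>, so any new start tag ends the current cell.
        # Simplification: inline tags inside a cell would also end capture.
        self._capturing = tag == "td" and dict(attrs).get("class") == "value chg"
        if self._capturing:
            self.values.append("")

    def handle_data(self, data):
        if self._capturing:
            self.values[-1] += data


def extract_value_chg(html):
    """Return the stripped text of all 'value chg' cells, in document order."""
    parser = ValueChgCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return [value.strip() for value in parser.values]


# Fragment taken from the question (assumed representative of the page).
sample = ('<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
          '<td width=1% nowrap class="value chg">+10.03%\n'
          '<td width=54% class=bar>\n'
          '<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>')

cells = extract_value_chg(sample)
print(cells[0])
```

On the full page, the answer's td_list[2] and td_list[4] would correspond to cells[2] and cells[4], assuming the page layout is as the answer describes.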

score 0 · Accepted Answer

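You can add the "multiline" regex switch (?m), although it only changes how ^ and $ behave. What lets this pattern span several lines is \s*, which matches the line breaks between the tags. You can pull out the target directly with findall and take the first match with findall(regex, content)[0]: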

percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]

Because \s* is used to match the line breaks, the regex works with both Unix and Windows line terminators.
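For contrast, here is a small sketch of two likely reasons the pattern in the question returns nothing useful. The pattern string is the one from the question; the two sample strings are assumptions built from the question's fragment, one with \n line endings and one with \r\n.

```python
import re

# The question's pattern, unchanged: literal \r\n separators and a trailing lazy group.
original = ('<td width=20%>3 month\r\n<td width=1% class=bar>\r\n&nbsp;\r\n'
            '<td width=1% nowrap class="value chg">(.*?)')

# Assumed sample input with Unix line endings, and the same text with Windows ones.
lf_html = ('<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
           '<td width=1% nowrap class="value chg">+10.03%\n')
crlf_html = lf_html.replace('\n', '\r\n')

# 1. If the page uses \n only, the hard-coded \r\n never matches.
no_match = re.search(original, lf_html)
print(no_match)

# 2. Even when the line endings match, a lazy group at the very end of a
#    pattern is satisfied by zero characters, so it captures an empty string.
empty_capture = re.search(original, crlf_html).group(1)
print(repr(empty_capture))
```

Replacing the literal separators with \s* and the trailing (.*?) with (\S+) addresses both points.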

The following test code runs the findall pattern against the sample HTML:

import re
content = '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'        
percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]
print(percent)

Output:

+10.03%
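As a variation on the test code, the sketch below drops (?m), compiles the pattern once, and returns None instead of raising IndexError when nothing matches. The function name three_month_change and the two sample strings are hypothetical and only there for illustration.

```python
import re

# Same pattern as above without (?m); \s* alone bridges the line breaks.
PATTERN = re.compile(
    r'<td width=20%>3 month\s*'
    r'<td width=1% class=bar>\s*&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)


def three_month_change(html):
    """Return the captured percentage string, or None if the pattern is absent."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


unix_html = ('<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
             '<td width=1% nowrap class="value chg">+10.03%\n'
             '<td width=54% class=bar>')
windows_html = unix_html.replace('\n', '\r\n')

print(three_month_change(unix_html))
print(three_month_change(windows_html))
print(three_month_change('<p>no table here</p>'))
```

Using search with a None check avoids the IndexError that findall(...)[0] raises on an empty result.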
