python - Regex or other method to return these values (from BeautifulSoup)

Question

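With a lot of help from the Stack Overflow community, I have learned a great deal about Python, and in particular about scraping with BeautifulSoup. I am once again referring to the same example page I have been learning from.

I have the following code: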

from bs4 import BeautifulSoup
import re

f = open('webpage.txt', 'r')
g = f.read()
soup = BeautifulSoup(g)

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        print ic_next_td

Run against this page:

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

saved as webpage.txt, it gives me the following results:

ACN
007 350 807
ABN
71 007 350 807
Annual Turnover
$5M - $10M
Number of Employees
6-10
QA
ISO9001-2008, AS9120B,
Export Percentage
5 %
Industry Categories
<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>Logistics<br/>Marine<br/>Procurement<br/></td>
Company Email
lisa@aerospacematerials.com.au
Company Website
http://www.aerospacematerials.com.au
Office
2/6 Ovata Drive Tullamarine VIC 3043
Post
PO Box 188 TullamarineVIC 3043
Phone
+61.3. 9464 4455
Fax
+61.3. 9464 4422

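So far, so good. I will look at writing this to CSV or something similar later, but for now I would like to know how to break the data contained in <td class="bodytext">Aerospace Land (Vehicles, etc) Logistics Marine Procurement </td> into separate lines.

Like this: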

Industry Categories
Aerospace
Land (Vehicles, etc)
Logistics
Marine
Procurement

I tried some regular expressions, such as:

if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        value = re.findall('\>(.*?)\<', ic_next_td)
        print value[0]

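But I get TypeError: expected string or buffer. I suppose I also need to iterate over the findall result or something.

The approach needs to be generic enough to handle other variations of the same format, for example "Donkeys" or "Boats" instead of "Aerospace" or "Logistics" (I won't necessarily know all the possible values in advance in the scenario I have in mind).

Is there a way to use the br tags with BeautifulSoup or a regex to solve this?

Sorry this is a bit long. As always, I am also very happy to receive any suggested code optimisations, so I can keep learning the best way to structure Python scripts properly.

Thanks!

Update

This code works: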

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        for value in ic_next_td.strings:
                print value

This code produces an indentation error:

for heading in soup.find_all("td", class_="paraheading"):
    key = " ".join(heading.text.split()).rstrip(":")
    if key in columns:
        print key
        next_td = heading.find_next_sibling("td", class_="bodytext")
        value = " ".join(next_td.text.split())
        print value
    if key == "Industry Categories":
        print key
        ic_next_td = heading.find_next_sibling("td", class_="bodytext")
        for value in ic_next_td.strings:
            print value

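Note the seemingly double indentation of print value in the working code. It seems to me that the next level of indentation after for value in ic_next_td.strings: should be a single indent?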

score 3 · Accepted Answer

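You will have to parse the contents of ic_next_td further. Luckily, the original page uses <br/> tags, which give you the places at which to split the text. Don't use a regular expression here; BeautifulSoup gives you better tools: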

for value in ic_next_td.strings:
    print value

This results in:

Aerospace
Land (Vehicles, etc)
Logistics
Marine
Procurement
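
To try this without the saved page, here is a minimal self-contained sketch. It assumes the `<td>` markup printed in the question, parses it with the built-in "html.parser" backend (an assumption; the question does not name a parser), and uses Python 3's print() rather than the Python 2 print statement from the question. The names html, td and categories are illustrative only.

```python
from bs4 import BeautifulSoup

# Markup copied from the question's output for "Industry Categories".
html = ('<td class="bodytext">Aerospace<br/>Land (Vehicles, etc)<br/>'
        'Logistics<br/>Marine<br/>Procurement<br/></td>')

# "html.parser" is an assumed choice; it keeps the bare <td> as-is.
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="bodytext")

categories = []
for value in td.strings:
    # Each text node between the <br/> tags is yielded separately.
    categories.append(value.strip())
    print(value)
```

Because the loop never looks at the text itself, it should work equally well for other category names.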

You can store all of these in a list by calling list() on the .strings iterator:

values = list(ic_next_td.strings)
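
As for the TypeError in the question, it most likely arises because re.findall only accepts strings, while find_next_sibling returns a Tag object. The sketch below illustrates this under stated assumptions: FakeTag is a hypothetical stand-in for a BeautifulSoup Tag (so the snippet runs without bs4), the markup is shortened, and the syntax is Python 3, where the message reads slightly differently from the Python 2 one quoted in the question.

```python
import re


class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag: not a string, but str() gives markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


tag = FakeTag('<td class="bodytext">Aerospace<br/>Logistics<br/></td>')

error = None
try:
    # Passing a non-string object is what triggers the TypeError.
    re.findall(r'>(.*?)<', tag)
except TypeError as exc:
    error = exc

# Converting to str first lets the regex run; empty matches are dropped.
pieces = [p for p in re.findall(r'>(.*?)<', str(tag)) if p]
print(pieces)
```

Converting to a string makes the regex run, but it stays fragile against markup changes, which is why iterating over .strings is the preferable route.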
