
I have an .sgm file with the following format:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><BODY>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</BODY></TEXT>
</REUTERS>

The same file contains 1,000 records whose root node is REUTERS. I want to extract the body tag from each record and do some processing on it, but I can't get it to work. Here is my code:

from bs4 import BeautifulSoup,SoupStrainer
f = open('dataset/reut2-001.sgm', 'r')
data= f.read()
soup = BeautifulSoup(data)
topics= soup.findAll('body') # find all body tags
print len(topics)  # print number of body tags in sgm file
i=0
for link in topics:         #loop through each body tag and print its content 
    children = link.findChildren()
    for child in children:
        if i==0:
            print child
        else:
            print "none"
            i=i+1

print i

The problem is that the for loop does not print the contents of the body tags; it prints the whole records instead.


2 Answers


As I said in the comments, for reasons unknown (to me), you should not name a tag body. (HTML parsers treat <body> as the HTML document's body element, so matching on it can pick up far more than the record text.)

So, step one: replace the body tag name with something else, for example content:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><CONTENT>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</CONTENT></TEXT>
</REUTERS>

Here is the code:

from bs4 import BeautifulSoup,SoupStrainer
f = open('dataset/reut2-001.sgm', 'r')
data= f.read()
soup = BeautifulSoup(data)
contents = soup.findAll('content')
for content in contents:
    print content.text
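Editing 1,000 records by hand is impractical, so the rename can also be done in memory before the data is handed to BeautifulSoup. A minimal sketch using the standard library's re module (the rename_body_tags helper and the sample string are illustrative, not from the original post):

```python
import re

def rename_body_tags(data):
    """Rename SGML <BODY>...</BODY> tags to <CONTENT>...</CONTENT>
    so an HTML parser will not confuse them with the HTML body."""
    data = re.sub(r'<BODY>', '<CONTENT>', data, flags=re.IGNORECASE)
    data = re.sub(r'</BODY>', '</CONTENT>', data, flags=re.IGNORECASE)
    return data

sample = '<TEXT><BODY>Sandoz AG said it planned a joint venture</BODY></TEXT>'
print(rename_body_tags(sample))
# -> <TEXT><CONTENT>Sandoz AG said it planned a joint venture</CONTENT></TEXT>
```

The result can then be passed to BeautifulSoup and queried with findAll('content') exactly as above.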
answered 2013-04-08T16:34:09.690

You just need the right parser (xml / lxml / html.parser / etc.). I ran into the same problem when extracting HTML tags from an SGML file with the 'lxml' parser, and solved it by switching to 'html.parser'.

Before:

soup = BeautifulSoup(file_content, 'lxml')

After / solution:

soup = BeautifulSoup(file_content, 'html.parser')
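Applied to the Reuters data, the parser switch can be sketched like this (the sgml string is a cut-down version of the sample record above; with 'html.parser' the custom <BODY> tag stays where it is, so it can be matched directly without any renaming):

```python
from bs4 import BeautifulSoup

# A cut-down Reuters record; the real file holds 1000 of these.
sgml = '''<REUTERS NEWID="1001">
<TEXT>
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<BODY>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
</BODY></TEXT>
</REUTERS>'''

# html.parser does not rebuild the tree around a <body> element,
# so each record's body tag survives and can be found directly.
soup = BeautifulSoup(sgml, 'html.parser')
bodies = soup.find_all('body')
for body in bodies:
    print(body.text)
```

With 'lxml' instead, the parser restructures the document around a single HTML <body> element, which is why the original code printed whole records.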

Reference from the documentation:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.

answered 2021-11-19T12:46:09.040