I'm new to Python, and I'm trying to use the following code to print every token identified as a location in an .xml file to a .txt file:

from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one.
with open('exercise-ner.xml', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

tokenlist = soup.find_all('token')

output = ''

for x in tokenlist:
    # Keep the word only when the token's NER tag contains LOCATION.
    # get_text() returns text, so the check works on Python 2 and 3.
    if 'LOCATION' in x.ner.get_text():
        output += "\n%s" % x.word.get_text()

with open('exercise-places.txt', 'w') as z:
    z.write(output)

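The program runs and writes out a list of all tokens that are locations, each on its own line in the output file. What I'd like to do, though, is modify the program so that when Beautiful Soup finds two or more adjacent tokens identified as locations, it prints those tokens on the same line of the output file. Does anyone know how I could modify my code to accomplish this? I'd be grateful for any suggestions.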

1 Answer

This question is old, but I just received a comment from @Amanda, so I thought I'd post my approach to the task in case it helps others:

import glob, codecs
from bs4 import BeautifulSoup

with codecs.open("washington_locations.txt", "w", "utf-8") as out:
    for i in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []
        # Words of the location run currently being read; reset for each file.
        current = []

        with codecs.open(i, 'r', 'utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            for token in soup.find_all('token'):
                if token.ner.string == "LOCATION":
                    current.append(token.word.string)
                elif current:
                    # A non-location token closes the current run.
                    locations.append(u" ".join(current))
                    current = []

        # Flush a run that extends to the last token of the file.
        if current:
            locations.append(u" ".join(current))

        # One line per input file: the path, then each location, tab-separated.
        out.write(i + "\t" + "\t".join(locations) + "\n")
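
For the one-location-per-line output the question asks for, the grouping step can also be written on its own. The following is only a minimal sketch, not part of the code above: it swaps the manual run tracking for `itertools.groupby`, the function name `group_adjacent` is hypothetical, and the sample tokens are made up to stand in for parsed `<token>` elements.

```python
from itertools import groupby


def group_adjacent(pairs, label="LOCATION"):
    """Join the words of each run of consecutive tokens tagged `label`.

    `pairs` is an iterable of (word, ner_tag) tuples in document order.
    """
    runs = []
    # groupby starts a new group whenever the key changes, so each group
    # is a maximal run of tokens that either all match or all do not.
    for is_match, group in groupby(pairs, key=lambda p: p[1] == label):
        if is_match:
            runs.append(" ".join(word for word, _ in group))
    return runs


# Hypothetical tokens standing in for parsed <token> elements.
sample = [
    ("He", "O"), ("left", "O"),
    ("New", "LOCATION"), ("York", "LOCATION"),
    ("for", "O"),
    ("Boston", "LOCATION"),
]

# One location per line, as the question asks.
print("\n".join(group_adjacent(sample)))
```

With a parsed `soup` as above, the pairs could be built with something like `[(t.word.string, t.ner.string) for t in soup.find_all('token')]`, and the joined result written to the output file instead of printed.
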
answered 2015-03-25T22:58:12.483