I have an XML file with the following structure, where each sentence element contains several instance elements:
<corpus>
  <text>
    <sentence>
      <instance/>
      <instance/>
      <instance/>
    </sentence>
  </text>
</corpus>
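How can I extract the whole sentence, together with all the instances inside it?
When I try sentence.text, it only gives me the words before the first instance; sentence.find('instance').text only gives me the string of the first instance; and sentence.find('instance').tail only gives me the words that come after the first instance and before the next one.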
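To make that behaviour concrete, here is a minimal sketch on a made-up one-line sentence (the sample string and the variable names are illustrative assumptions, not taken from my file). It shows where ElementTree keeps each fragment: the parent's text holds what precedes the first child, and each child's tail holds what follows that child.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sentence, not taken from the real corpus.
sample = "<sentence>A <instance>b</instance> c <instance>d</instance> e</sentence>"
sentence = ET.fromstring(sample)

head = sentence.text                # text before the first child element
first = sentence.find("instance")   # find() returns only the first match
# Each child carries its own text plus the text that follows it (its tail).
pieces = [(child.text, child.tail) for child in sentence]

print(repr(head), repr(first.text), repr(first.tail))
print(pieces)
```

So the remaining words are not lost; they sit in the tail of the later instance elements, which a single find() call never reaches.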
I tried the following, since I would prefer to stick with plain elementtree:
import xml.etree.ElementTree as et
input = '''<corpus lang="en">
<text id="d001">
<sentence id="d001.s001">
Your
Oct
.
6
<instance id="d001.s001.t001" lemma="editorial" pos="n">editorial</instance>
``
The
<instance id="d001.s001.t002" lemma="Ill" pos="a">Ill</instance>
<instance id="d001.s001.t003" lemma="Homeless" pos="n">Homeless</instance>
''
<instance id="d001.s001.t004" lemma="refer" pos="v">referred</instance>
to
<instance id="d001.s001.t005" lemma="research" pos="n">research</instance>
by
us
and
<instance id="d001.s001.t006" lemma="six" pos="a">six</instance>
of
our
<instance id="d001.s001.t007" lemma="colleague" pos="n">colleagues</instance>
that
was
<instance id="d001.s001.t008" lemma="report" pos="v">reported</instance>
in
the
Sept
.
8
<instance id="d001.s001.t009" lemma="issue" pos="n">issue</instance>
of
the
Journal
of
the
American
Medical
Association
.
</sentence>
</text>
</corpus>'''
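# Write the sample to a temporary file, then parse it back.
with open('tempfile', 'w') as f:
    f.write(input)
corpus = et.parse('tempfile').getroot()
for text in corpus:
    for sentence in text:
        before1st = sentence.text                      # words before the first instance
        instance1st = sentence.find('instance').text   # text of the first instance only
        after1st = sentence.find('instance').tail      # words up to the next instance
        print((before1st + instance1st + after1st).replace("\n", " ").strip())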
The code above only outputs:
Your Oct . 6 editorial `` The
The desired output is the complete sentence:
Your Oct . 6 editorial `` The Ill Homeless '' referred to research by us and six of our colleagues that was reported in the Sept . 8 issue of the Journal of the American Medical Association .
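One direction that might work, sketched here on a shortened, hypothetical sample in the same layout as my corpus: Element.itertext() walks an element and yields its own text, every descendant's text, and every tail in document order, so joining those fragments should rebuild the sentence. The helper name sentence_text is my own, and I am assuming whitespace can simply be collapsed.

```python
import xml.etree.ElementTree as ET

def sentence_text(sentence):
    """Join every text fragment under `sentence`, in document order."""
    # itertext() yields the element's text, each descendant's text, and each tail.
    # split() with no arguments drops newlines and collapses whitespace runs.
    return " ".join("".join(sentence.itertext()).split())

# Hypothetical shortened sample, same layout as the corpus above.
sample = """<corpus><text><sentence id="s1">
Your
<instance lemma="editorial">editorial</instance>
was
<instance lemma="report">reported</instance>
.
</sentence></text></corpus>"""

corpus = ET.fromstring(sample)
for sentence in corpus.iter("sentence"):
    print(sentence_text(sentence))
    # The instances themselves are still reachable as elements.
    print([inst.get("lemma") for inst in sentence.iter("instance")])
```

This keeps the instance elements available for their attributes (id, lemma, pos) while the plain text comes from one pass over the sentence.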