
I am trying to parse some .nxml files in Python using ElementTree. They are structured like this:

<body>
    <sec>
        <title>INTRODUCTION</title>
        <p>Experimentation with substances usually takes place during adolescence [<xref ref-type="bibr" rid="b1">1</xref>]. Adolescents are highly vulnerable to social influences [<xref ref-type="bibr" rid="b2">2</xref>], have lower tolerance levels and become dependent at lower doses than adults [<xref ref-type="bibr" rid="b3">3</xref>]. Adolescent-onset substance abuse is characterized by more rapid development of multiple drug dependencies and more severe psychopathology [<xref ref-type="bibr" rid="b4">4</xref>]. However, the majority of adolescents who experiment with substances do not become problem users. A better understanding is needed of the factors underlying initiation of substance use in adolescence versus heavy use and problem use. Specifically, if the liability to progress to heavier substance use is influenced by processes other than those that influence initiation, then primary prevention/intervention programmes can be only partly effective. It may be more successful, in terms of both cost and impact, to target those factors implicated in the progression to heavy/problem use. However, if the underlying liabilities to initiation and progression were strongly related, interventions could be tailored to both behaviours.</p>

Specifically, I am trying to extract the text between the <p> </p> tags. However, the [<xref> </xref>] elements inside the text are interrupting the parsing.

I have tried using

for sec in body:
    for p in sec:
        for e in p:
           e.remove (xref)

but the element is not recognized. Any ideas?


2 Answers


This is more likely to work:

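# findall('xref') matches only direct children, so select each parent of an xref instead
for parent in body.findall('.//xref/..'):
    for xref in parent.findall('xref'):
        parent.remove(xref)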

To stay closer to what you have been doing, try:

for sec in body.findall('sec'):
    for p in sec.findall('p'):
        for e in p.findall('xref'):
           p.remove(e)
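
One caveat with removing the elements: ElementTree stores the text that follows a closing </xref> on that element as its tail, so removing the element discards that text too. Below is a minimal sketch of two ways around this. It assumes a shortened stand-in sample (SAMPLE) and two hypothetical helpers, paragraph_texts and remove_keeping_tail, none of which come from the original answer. If only the paragraph text is needed, itertext() may be enough without removing anything.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the same shape as the question's .nxml body.
SAMPLE = (
    "<body><sec><title>INTRODUCTION</title>"
    '<p>Adolescence [<xref ref-type="bibr" rid="b1">1</xref>]. '
    'Influences [<xref ref-type="bibr" rid="b2">2</xref>].</p>'
    "</sec></body>"
)


def paragraph_texts(body):
    """Return the full text of every <p>, including text inside child tags."""
    return ["".join(p.itertext()) for p in body.iter("p")]


def remove_keeping_tail(root, tag):
    """Remove every `tag` element under root but keep the text that follows it."""
    for parent in list(root.iter()):
        previous = None  # last child of this parent that was kept
        for child in list(parent):
            if child.tag != tag:
                previous = child
                continue
            tail = child.tail or ""
            if previous is None:
                # No kept sibling before it: the tail joins the parent's text.
                parent.text = (parent.text or "") + tail
            else:
                # Otherwise the tail joins the previous kept sibling's tail.
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)


body = ET.fromstring(SAMPLE)
with_refs = paragraph_texts(body)      # citation numbers still present
remove_keeping_tail(body, "xref")
without_refs = paragraph_texts(body)   # xref elements gone, surrounding text kept
```

The first helper leaves the tree untouched and keeps the citation numbers in the text. The second drops the xref elements but leaves the empty brackets behind, which would still need cleaning up if they are unwanted.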
answered 2013-10-11T14:28:37.693

Actually, I scrapped the whole thing and used BeautifulSoup to strip all the tags. Worked a treat. Can't believe I was such a dunce.

answered 2013-10-12T08:16:38.307