0

I have text in the following format:

        <cast_member billing="top">
            <display_name>Elijah Wood</display_name>
            <character_name>#9 (voice)</character_name>
            <locales>
                <locale name="ko-KR">
                    <display_name>일라이자 우드</display_name>
                </locale>
                <locale name="cmn-Hant">
                    <display_name>伊利亞伍德</display_name>
                </locale>
            </locales>
        </cast_member>
        <cast_member billing="top">
            <display_name>Peter Pan</display_name>
            <character_name>#8 (voice)</character_name>
        </cast_member>

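How would I remove the <locales> tag and everything inside it, whenever the tag is present? For the input above, the result would look like this: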

        <cast_member billing="top">
            <display_name>Elijah Wood</display_name>
            <character_name>#9 (voice)</character_name>
        </cast_member>
        <cast_member billing="top">
            <display_name>Peter Pan</display_name>
            <character_name>#8 (voice)</character_name>
        </cast_member>

4 Answers

1

Never use regular expressions to parse HTML or XML. Use the excellent lxml library instead.
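
As a rough sketch of the parser-based approach: the example below uses Python's built-in xml.etree.ElementTree rather than lxml, so that it runs without extra dependencies. The <cast> wrapper element and the strip_locales name are assumptions made for illustration; neither appears in the question.

```python
import xml.etree.ElementTree as ET

def strip_locales(fragment):
    """Return the parsed fragment with every <locales> element removed."""
    # The sample has several top-level <cast_member> elements, so wrap it in
    # a hypothetical <cast> root to make it well-formed XML.
    root = ET.fromstring("<cast>" + fragment + "</cast>")
    # ElementTree elements have no parent pointer, so visit every element
    # and remove any direct <locales> children from it.
    for parent in list(root.iter()):
        for child in parent.findall("locales"):
            parent.remove(child)
    return root

sample = """
<cast_member billing="top">
    <display_name>Elijah Wood</display_name>
    <character_name>#9 (voice)</character_name>
    <locales>
        <locale name="ko-KR">
            <display_name>일라이자 우드</display_name>
        </locale>
    </locales>
</cast_member>
<cast_member billing="top">
    <display_name>Peter Pan</display_name>
    <character_name>#8 (voice)</character_name>
</cast_member>
"""

cleaned = strip_locales(sample)
print(ET.tostring(cleaned, encoding="unicode"))
```

Removing an element this way also drops the whitespace that followed it, so the indentation of the serialized output may differ slightly from the input.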

Answered 2012-08-10T20:48:47.020
1

This will do the job in pure Python without regex, but it may disturb the indentation and/or leave blank lines where text has been cut out:

<cast_member billing="top">
    <display_name>Elijah Wood</display_name>
    <character_name>#9 (voice)</character_name>

</cast_member>
<cast_member billing="top">
    <display_name>Peter Pan</display_name>
    <character_name>#8 (voice)</character_name>
</cast_member>

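Here's the code:

with open('data') as f:
    text = f.read()

oTag = "<locales>"
cTag = "</locales>"

newText = ''
p = 0  # start of the text that has not been copied yet
s = text.find(oTag, p)
while s > -1:
    e = text.find(cTag, s)
    if e == -1:
        # No closing tag: stop here so the remaining text is kept unchanged
        break
    newText += text[p:s]  # keep everything before the opening tag
    p = e + len(cTag)     # resume just after the closing tag
    s = text.find(oTag, p)
newText += text[p:]

print(newText, end='')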
Answered 2012-08-10T22:00:52.227
0

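You can use a regular expression with the regex replace function:

"string".replace(/s/, '') -> "tring"

You can build a regular expression that looks like this: /<locales>(\s+.+){0,}<\/locales>/ -> this matches the opening and closing locales tags, along with anything in between.

See it in action at http://rubular.com/r/WTfo0b2bet

myXMLstring.replace(/<locales>(\s+.+){0,}<\/locales>/, '')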
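
For comparison, here is a minimal Python sketch of the same regex idea using re.sub. The pattern is not the one from this answer: it is an assumed variant with a non-greedy match, so that each <locales> block is removed separately instead of everything up to the last closing tag. It assumes <locales> elements are never nested and carry no attributes; the remove_locales name is hypothetical.

```python
import re

# Non-greedy match from <locales> to the nearest </locales>; re.DOTALL lets
# "." span newlines. The surrounding indentation and the trailing newline
# are consumed as well, so no blank line is left behind.
LOCALES_RE = re.compile(r"[ \t]*<locales>.*?</locales>[ \t]*\n?", re.DOTALL)

def remove_locales(xml_text):
    """Delete every <locales>...</locales> block from the text."""
    return LOCALES_RE.sub("", xml_text)

sample = (
    '<cast_member billing="top">\n'
    '    <display_name>Elijah Wood</display_name>\n'
    '    <locales>\n'
    '        <locale name="ko-KR">\n'
    '            <display_name>x</display_name>\n'
    '        </locale>\n'
    '    </locales>\n'
    '</cast_member>\n'
)

result = remove_locales(sample)
print(result, end="")
```

Text without a <locales> block passes through unchanged, which covers the case where the tag is absent.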

Answered 2012-08-10T20:46:51.520
0

Here is what I ended up doing, using lxml:

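# node is the parsed lxml document (assumed; it is not defined in this snippet)
cast_name = node.xpath("//package/video/cast/cast_member/display_name")
character_name = node.xpath("//package/video/cast/cast_member/character_name")
# Pair each display name with the character name at the same position
combined_cast = zip(cast_name, character_name)
cast = [(item1.text, item2.text) for item1, item2 in combined_cast]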

[('Elijah Wood', '#9 (voice)'), ('Peter Pan', '#8 (voice)')]
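
A variation on the same extraction, sketched with the standard library's xml.etree.ElementTree instead of lxml: reading both names from each <cast_member> in turn keeps the pairs aligned even if a member lacks one of the two elements. The cast_pairs name and the sample document are assumptions for illustration; the <package>/<video>/<cast> nesting is taken from the XPath expressions above.

```python
import xml.etree.ElementTree as ET

def cast_pairs(root):
    """Return (display_name, character_name) for every <cast_member>."""
    # findtext() with a bare tag name looks only at direct children, so the
    # localized <display_name> entries under <locales> are ignored.
    return [
        (member.findtext("display_name"), member.findtext("character_name"))
        for member in root.iter("cast_member")
    ]

document = """
<package><video><cast>
    <cast_member billing="top">
        <display_name>Elijah Wood</display_name>
        <character_name>#9 (voice)</character_name>
        <locales>
            <locale name="ko-KR">
                <display_name>일라이자 우드</display_name>
            </locale>
        </locales>
    </cast_member>
    <cast_member billing="top">
        <display_name>Peter Pan</display_name>
        <character_name>#8 (voice)</character_name>
    </cast_member>
</cast></video></package>
"""

pairs = cast_pairs(ET.fromstring(document))
print(pairs)
```

A member with a missing element yields None in that position rather than shifting the later pairs.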
Answered 2012-08-10T20:56:25.677