python - Matching patterns in Python

Question

I have an XML file that contains the following strings:

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

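The body of the XML contains > and < characters, which are not compatible with the XML specification. I need to replace them, except where > and < appear in: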

 ' "> ' 
 ' " > ' and 
 ' </ '

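respectively, in which case they should not be replaced. All other occurrences of > and < should be replaced by the strings "greater than" and "less than". So the result should look like this: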

 <field name="id">abcdef</field>
 <field name="intro" > pqrst</field>
 <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

How can I do this in Python?

score 2 · Accepted Answer

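It seems I got this working for >:

re.sub(r'(?<!" )(?<!")(?! )>', 'greater than', xml_string)

(?<!...) is a negative lookbehind assertion,

(?!...) is a negative lookahead assertion,

(...)(...) placed one after another act as a logical AND,

so the whole expression means: replace every '>' that is (not preceded by '" ') and (not preceded by '"'). The lookahead (?! ) is tested just before the '>' itself, so it always succeeds and does not check the character that follows. Note that this pattern still matches the '>' that closes an end tag such as </field>, so end tags need separate handling.

The case for < is similar.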
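To see how these lookaround assertions behave on the sample data, here is a minimal sketch. The names STRAY_GT, STRAY_LT and replace_stray are hypothetical, and the '<' pattern (skip '<' when it is followed by '/' or a letter) is an assumption, since the answer only says that case is similar. The replacement strings are padded with spaces so that 5>2 becomes "5 greater than 2".

```python
import re

# '>' that is not preceded by '"' or by '" ' (i.e. not the end of a start tag)
STRAY_GT = re.compile(r'(?<!")(?<!" )>')
# '<' that is not followed by '/' or a letter (assumed rule for tag starts)
STRAY_LT = re.compile(r'<(?![/A-Za-z])')

def replace_stray(text):
    """Spell out '>' and '<' wherever the lookarounds do not see a tag."""
    text = STRAY_GT.sub(' greater than ', text)
    return STRAY_LT.sub(' less than ', text)

print(replace_stray('<field name="intro" > show 5>2 and 3<5'))
# <field name="intro" > show 5 greater than 2 and 3 less than 5

# Limitation: the '>' of an end tag is not protected by these lookbehinds.
print(replace_stray('</field>'))
# </field greater than 
```

The second call shows why lookbehinds alone are not enough here: Python requires fixed-width lookbehinds, so an arbitrary tag name before '>' cannot be excluded this way.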

score 2 · Accepted Answer

You can use lxml.etree.XMLParser with the recover=True option:

import sys
from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
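# tostring(..., encoding="unicode") returns text, so this also works on Python 3
sys.stdout.write(etree.tostring(root, encoding="unicode"))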

Output

<root>
<field name="id">abcdef</field>
<field name="intro"> pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field>
</root>

Note: > is kept in the document as &gt;, and < is removed entirely (as an invalid character in XML text).

Regex-based solution

For simple XML-like content, you can use re.split() to separate the tags from the text and make the replacements in the non-tag text regions:

import re
from itertools import zip_longest  # Python 3 name; izip_longest on Python 2
from xml.sax.saxutils import escape  # '<' -> '&lt;'

# assumptions:
#   doc = *( start_tag / end_tag / text )
#   start_tag = '<' name *attr [ '/' ] '>'
#   end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'  # allow ws between any token
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'

assert '{{' not in tag
while '{' in tag: # unwrap definitions
    tag = tag.format(**vars())

tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = zip_longest(*iters, fillvalue='')  # iterate 2 items at a time
print(''.join(escape(text) + tag for text, tag in pairs))

To avoid false positives when matching tags, you can remove some of the '{ws}' above.

Output

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>

Note: both < and > are escaped in the text.
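As a small illustration of what the escaping step does to a text region on its own, here is a sketch using xml.sax.saxutils from the standard library. The sample string (including the extra '&') is made up for the demonstration.

```python
from xml.sax.saxutils import escape, unescape

text = ' We will show 5>2 and 3<5 & more.'

# escape() rewrites '&', '<' and '>' as XML entities
escaped = escape(text)
print(escaped)  #  We will show 5&gt;2 and 3&lt;5 &amp; more.

# unescape() reverses the transformation
print(unescape(escaped) == text)  # True
```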

You can call any function in place of escape(text) above, for example:

def escape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')
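To show how such a function plugs into the split-based approach, here is a self-contained sketch. It uses a simplified, compact tag pattern rather than the composed one above, and the names TAG, spell_out and rewrite_text are hypothetical. spell_out is a variant of escape4human that pads the words with spaces, which is an assumption made to reproduce the "5 greater than 2" output asked for in the question.

```python
import re

# Simplified tag pattern (assumption): start or end tag with
# double-quoted attributes only. The single capturing group makes
# re.split() keep the tags in its result.
TAG = re.compile(r'(</?[a-zA-Z]+(?:\s+[a-zA-Z]+\s*=\s*"[^"]*")*\s*/?>)')

def spell_out(text):
    """Replace '<' and '>' with words, padded with spaces."""
    return text.replace('<', ' less than ').replace('>', ' greater than ')

def rewrite_text(doc, fix=spell_out):
    """Apply fix() to text regions only; tags pass through unchanged."""
    parts = TAG.split(doc)  # even indices: text, odd indices: tags
    return ''.join(part if i % 2 else fix(part)
                   for i, part in enumerate(parts))

doc = ('<field name="id">abcdef</field>\n'
       '<field name="desc"> show 5>2 and 3<5.</field>')
print(rewrite_text(doc))
# <field name="id">abcdef</field>
# <field name="desc"> show 5 greater than 2 and 3 less than 5.</field>
```

Indexing the split result by parity is an alternative to pairing items with zip_longest; both rely on re.split() returning text and captured tags in alternation.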

score -2 · Accepted Answer


Use ElementTree for XML parsing.

answered 2012-11-10T04:03:22.653
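As a hedged illustration of how ElementTree relates to this problem: the standard-library parser rejects the unescaped text, so the stray characters have to be escaped (or replaced) before parsing. The sample strings below are shortened versions of the data in the question, wrapped in a hypothetical root element.

```python
import xml.etree.ElementTree as ET

raw = '<root><field name="desc">5>2 and 3<5</field></root>'
try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    # the bare '<' in the text makes the document not well-formed
    parsed_raw = False

# Same content with the text escaped first
fixed = '<root><field name="desc">5&gt;2 and 3&lt;5</field></root>'
field = ET.fromstring(fixed).find('field')

print(parsed_raw)  # False
print(field.text)  # 5>2 and 3<5
```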
