2

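I have some well-behaved XML files that I want to reformat (not parse!) using regular expressions. The goal is to put each <trkpt> element, from its opening tag through its closing tag, on a single line.

The following code works, but I would like to do this in a single regex substitution rather than a loop, so that I don't need to join the strings back together.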

import re

xml = """
    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
    </trkseg>
"""

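for trkpt in re.findall('<trkpt.*?</trkpt>', xml, re.DOTALL):
    # The 4th positional argument of re.sub is count, not flags, so re.DOTALL
    # is dropped here (the pattern has no '.', so the flag had no effect anyway).
    print(re.sub(r'>\s*<', '><', trkpt))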

Answers using sed are also welcome.

Thanks for reading.

4 Answers

2

How about this:

>>> regex = re.compile(
    r"""\n[ \t]*  # Match a newline plus following whitespace
    (?=           # only if... 
     (?:          # ...the following can be matched:
      (?!<trkpt)  #  (unless an opening <trkpt> tag occurs first)
      .           #  any character
     )*           # any number of times,
     </trkpt>     # followed by a closing </trkpt> tag
    )             # End of lookahead""", 
    re.DOTALL | re.VERBOSE)
>>> print(regex.sub("", xml))

    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
    </trkseg>
answered 2012-08-30T21:53:20.493
1

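This isn't exactly what you asked for, but for the sake of having a one-liner, here is one (the inner call no longer passes re.DOTALL positionally, since that slot is count, not flags):

>>> print(re.sub(r'(<trkpt.*?</trkpt>)',
                 lambda m: re.sub(r'>\s*<', '><', m.group(1)),
                 xml, flags=re.DOTALL))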

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

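Also note that this will break if any attribute value contains the string "<trkpt". That probably won't happen, but it is the kind of problem you get from not using a real parser.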
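
To make that caveat concrete, here is a minimal sketch in Python 3. The helper name `flatten` and the contrived `name="<trkpt"` attribute on `<trkseg>` are assumptions made up for this demonstration; they are not part of the original data.

```python
import re

def flatten(xml):
    # Same idea as the one-liner above: find each <trkpt>...</trkpt> span,
    # then strip the whitespace between tags inside that span only.
    return re.sub(r'(<trkpt.*?</trkpt>)',
                  lambda m: re.sub(r'>\s*<', '><', m.group(1)),
                  xml, flags=re.DOTALL)

# Contrived input: the text "<trkpt" appears inside an attribute value.
tricky = '''<trkseg name="<trkpt">
  <trkpt lon="1" lat="2">
    <ele>0</ele>
  </trkpt>
</trkseg>
'''

result = flatten(tricky)
print(result)
```

The outer match starts inside the attribute value, so the first real `<trkpt>` gets pulled up onto the `<trkseg>` line instead of staying on its own line.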

answered 2012-08-30T21:11:34.863
1

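Do you want to keep <trkseg>? If so, this may work for you (note that flags must be passed by keyword, because the 4th positional argument of re.sub is count; the flag is omitted here since the pattern contains no '.'):

print(re.sub(r'([^gt])>\s*<', r'\g<1>><', xml))

This removes all whitespace between tags, provided the preceding tag does not end in t or g.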

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
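
A small sketch of where the "does not end in t or g" condition bites, written for Python 3. The helper name `join_tags` and the extra `<sat>` child are assumptions added for illustration; the original sample only contains `<time>` and `<ele>`.

```python
import re

def join_tags(xml):
    # Remove whitespace between tags unless the tag before it ends in 'g' or 't'
    # (which is what protects <trkseg> and </trkpt> in the original sample).
    return re.sub(r'([^gt])>\s*<', r'\g<1>><', xml)

sample = '''<trkseg>
  <trkpt lon="1" lat="2">
    <ele>0</ele>
    <sat>5</sat>
  </trkpt>
</trkseg>
'''

result = join_tags(sample)
print(result)
```

Because `</sat>` ends in t, the line break after it is kept and `</trkpt>` stays on its own line, so this approach only fits files whose inner tag names avoid those two letters at the end.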
answered 2012-08-30T21:19:43.253
1

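Another one-liner is the following (re.DOTALL has to be passed as flags=, otherwise it is taken as count and '.' will not match the newlines):

print(re.sub("(<trkpt.+?>).*?(<time>.+?</time>).*?(<ele>.+?</ele>).*?(</trkpt>)",
             r'\1\2\3\4', xml, flags=re.DOTALL))

which produces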

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

This has the advantage of being easy to adapt to other tags.
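
As a rough sketch of what "easy to adapt" could look like in Python 3, the pattern can be assembled from a list of tag names. The function name `collapse` and its parameters are hypothetical, invented for this example.

```python
import re

def collapse(xml, parent, children):
    """Put each <parent> element on one line, keeping only the listed children."""
    # One capture group per piece: opening tag, each child element, closing tag.
    pieces = ['(<{0}.+?>)'.format(re.escape(parent))]
    for tag in children:
        pieces.append('(<{0}>.+?</{0}>)'.format(re.escape(tag)))
    pieces.append('(</{0}>)'.format(re.escape(parent)))
    # Whatever sits between the pieces (here: whitespace) is matched and dropped.
    pattern = '.*?'.join(pieces)
    replacement = ''.join('\\g<{}>'.format(i) for i in range(1, len(pieces) + 1))
    return re.sub(pattern, replacement, xml, flags=re.DOTALL)

sample = '''<trkseg>
  <trkpt lon="1" lat="2">
    <time>2012-08-25T10:20:44Z</time>
    <ele>0</ele>
  </trkpt>
</trkseg>
'''

result = collapse(sample, 'trkpt', ['time', 'ele'])
print(result)
```

Two things to keep in mind with this approach: the children must appear in the listed order, and anything between them that is not listed is silently discarded.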

answered 2012-08-30T21:25:26.153