python - How to use a regex to find quoted attribute values in an OPML (XML) file

Question

I am searching an OPML file that looks like this. I want to extract the outline text and the xmlUrl.

  <outline text="lol">
  <outline text="Discourse on the Otter" xmlUrl="http://discourseontheotter.tumblr.com/rss" htmlUrl="http://discourseontheotter.tumblr.com/"/>
  <outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
  </outline>

My function:

 import re
 rssName = 'outline text="(.*?)"'
 rssUrl =  'xmlUrl="(.*?)"'

 def rssSearch():
     doc = open('ttrss.txt')
     for line in doc:
        if "xmlUrl" in line:
            mName = re.search(rssName, line)
            mUrl = re.search(rssUrl, line)
            if mName is not None:
                print mName.group()
                print mUrl.group()

However, the output looks like this:

 outline text="fedoras of okc"
 xmlUrl="http://fedorasofokc.tumblr.com/rss"

What are the correct regular expressions for rssName and rssUrl so that I get back only the strings between the quotes?

score 3 · Accepted Answer

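Don't use regular expressions to parse XML. The code gets messy, and there are too many things that can go wrong.

For example, what if your OPML provider happens to reformat their output like this: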

<outline text="lol">
  <outline
      htmlUrl="http://discourseontheotter.tumblr.com/"
      xmlUrl="http://discourseontheotter.tumblr.com/rss"
      text="Discourse on the Otter"
  />
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>

This is perfectly valid XML, and it means exactly the same thing. But a line-oriented search and a regex such as 'outline text="(.*?)"' will break on it.
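To make the failure concrete, here is a minimal sketch that applies the question's line-by-line search to both layouts. It assumes Python 3 (the original code is Python 2), and the helper name `names_found` and the two sample strings are made up for illustration.

```python
import re

# Pattern from the question: it assumes text="..." directly follows
# the word "outline" on the same line.
RSS_NAME = r'outline text="(.*?)"'

# One element in the original single-line layout.
single_line = ('<outline text="fedoras of okc" '
               'xmlUrl="http://fedorasofokc.tumblr.com/rss" '
               'htmlUrl="http://fedorasofokc.tumblr.com/"/>')

# The same element with attributes reordered and split across lines.
reformatted = """<outline
    htmlUrl="http://fedorasofokc.tumblr.com/"
    xmlUrl="http://fedorasofokc.tumblr.com/rss"
    text="fedoras of okc"
/>"""


def names_found(document):
    """Run the question's line-oriented search; return the captured names."""
    found = []
    for line in document.splitlines():
        if "xmlUrl" in line:
            match = re.search(RSS_NAME, line)
            if match is not None:
                found.append(match.group(1))
    return found


print(names_found(single_line))   # ['fedoras of okc']
print(names_found(reformatted))   # []
```

The second call finds nothing because the line containing xmlUrl no longer contains the text attribute, even though both strings describe the same element.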

Use an XML parser instead. Your code will be cleaner, simpler, and more reliable:

import xml.etree.cElementTree as ET

root = ET.parse('ttrss.txt').getroot()
for outline in root.iter('outline'):
    text = outline.get('text')
    xmlUrl = outline.get('xmlUrl')
    if text and xmlUrl:
        print text
        print xmlUrl

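This handles your OPML snippet as well as similar OPML files I found on the web, such as this political science list. It is very simple, and there is nothing tricky about it. (I'm not bragging; that is just the benefit you get from using an XML parser instead of regular expressions.)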
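For readers on Python 3, where `print` is a function and `xml.etree.cElementTree` was removed in 3.9, the same approach might look like the sketch below. It parses an inline string rather than the file 'ttrss.txt' so that it is self-contained; the function name `extract_feeds` and the sample snippet (mixing both layouts) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data mixing the single-line and the reformatted layouts.
OPML_SNIPPET = """<outline text="lol">
  <outline
      htmlUrl="http://discourseontheotter.tumblr.com/"
      xmlUrl="http://discourseontheotter.tumblr.com/rss"
      text="Discourse on the Otter"
  />
  <outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
</outline>"""


def extract_feeds(opml_text):
    """Return (text, xmlUrl) pairs for every outline that has both attributes."""
    root = ET.fromstring(opml_text)
    feeds = []
    # iter() walks the root element and all of its descendants.
    for outline in root.iter('outline'):
        text = outline.get('text')
        xml_url = outline.get('xmlUrl')
        # Folder-style outlines (like "lol") have no xmlUrl and are skipped.
        if text and xml_url:
            feeds.append((text, xml_url))
    return feeds


for text, xml_url in extract_feeds(OPML_SNIPPET):
    print(text)
    print(xml_url)
```

To read from a file instead, `ET.parse('ttrss.txt').getroot()` can replace the `ET.fromstring` call, as in the answer above.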

score 2 · Accepted Answer

Try

print mName.group(1)
print mUrl.group(1)

http://docs.python.org/2/library/re.html#re.MatchObject.group

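If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.

Or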

rssName = 'outline text="(?P<text>.*?)"'

and then

print mName.group('text')
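The difference between the whole match and a capture group can be checked on one line of the question's data. This is a small sketch assuming Python 3; the variable names `m_name` and `m_url` are illustrative.

```python
import re

line = ('<outline text="fedoras of okc" '
        'xmlUrl="http://fedorasofokc.tumblr.com/rss" '
        'htmlUrl="http://fedorasofokc.tumblr.com/"/>')

# Named group for the outline text, plain numbered group for the URL.
m_name = re.search(r'outline text="(?P<text>.*?)"', line)
m_url = re.search(r'xmlUrl="(.*?)"', line)

print(m_name.group())        # outline text="fedoras of okc"  (whole match)
print(m_name.group(1))       # fedoras of okc
print(m_name.group('text'))  # fedoras of okc
print(m_url.group(1))        # http://fedorasofokc.tumblr.com/rss
```

Calling `group()` with no argument is the same as `group(0)`, which is why the question's code printed the attribute name along with the value.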
