python - Regex (Python) to extract text strings from inside < and > - e.g. etc

Question

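I'm currently working with the Stack Overflow data dump and trying to build what I imagine is a simple regular expression to extract tag names from inside < and > characters. For each question I have a list of one or more tags, such as <tagone><tag-two>...<tag-n>, and I'm trying to extract just the list of tag names. Here are some sample tag strings taken from the data dump: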

<javascript><internet-explorer>

<c#><windows><best-practices><winforms><windows-services>

<c><algorithm><sorting><word>

<java>

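For reference, I don't need to split tag names into words, so for <best-practices> I'd want best-practices returned (not best and practices). Also, for what it's worth, and in case it makes any difference, I'm using Python. Any suggestions?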

score 3 · Accepted Answer

Since Stack Overflow tag names never contain an embedded < or >, you can use the regex:

<(.*?)>

or

<([^>]*)>

Explanation:

<: a literal <
(..): group and remember the match.
.*?: match anything, non-greedily.
>: a literal >
[^>]: a character class that matches any character other than >
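
As a minimal sketch of how either pattern might be applied from Python (the helper name extract_tags is hypothetical and not part of the answer), re.findall returns the contents of the capture group for every match, so the angle brackets are dropped and hyphenated names stay whole:

```python
import re

# Either pattern from the answer works here; this sketch uses the
# negated character class, which can never run past the closing '>'.
TAG_PATTERN = re.compile(r"<([^>]*)>")


def extract_tags(tag_string):
    """Return the tag names found in a string such as '<c#><windows>'."""
    # With exactly one capture group, findall returns a list of the
    # captured substrings rather than the full matches.
    return TAG_PATTERN.findall(tag_string)


print(extract_tags("<c#><windows><best-practices><winforms><windows-services>"))
print(extract_tags("<java>"))
```

Calling extract_tags once per question keeps each question's tags in their own list, which matches the "list of tag names per question" the question asks for.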

score 3 · Accepted Answer

Instead of working with data dumps (whatever they are) and regex, you may be interested in using the Stack Overflow API and JSON.

For example, to cull the tags from this question, you could do this:

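# Python 2 code: urllib2 and cStringIO do not exist in Python 3.
import urllib2
import json
import gzip
import cStringIO

# Fetch the question, then gunzip the response body before parsing the JSON.
f = urllib2.urlopen('http://api.stackoverflow.com/1.0/questions/3708418?type=jsontext')
g = gzip.GzipFile(fileobj=cStringIO.StringIO(f.read()))
j = json.loads(g.read())

print(j['questions'][0]['tags'])
# [u'python', u'regex']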

score 2 · Accepted Answer

Here's a quick and dirty solution:

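#!/usr/bin/python

import re

# Non-greedy capture of everything between a '<' and the next '>'.
pattern = re.compile(r"<(.*?)>")
data = """
<javascript><internet-explorer>

<c#><windows><best-practices><winforms><windows-services>

<c><algorithm><sorting><word>

<java>
"""

# findall returns the captured tag names in order of appearance.
for each in pattern.findall(data):
    print(each)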

Update

Statutory warning: if the data dump is in XML or JSON format (as one of the users commented), it's better to use a proper XML or JSON parser.
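
A rough sketch of the parser-first approach, assuming the dump is XML in which each question is a row element with the tag string stored in a Tags attribute. The element name, the attribute names, and the sample row are assumptions for illustration, not taken from the question, so check them against the actual file:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical one-row sample; element and attribute names are assumptions.
SAMPLE = (
    '<posts>'
    '<row Id="1" Tags="&lt;c#&gt;&lt;windows&gt;&lt;best-practices&gt;" />'
    '</posts>'
)

TAG_PATTERN = re.compile(r"<([^>]*)>")


def tags_per_row(xml_text):
    """Map each row's Id attribute to its list of tag names."""
    root = ET.fromstring(xml_text)
    result = {}
    for row in root.iter("row"):
        # The XML parser has already decoded &lt; and &gt; into < and >,
        # so the regex sees the plain '<tag><tag>' string.
        result[row.get("Id")] = TAG_PATTERN.findall(row.get("Tags", ""))
    return result


print(tags_per_row(SAMPLE))
```

The parser handles the file structure and entity decoding; the regex is then only applied to the short tag string, where the "no embedded < or >" assumption holds.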
