python - Extract strings that start with a symbol from text and combine them with other strings using Python

Question

I have a file containing a large number of URLs mixed with plain text. A sample:

'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'

I want to get:

'Reference Informal ACADEMIC type school ACADEMIC type'

I tried:

substr1 = re.findall(r"#(\w+)", text1)

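It does part of the job, but I don't know how to extract the parts I want and combine them with the other words in the text. Essentially, I need to get rid of the URLs and the "#" symbols. Can anyone help me?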

score 2 · Accepted Answer

Turn the problem around; remove the URLs instead:

re.sub(r'\bhttps?://[^# ]+#?', '', text1)

Demo:

>>> import re
>>> text1 = 'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'
>>> re.sub(r'\bhttps?://[^# ]+#?', '', text1)
'Reference Informal ACADEMIC type school ACADEMIC type'

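The expression looks for anything starting with http:// or https://, and removes everything after it that is not a hash or a space, including an optional trailing hash.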
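As a minimal sketch of how this substitution might be wrapped for reuse: the helper name strip_urls and the example.com input are made up for illustration and are not part of the answer. When a URL has no fragment, removing it leaves two adjacent spaces, so this version also collapses whitespace, which is an extra step the original expression does not perform.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
URL_PREFIX = re.compile(r'\bhttps?://[^# ]+#?')


def strip_urls(text):
    """Remove URL prefixes (hypothetical helper), then collapse leftover whitespace."""
    without_urls = URL_PREFIX.sub('', text)
    # split() with no arguments drops runs of whitespace; join puts single spaces back.
    return ' '.join(without_urls.split())


# A URL without a '#fragment' disappears entirely.
print(strip_urls('see http://example.com/page for details'))
# A URL with a fragment keeps only the part after the hash.
print(strip_urls('http://example.com/onto.rdf#school ACADEMIC type'))
```

Whether collapsing whitespace is wanted depends on the data; if the original spacing matters, calling URL_PREFIX.sub('', text) on its own keeps it.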

score 1 · Accepted Answer

Using re.findall:

>>> import re
>>> s = 'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'
>>> ''.join(re.findall(r'#(.*?)(?=https?:|$)', s))
'Reference Informal ACADEMIC type school ACADEMIC type'

Explanation: http://regex101.com/r/dV5uR2
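For readers who cannot open the link, here is one way the findall pattern might be annotated using re.VERBOSE. This is a sketch, not the answerer's own breakdown; the names FRAGMENT_TEXT and extract_after_hash are hypothetical. The hash is written as [#] because a bare # starts a comment in verbose mode.

```python
import re

# Verbose rewrite of r'#(.*?)(?=https?:|$)' with one comment per piece.
FRAGMENT_TEXT = re.compile(r"""
    [#]             # a literal hash, bracketed so verbose mode does not read it as a comment
    (.*?)           # capture as few characters as possible ...
    (?=https?:|$)   # ... stopping just before the next http:/https: or at the end of the string
""", re.VERBOSE)


def extract_after_hash(text):
    """Join everything that follows each '#' up to the next URL (hypothetical helper)."""
    return ''.join(FRAGMENT_TEXT.findall(text))


s = ('http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference '
     'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type '
     'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type')
print(extract_after_hash(s))
```

One consequence of this approach is that only text following a "#" is kept, so plain words that appear before the first URL would be dropped; the re.sub approach does not have that limitation.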
