I have a file containing a lot of URLs mixed with plain text. Example:

'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'

I want to get:

'Reference Informal ACADEMIC type school ACADEMIC type'

I tried:

substr1 = re.findall(r"#(\w+)", text1)

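It does part of the job, but I don't know how to extract the parts I want and combine them with the other words in the text. Essentially, I need to get rid of the URLs and the "#" symbols. Can anyone help?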

2 Answers

Turn the problem around; remove the URLs instead:

re.sub(r'https?://[^# ]+#?', '', text1)

Demo:

>>> import re
>>> text1 = 'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'
>>> re.sub(r'https?://[^# ]+#?', '', text1)
'Reference Informal ACADEMIC type school ACADEMIC type'

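The expression finds anything starting with http:// or https://, then removes every following character that is not a hash or a space, plus the hash itself if one is present.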
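To make the three parts of that pattern easier to see, here is a minimal sketch that writes the same expression with re.VERBOSE and wraps it in a helper. The names URL_PREFIX and strip_url_prefixes are made up for illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so each part
# can carry a comment. URL_PREFIX and strip_url_prefixes are hypothetical names.
URL_PREFIX = re.compile(
    r"""
    https?://   # scheme: http:// or https://
    [^#\ ]+     # one or more characters that are neither a hash nor a space
    \#?         # the hash itself, if there is one
    """,
    re.VERBOSE,
)


def strip_url_prefixes(text):
    """Remove every URL up to and including its '#', keeping the fragment."""
    return URL_PREFIX.sub("", text)


print(strip_url_prefixes("http://example.org/onto.rdf#school ACADEMIC type"))
# school ACADEMIC type
```

Because the hash is optional, a URL without a fragment is removed entirely, so this only fits input where the part after "#" is the text you want to keep.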

answered 2013-11-11T14:11:51.050

Using re.findall:

>>> import re
>>> s = 'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'
>>> ''.join(re.findall(r'#(.*?)(?=https?:|$)', s))
'Reference Informal ACADEMIC type school ACADEMIC type'

Explanation: http://regex101.com/r/dV5uR2
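One difference between this and the re.sub approach may matter if the input can start with plain text. The sketch below uses a made-up sample string (s2 is not from the question) to show that re.findall keeps only what follows each "#", while re.sub leaves everything outside the URLs untouched.

```python
import re

# Made-up input: note the plain word "label" before the first URL.
s2 = 'label http://example.org/onto.rdf#Reference ACADEMIC type'

# findall only captures what follows each '#', so "label " is lost.
kept_by_findall = ''.join(re.findall(r'#(.*?)(?=https?:|$)', s2))

# sub deletes only the URL prefixes and leaves everything else in place.
kept_by_sub = re.sub(r'https?://[^# ]+#?', '', s2)

print(kept_by_findall)  # Reference ACADEMIC type
print(kept_by_sub)      # label Reference ACADEMIC type
```

For the string in the question, which begins with a URL, both approaches give the same result.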

answered 2013-11-11T14:22:15.940