python - How can I use regular expressions to extract multiple citations separated by tags from text?

Question

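I have a manually created input file containing citations, each in the following format:

< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S>< S sid =" 3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.</S>

Here is my current approach using Python's re module: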

citance = citance[citance.find(">")+1:citance.rfind("<")]
fd.write(citance+"\n")

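I am trying to extract everything from the first closing angle bracket (">") to the last opening angle bracket ("<"). However, this approach fails when there are multiple citations, because the intermediate tags end up in the output as well:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.

My desired output:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier. Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.

How can I achieve this correctly?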

score 1 · Accepted Answer

I would use Python's regex module, re, like this:

re.findall(r'\">(.*?)<', text_to_parse)

This returns a list of one or more citations. If you want a single piece of text, you can join them afterwards (" ".join(...)).
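As a minimal sketch of the findall-then-join idea, assuming the input looks like the question's sample (the sentences are shortened here, and the names `citations` and `combined` are illustrative only):

```python
import re

# Shortened, hypothetical sample in the same shape as the question's input.
text_to_parse = (
    '< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs.</S>'
    '< S sid ="3" ssid = "3">Previous work often uses a secondary classifier.</S>'
)

# The pattern anchors on the '">' that ends each opening tag, then lazily
# captures everything up to the next '<'.
citations = re.findall(r'">(.*?)<', text_to_parse)

# Join the individual citations into one piece of text.
combined = " ".join(citations)
print(combined)
```

Note that this pattern relies on every opening tag ending with a quoted attribute directly before ">"; input with a different tag layout would need a different pattern.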

score 1 · Accepted Answer

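Instead of using the re module, take a look at the bs4 library.

It is an XML/HTML parser, so you can get everything between the tags.

In your case, it would look something like this: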

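from bs4 import BeautifulSoup

# The tags must be well formed ("<S ...>", with no space after "<"); otherwise the parser treats them as plain text.
xml_text = '<S sid="2" ssid="2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S><S sid="3" ssid="3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.</S>'

text_soup = BeautifulSoup(xml_text, 'lxml')

# The 'lxml' HTML parser lowercases tag names, so search for 's' rather than 'S'.
# find_all returns Tag objects; get_text() pulls out the text inside each one.
output = [tag.get_text() for tag in text_soup.find_all('s', attrs={'sid': '2'})]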

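The output will contain the text:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.

Also, if you just want to strip the HTML tags: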

import re

xml_text = '< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>'

re.sub(r'<.*?>', '', xml_text)

will do the job.
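A small sketch of the tag-stripping approach, using a shortened hypothetical sample; the helper name `strip_tags` and the variable `spaced` are assumptions made for illustration:

```python
import re

# Shortened, hypothetical sample in the same shape as the question's input.
sample = ('< S sid ="2" ssid = "2">First citation.</S>'
          '< S sid ="3" ssid = "3">Second citation.</S>')

def strip_tags(text):
    """Remove every '<...>' span; the lazy '.*?' stops at the nearest '>'."""
    return re.sub(r'<.*?>', '', text)

cleaned = strip_tags(sample)
print(cleaned)

# Replacing tags with a space and collapsing whitespace keeps sentences apart.
spaced = " ".join(re.sub(r'<.*?>', ' ', sample).split())
print(spaced)
```

Deleting the tags outright leaves adjacent citations touching, so the second variant may be closer to the desired output. Either way, a ">" inside an attribute value would end the match early, which is one reason a real parser is the safer choice for messier input.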

score 0 · Accepted Answer

I think this is what you are looking for.

import re

string = ">here is some text<>here is some more text<"
matches = re.findall(">(.*?)<", string)
for match in matches:
    print(match)

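It seems you were running into the problem of getting too much in your result. The match containing "here is some more text<" probably ran from the first character of the string to the last, since those are ">" and "<", and ignored the ones in between. The '.*?' idiom is non-greedy: each match stops at the nearest "<", so you get the maximum number of hits.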
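To make the difference concrete, here is a short comparison of the greedy and non-greedy forms on the same string (the variable names `greedy` and `lazy` are illustrative):

```python
import re

string = ">here is some text<>here is some more text<"

# Greedy: '.*' runs to the last '<', so the middle tags are swallowed
# and only one long match comes back.
greedy = re.findall(r">(.*)<", string)

# Non-greedy: '.*?' stops at the nearest '<', giving one match per segment.
lazy = re.findall(r">(.*?)<", string)

print(greedy)
print(lazy)
```

The greedy pattern behaves like the find/rfind slicing in the question, which is why the intermediate tags showed up in the output.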
