python - How do I find a multi-word string within a string and label it in Python?

Question

For example, given the sentence "The corporate balance sheets data are available on an annual basis", I need to label the substring "corporate balance sheets" wherever it occurs in that sentence.

So, the pattern I need to find is:

"corporate balance sheets"

The given string:

"The corporate balance sheets data are available on an annual basis".

The label sequence I want as output is:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

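I have a large set of sentences (more than 2 GB) and a set of patterns to find. I don't know how to do this efficiently in Python. Can someone suggest a good algorithm?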

score 1 · Accepted Answer

Using a list comprehension and split:

import re

search_word = 'corporate balance sheets'
sentence = "The corporate balance sheets data are available on an annual basis"

# Escape the phrase so any regex metacharacters in it are matched literally
p = re.compile(re.escape(search_word))

# One 1 per word of the search phrase
lst = [1] * len(search_word.split())
# Replace each occurrence with a placeholder token, then tag token by token
vect = [lst if item == '__match_word' else [0]
        for item in p.sub('__match_word', sentence).split()]
# Flatten the per-token lists into a single label sequence
flattened = [val for sublist in vect for val in sublist]
print(flattened)

Output:

 [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

Sentence ="公司资产负债表数据可在年表上获得"

Output:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

score 1 · Accepted Answer

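Since all the words of the substring have to match, you can check them with all while iterating over the sentence and update the appropriate indices: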

def encode(sub, sent):
    subwords, sentwords = sub.split(), sent.split()
    n = len(subwords)
    res = [0 for _ in sentwords]
    # Try every start index that leaves room for the whole phrase
    # (this also covers one-word phrases)
    for i in range(len(sentwords) - n + 1):
        if all(x == y for x, y in zip(subwords, sentwords[i:i + n])):
            for j in range(n):
                res[i + j] = 1
    return res


sub = "corporate balance sheets"
sent = "The corporate balance sheets data are available on an annual basis"
print(encode(sub, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

sent = "The corporate balance data are available on an annual basis sheets"
print(encode(sub, sent))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
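The question also mentions many patterns and a large amount of text. One way to extend the idea above, shown here only as a rough sketch, is to group the patterns by their first word so that each sentence token triggers a comparison only against patterns that could start there, and to process the input one line at a time rather than loading it all into memory. The names `build_index`, `encode_many` and `encode_lines` are hypothetical, and the sketch assumes one sentence per line and whitespace tokenization.

```python
from collections import defaultdict

def build_index(patterns):
    """Group tokenized patterns by their first word."""
    index = defaultdict(list)
    for pattern in patterns:
        words = pattern.split()
        if words:  # ignore empty patterns
            index[words[0]].append(words)
    return index

def encode_many(index, sent):
    """Label every token of `sent` that is covered by any indexed pattern."""
    sentwords = sent.split()
    res = [0] * len(sentwords)
    for i, word in enumerate(sentwords):
        # Only patterns starting with this word can match at position i
        for subwords in index.get(word, ()):
            n = len(subwords)
            if sentwords[i:i + n] == subwords:
                res[i:i + n] = [1] * n
    return res

def encode_lines(index, lines):
    """Lazily encode an iterable of sentences, e.g. an open text file."""
    for line in lines:
        yield encode_many(index, line)

index = build_index(["corporate balance sheets", "annual basis"])
sent = "The corporate balance sheets data are available on an annual basis"
print(encode_many(index, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

Passing an open file object as `lines` keeps memory use roughly proportional to a single line. For very large pattern sets, a trie over pattern words or an Aho-Corasick automaton may scale better than this first-word index.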
