python - Check whether a text/string contains elements of a predefined list

Question

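I have several text files that I want to compare against a vocabulary list made up of multi-word expressions and single words. The desired output is a dictionary with every element of that list as a key and its frequency in the text file as the value. To build the vocabulary list, I need to merge two lists: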

list1 = ['accounting',..., 'yields', 'zero-bond']
list2 = ['accounting', 'actual cost', ..., 'zero-bond']
vocabulary_list = ['accounting', 'actual cost', ..., 'yields', 'zero-bond']

sample_text = "Accounting experts predict an increase in yields for zero-bond and yields for junk-bonds."

desired_output = {'accounting': 1, 'actual cost': 0, ..., 'yields': 2, 'zero-bond': 1}

What I tried:

from collections import Counter

def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 for each word
    ct = Counter(dict((w, 0) for w in words))
    # split each line into whitespace-separated tokens
    file_words = (word for line in fileobj for word in line.split())
    filtered_words = (word for word in file_words if word in words)
    # update the zero-initialised counter so absent words keep a count of 0
    ct.update(filtered_words)
    return ct

def print_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    with open(filepath[:-4] + '_dict' + '.txt', mode='w') as outfile:
        outfile.write('{0}\n{1}\n{2}\n\n'.format(filepath, ', '.join(words), ', '.join(counts)))
    return outfile

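Is there a way to do this in Python? I figured out how to handle a vocabulary of single words (one token each), but I can't work out a solution for the multi-word case.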

score 0 · Accepted Answer

If you also want to count words that end in punctuation, e.g. 'yields!' as 'yields', you will need to clean the text first.

import re
from collections import Counter

vocabulary_list = ['accounting', 'actual cost', 'yields', 'zero-bond']
d = {k: 0 for k in vocabulary_list}
sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
# keep the tokens in a list (not a set) so repeated words are counted
splitted = sample_text.split()
c = Counter(splitted)  # get count of all words

for k in d:
    spl = k.split()
    ln = len(spl)
    # a multi-word phrase never appears among the split tokens,
    # so search for it in the whole text instead
    if ln > 1:
        # re.escape guards against regex metacharacters in the phrase
        check = re.findall(r'\b{0}\b'.format(re.escape(k)), sample_text)
        if check:
            d[k] += len(check)
    # else we are looking for a single word
    elif k in c:
        d[k] += c[k]
print(d)
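
For the clean-up step mentioned at the top of this answer, one possible sketch (an assumption, not part of the original answer) strips punctuation from the edges of each token with `str.strip` and `string.punctuation` before counting. The helper name `clean_tokens` and the sample sentence are hypothetical.

```python
import string
from collections import Counter


def clean_tokens(text):
    """Lower-case text, split on whitespace, strip punctuation from token edges."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    # drop tokens that consisted of punctuation only
    return [tok for tok in tokens if tok]


sample_text = "Accounting experts predict an increase in yields! And yields for zero-bond, too."
counts = Counter(clean_tokens(sample_text))
print(counts['yields'], counts['zero-bond'])
```

Only the edges of each token are stripped, so the hyphen inside 'zero-bond' is kept.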

To chain all the lists into a single vocabulary dictionary:

import re
from collections import Counter
from itertools import chain

l1, l2 = ['accounting', 'actual cost'], ['yields', 'zero-bond']
# chain iterates over both lists as if they were one
vocabulary_dict = {k: 0 for k in chain(l1, l2)}
print(vocabulary_dict)
sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
splitted = sample_text.split()
c = Counter(splitted)  # get count of all words

for k in vocabulary_dict:
    spl = k.split()
    ln = len(spl)
    # multi-word phrases are searched for in the whole text
    if ln > 1:
        check = re.findall(r'\b{0}\b'.format(re.escape(k)), sample_text)
        if check:
            vocabulary_dict[k] += len(check)
    # single words are looked up in the token counts
    elif k in c:
        vocabulary_dict[k] += c[k]
print(vocabulary_dict)

You could also create two dictionaries, one for phrases and one for single words, and do a pass over each.
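
A minimal sketch of that two-dictionary idea, assuming any vocabulary entry that contains whitespace counts as a phrase. The names `phrases`, `words`, `token_counts` and `result` are illustrative and not from the original answer.

```python
import re
from collections import Counter

vocabulary_list = ['accounting', 'actual cost', 'yields', 'zero-bond']
sample_text = ("Accounting experts predict an increase in yields "
               "for zero-bond and yields for junk-bonds.").lower()

# split the vocabulary: entries containing whitespace are treated as phrases
phrases = {k: 0 for k in vocabulary_list if len(k.split()) > 1}
words = {k: 0 for k in vocabulary_list if len(k.split()) == 1}

# pass 1: single words, counted from whitespace-separated tokens
token_counts = Counter(sample_text.split())
for w in words:
    words[w] = token_counts[w]

# pass 2: phrases, searched in the whole text with word boundaries
for p in phrases:
    phrases[p] = len(re.findall(r'\b{0}\b'.format(re.escape(p)), sample_text))

# merge both results into one frequency dictionary
result = {**words, **phrases}
print(result)
```

Keeping the two groups apart means the tokens are counted once for all single words, and the regular-expression search runs only for the phrases.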
