python - How do I make it so that I can read a text file with only particular words?

Question

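How can I make my code read only specific words from a text file and display each word with its count (the number of times the word appears in the text file)?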

from collections import Counter
import re

def openfile(filename):
    # Read the whole file into one string.
    with open(filename, "r") as fh:
        return fh.read()

def removegarbage(text):
    # Replace runs of non-word characters with a space, then lowercase.
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split()
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print(key, value)

main('filename.txt', 10)

score 1 · Accepted Answer

A generator that yields all the words in the file would come in handy:

from collections import Counter
import re

def words(filename):
    regex = re.compile(r'\w+')
    with open(filename) as f:
        for line in f:
            for word in regex.findall(line):
                yield word.lower()

Then, either:

wordcount = Counter(words('filename.txt'))
for word in ['foo', 'bar']:
    print(word, wordcount[word])

or

words_to_count = {'foo', 'bar'}
wordcount = Counter(word for word in words('filename.txt')
                    if word in words_to_count)
print(wordcount.items())
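
To try the second variant end to end without an existing file, here is a minimal sketch. It assumes a throwaway file created with tempfile and made-up sample text; the helper name count_specific_words is hypothetical and not part of the answer above. Unlike a plain Counter, it also reports a 0 for target words that never occur.

```python
import os
import re
import tempfile
from collections import Counter


def words(filename):
    """Yield every word in the file, lowercased (same idea as the generator above)."""
    regex = re.compile(r'\w+')
    with open(filename) as f:
        for line in f:
            for word in regex.findall(line):
                yield word.lower()


def count_specific_words(filename, targets):
    """Hypothetical helper: count only the words listed in `targets`."""
    targets = {t.lower() for t in targets}
    counts = Counter(w for w in words(filename) if w in targets)
    # Report a 0 for targets that never appear, instead of omitting them.
    return {t: counts[t] for t in sorted(targets)}


# Assumed sample data, written to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Foo bar foo.\nBaz FOO qux\n")
    sample_path = tmp.name

result = count_specific_words(sample_path, ['foo', 'bar', 'missing'])
os.remove(sample_path)
print(result)
```

Because the generator reads the file line by line, only the counts for the target words are held in memory, not the whole file.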

score 1 · Accepted Answer

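I think what you are looking for is a simple dictionary structure. It lets you keep track not only of the words you are looking for, but also of their counts.

A dictionary stores things as key/value pairs. So, for example, you could have the key "alice" (the word you are looking for) and set its value to the number of times you have found that word.

The easiest way to check whether something is in a dictionary is with Python's in keyword, i.e.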

if 'pie' in words_in_my_dict:
    pass  # do something
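
As a small illustration of the key/value idea and the in check, here is a sketch with made-up data; the dictionary contents and variable names are assumptions chosen for the example.

```python
# Assumed example data: a dictionary mapping each word to its count so far.
words_in_my_dict = {'pie': 2, 'cake': 0}

# `in` tests the keys, not the values.
has_pie = 'pie' in words_in_my_dict        # True
has_tart = 'tart' in words_in_my_dict      # False

# A key can be present even when its count is 0.
has_cake = 'cake' in words_in_my_dict      # True

# dict.get returns a default instead of raising KeyError for a missing key.
tart_count = words_in_my_dict.get('tart', 0)

print(has_pie, has_tart, has_cake, tart_count)
```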

With this information, setting up a word counter is easy!

def get_word_counts(words_to_count, text):
    # text is the file's contents as a single string
    words = text.split(' ')
    for word in words:
        if word in words_to_count:
            words_to_count[word] += 1
    return words_to_count

if __name__ == '__main__':

    fake_file_contents = (
        "Alice's Adventures in Wonderland (commonly shortened to "
        "Alice in Wonderland) is an 1865 novel written by English"
        " author Charles Lutwidge Dodgson under the pseudonym Lewis"
        " Carroll.[1] It tells of a girl named Alice who falls "
        "down a rabbit hole into a fantasy world populated by peculiar,"
        " anthropomorphic creatures. The tale plays with logic, giving "
        "the story lasting popularity with adults as well as children."
        "[2] It is considered to be one of the best examples of the literary "
        "nonsense genre,[2][3] and its narrative course and structure, "
        "characters and imagery have been enormously influential[3] in "
        "both popular culture and literature, especially in the fantasy genre."
        )

    words_to_count = {
        'alice' : 0,
        'and' : 0,
        'the' : 0
        }

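    print(get_word_counts(words_to_count, fake_file_contents))

This gives the output:

{'alice': 0, 'and': 4, 'the': 5}

The dictionary stores both the words we want to count and the number of times they appear. The whole algorithm just checks whether each word is in the dict, and if it is, adds 1 to that word's value. Note that 'alice' stays at 0 because the match is case-sensitive and the text only contains "Alice" and "Alice's".

Read up on dictionaries here.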
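
If case should not matter, one option is to lowercase the text before comparing. The following is a minimal sketch under that assumption; the function name count_words_ignoring_case and the sample sentence are hypothetical. It also builds a new dictionary instead of modifying the one passed in, so the caller's data is left untouched.

```python
def count_words_ignoring_case(targets, text):
    """Hypothetical variant: count each target word, ignoring case."""
    counts = {target.lower(): 0 for target in targets}
    for word in text.lower().split():
        if word in counts:
            counts[word] += 1
    return counts


# Assumed sample text for illustration.
sample = "Alice met the Hatter and THE Hare and Alice left"
result = count_words_ignoring_case(['alice', 'and', 'the'], sample)
print(result)
```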

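Edit:

If you want to count all the words and then look up a particular set, a dictionary is still great for the task (and fast!).

The only change we need to make is to first check whether the key exists in the dictionary and, if it does not, add it.

Example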

def get_all_word_counts(text):
    # text is the file's contents as a single string
    words = text.split(' ')

    word_counts = {}
    for word in words:
        if word not in word_counts:  # if not already there,
            word_counts[word] = 0    # add it
        word_counts[word] += 1       # increment the count accordingly
    return word_counts

This gives the output:

and : 4
shortened : 1
named : 1
popularity : 1
peculiar, : 1
be : 1
populated : 1
is : 2
(commonly : 1
nonsense : 1
an : 1
down : 1
fantasy : 2
as : 2
examples : 1
have : 1
in : 4
girl : 1
tells : 1
best : 1
adults : 1
one : 1
literary : 1
story : 1
plays : 1
falls : 1
author : 1
giving : 1
enormously : 1
been : 1
its : 1
The : 1
to : 2
written : 1
under : 1
genre,[2][3] : 1
literature, : 1
into : 1
pseudonym : 1
children.[2] : 1
imagery : 1
who : 1
influential[3] : 1
characters : 1
Alice's : 1
Dodgson : 1
Adventures : 1
Alice : 2
popular : 1
structure, : 1
1865 : 1
rabbit : 1
English : 1
Lutwidge : 1
hole : 1
Carroll.[1] : 1
with : 2
by : 2
especially : 1
a : 3
both : 1
novel : 1
anthropomorphic : 1
creatures. : 1
world : 1
course : 1
considered : 1
Lewis : 1
Charles : 1
well : 1
It : 2
tale : 1
narrative : 1
Wonderland) : 1
culture : 1
of : 3
Wonderland : 1
the : 5
genre. : 1
logic, : 1
lasting : 1

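Note: as you can see, there are a few "failures" when we split(' ') the file. Specifically, some words have an opening or closing parenthesis attached. You will have to account for that in your file processing. However, I leave that for you to figure out!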
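
One possible way to deal with the attached punctuation, offered as a sketch rather than the answer's intended solution, is to extract words with a regular expression instead of splitting on spaces. The pattern, the function name get_clean_word_counts, and the sample string below are assumptions. The pattern keeps internal apostrophes (so "Alice's" stays one token) and drops parentheses, commas, periods, and digits, which means numbers such as 1865 and footnote markers such as [2] are not counted.

```python
import re
from collections import Counter

# Assumed pattern: runs of ASCII letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def get_clean_word_counts(text):
    """Hypothetical helper: count lowercased words, ignoring punctuation."""
    return Counter(match.lower() for match in WORD_RE.findall(text))


sample = "Alice's Adventures in Wonderland (commonly shortened to Alice in Wonderland)"
counts = get_clean_word_counts(sample)
print(counts['wonderland'], counts['commonly'], counts["alice's"], counts['alice'])
```

With this tokenization, "Wonderland" and "Wonderland)" fall into the same bin, and "(commonly" is counted as "commonly".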

score 1 · Accepted Answer

I think using this many functions is overcomplicated. Why not do it all in one function?

# wrap this in a function if desired
# you may take the file path, the specific words, etc. as parameters

specificword = "foo"  # placeholder: the word you are looking for
counter = 0
with open("filename.txt") as f:
    for line in f:
        # you can remove punctuation by translating it to spaces;
        # now any interesting word will be surrounded by spaces and
        # you can detect it
        line = line.translate(str.maketrans(".,!?", "    "))
        words = line.split()  # splits on any amount of whitespace
        for word in words:
            if word == specificword:
                # or use a list of specific words:
                # if word in specificwordlist:
                counter += 1
                print(word)
                # you could also append the words to some list,
                # create a dictionary, etc.

score 0 · Accepted Answer

This might be enough... it is not exactly what you asked for, but the end result is what you want (I think).

from collections import Counter

interesting_words = ["ipsum","dolor"]

some_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec viverra consectetur sapien, sed posuere sem rhoncus quis. Mauris sit amet ligula et nulla ultrices commodo sed sit amet odio. Nullam vel lobortis nunc. Donec semper sem ut est convallis posuere adipiscing eros lobortis. Nullam tempus rutrum nulla vitae pretium. Proin ut neque id nisi semper faucibus. Sed sodales magna faucibus lacus tristique ornare.
"""

# Count every word, then keep only the (word, count) pairs of interest.
d = Counter(some_text.split())
final_list = [item for item in d.items() if item[0] in interesting_words]

However, its complexity is not great, so it may take a while on large files and/or a large list of interesting_words.
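
Part of that cost comes from testing membership in a list, which scans the list once for every distinct word. A sketch of one mitigation, assuming the words of interest are plain strings: keep them in a set, where a lookup takes constant time on average, and filter before counting so the Counter never stores uninteresting words. The sample text here is made up for the example.

```python
from collections import Counter

# Assumed inputs for illustration.
interesting_words = {"ipsum", "dolor"}   # a set: O(1) average membership test
some_text = "Lorem ipsum dolor sit amet, ipsum dolor ipsum"

# Count only the words we care about, skipping everything else up front.
counts = Counter(word for word in some_text.split() if word in interesting_words)
print(counts)
```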
