python - How do I count word frequencies across the whole file for a list of lemmas?

Question

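I have a file with three columns (separated by \t; the first column is the word, the second the lemma, and the third the tag). Some lines consist only of a dot or a comma.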

<doc n=1 id="CMP/94/10">
<head p="80%">
Customs customs tag1
union   union   tag2
in  in  tag3
danger  danger  tag4
of  of  tag5
the the tag6
</head>
<head p="80%">
New new tag7
restrictions    restriction tag8
in  in  tag3
the the tag6
.
Hi  hi  tag8

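Suppose the user searches for the lemma "in". I want the frequency of "in" and the frequencies of the lemmas that occur directly before and after "in". So I want the frequencies of "union", "danger", "restriction" and "the" across the whole corpus. The result should be: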

union    1  
danger   1 
restriction    1  
the    2

How can I do this? I tried using lemma_counter = {}, but it doesn't work.

I'm not experienced with Python, so please correct me if I've made any mistakes.

c = open("corpus.vert")

corpus = []

for line in c:
    if not line.startswith("<"):
        corpus.append(line)

lemma = raw_input("Lemma you are looking for: ")

counter = 0
lemmas_before_after = []       
for i in range(len(corpus)):
    parsed_line = corpus[i].split("\t")
    if len(parsed_line) > 1:
        if parsed_line[1] == lemma: 
            counter += 1    #this counts lemma frequency


            new_list = []

            for j in range(i-1, i+2):
                if j < len(corpus) and j >= 0:
                    parsed_line_with_context = corpus[j].split("\t")
                    found_lemma = parsed_line_with_context[0].replace("\n","")
                    if len(parsed_line_with_context) > 1:
                        if lemma != parsed_line_with_context[1].replace("\n",""):
                            lemmas_before_after.append(found_lemma)
                    else:
                        lemmas_before_after.append(found_lemma)

print "list of lemmas ", lemmas_before_after


lemma_counter = {}
for i in range(len(corpus)):
    for lemma in lemmas_before_after:
        if parsed_line[1] == lemma:
            if lemma in lemma_counter:
                lemma_counter[lemma] += 1
            else:
                lemma_counter[lemma] = 1

print lemma_counter


fA = counter
print "lemma frequency: ", fA

score 0 · Accepted Answer

This should get you 80% of the way there.

# Let's use some useful pieces of the awesome standard library
from collections import namedtuple, Counter

# Define a simple structure to hold the properties of each entry in corpus
CorpusEntry = namedtuple('CorpusEntry', ['word', 'lemma', 'tag'])

# Use a context manager ("with...") to automatically close the file when we no
# longer need it
with open('corpus.vert') as c:
    corpus = []
    for line in c:
        if len(line.strip()) > 1 and not line.startswith('<'):
            # Remove the newline character and split at tabs
            word, lemma, tag = line.strip().split('\t')
            # Put the obtained values in the structure
            entry = CorpusEntry(word, lemma, tag)
            # Put the structure in the corpus list
            corpus.append(entry)

# It's practical to wrap the counting in a function
def get_frequencies(lemma):
    # Create a set of indices at which the lemma occurs in corpus. We use a
    # set because it is more efficient for the next part, checking if some
    # index is in this set
    lemma_indices = set()
    # Loop over corpus without manual indexing; enumerate provides information
    # about the current index and the value (some CorpusEntry added earlier).
    for index, entry in enumerate(corpus):
        if entry.lemma == lemma:
            lemma_indices.add(index)

    # Now that we have the indices at which the lemma occurs, we can loop over
    # corpus again and for each entry check if it is either one before or
    # one after the lemma. If so, add the entry's lemma to a new set.
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        before_lemma = index+1 in lemma_indices
        after_lemma = index-1 in lemma_indices
        if before_lemma or after_lemma:
            related_lemmas.add(entry.lemma)

    # Finally, we need to count the number of occurrences of those related
    # lemmas
    counter = Counter()
    for entry in corpus:
        if entry.lemma in related_lemmas:
            counter[entry.lemma] += 1

    return counter

print get_frequencies('in')
# Counter({'the': 2, 'union': 1, 'restriction': 1, 'danger': 1})
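
One thing to watch in the loading step above: the `len(line.strip()) > 1` filter drops lines that hold only "." or ",", so the tokens on either side of such a line end up adjacent in `corpus`. If punctuation should count as context, a minimal sketch of a more tolerant parser might look like the following. The helper name `parse_corpus` and the choice to store punctuation with a tag of `None` are assumptions for illustration, not part of the original answer, and the sketch is written for Python 3.

```python
from collections import namedtuple

CorpusEntry = namedtuple('CorpusEntry', ['word', 'lemma', 'tag'])

def parse_corpus(lines):
    """Turn raw corpus lines into a list of CorpusEntry tuples.

    Assumption: a line without tabs (e.g. "." or ",") is kept as an entry
    whose word and lemma are the token itself and whose tag is None.
    """
    corpus = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and markup such as <doc ...> or </head>
        if not line or line.startswith('<'):
            continue
        fields = line.split('\t')
        if len(fields) == 3:
            corpus.append(CorpusEntry(*fields))
        elif len(fields) == 1:
            # Punctuation-only line: keep it so neighbours stay in place
            corpus.append(CorpusEntry(fields[0], fields[0], None))
        # Lines with any other column count are ignored in this sketch
    return corpus
```

It accepts any iterable of strings, so an open file object can be passed directly, for example `with open('corpus.vert') as c: corpus = parse_corpus(c)`.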

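It could be written more concisely (see below), and the algorithm could be improved too, although it is still O(n); the point was to keep it easy to follow.

For those who are interested: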

with open('corpus.vert') as c:
    corpus = [CorpusEntry(*line.strip().split('\t')) for line in c
              if len(line.strip()) > 1 and not line.startswith('<')]

def get_frequencies(lemma):
    lemma_indices = {index for index, entry in enumerate(corpus)
                     if entry.lemma == lemma}
    related_lemmas = {entry.lemma for index, entry in enumerate(corpus)
                      if lemma_indices & {index+1, index-1}}
    return Counter(entry.lemma for entry in corpus
                   if entry.lemma in related_lemmas)

Here is a more procedural version, which runs about three times as fast:

def get_frequencies(lemma):
    counter = Counter()
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        counter[entry.lemma] += 1
        if entry.lemma == lemma:
            if index > 0:
                related_lemmas.add(corpus[index-1].lemma)
            if index < len(corpus)-1:
                related_lemmas.add(corpus[index+1].lemma)
    return {lemma: frequency for lemma, frequency in counter.iteritems()
            if lemma in related_lemmas}
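
The snippets above target Python 2 (`print` statement, `iteritems`). As a rough sketch of the same single-pass idea under Python 3, the version below runs against the sample rows from the question held in memory instead of reading `corpus.vert`. The names `SAMPLE` and `neighbour_frequencies` are made up for this illustration, and the lone "." line is left out, which matches the filtering done in the answer.

```python
from collections import Counter, namedtuple

CorpusEntry = namedtuple('CorpusEntry', ['word', 'lemma', 'tag'])

# Sample rows from the question (markup and the lone "." line left out)
SAMPLE = [
    ('Customs', 'customs', 'tag1'),
    ('union', 'union', 'tag2'),
    ('in', 'in', 'tag3'),
    ('danger', 'danger', 'tag4'),
    ('of', 'of', 'tag5'),
    ('the', 'the', 'tag6'),
    ('New', 'new', 'tag7'),
    ('restrictions', 'restriction', 'tag8'),
    ('in', 'in', 'tag3'),
    ('the', 'the', 'tag6'),
    ('Hi', 'hi', 'tag8'),
]

def neighbour_frequencies(corpus, lemma):
    """Corpus-wide counts of every lemma found directly next to `lemma`."""
    counter = Counter()
    related = set()
    for index, entry in enumerate(corpus):
        # Count every lemma once while walking the corpus
        counter[entry.lemma] += 1
        if entry.lemma == lemma:
            # Remember the lemmas immediately before and after the match
            if index > 0:
                related.add(corpus[index - 1].lemma)
            if index < len(corpus) - 1:
                related.add(corpus[index + 1].lemma)
    return {name: counter[name] for name in related}

corpus = [CorpusEntry(*row) for row in SAMPLE]
result = neighbour_frequencies(corpus, 'in')
# result maps 'union', 'danger' and 'restriction' to 1 and 'the' to 2
```

Passing `corpus` as an argument rather than reading a module-level variable is a small design change from the answer; it makes the function easier to try on different inputs.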
