2

I have a dict of words. For each key in the dict, I want to find its frequency in an article.

After I open the article, I do:

for k, v in sourted_key.items():
    for token in re.findall(k, data)
        token[form] += 1

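In re.findall(k, data) the key must be a string, but the keys in this dict are not. I want to search for the keys. Are there any other solutions? Note that the keys contain many punctuation marks.

For example, if the key is "hand", it should match only hand, not handy or chandler.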

4

7 Answers

6

In Python 2.7+, you can use collections.Counter:

import re, collections

text = '''Nullam euismod magna et ipsum tristique suscipit. Aliquam ipsum libero, cursus et rutrum ut, suscipit id enim. Maecenas vel justo dolor. Integer id purus ante. Aliquam volutpat iaculis consectetur. Suspendisse justo sapien, tincidunt ut consequat eget, fringilla id sapien. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Praesent mattis velit vitae libero luctus posuere. Vestibulum ac erat nibh, vel egestas enim. Ut ac eros ipsum, ut mattis justo. Praesent dignissim odio vitae nisl hendrerit sodales. In non felis leo, vehicula aliquam risus. Morbi condimentum nunc sit amet enim rutrum a gravida lacus pharetra. Ut eu nisi et magna hendrerit pharetra placerat vel turpis. Curabitur nec nunc et augue tristique semper.'''

c = collections.Counter(w.lower() for w in re.findall(r'\w+|[.,:;?!]', text))
words = set(('et', 'ipsum', ',', '?'))
for w in words:
  print('%s: %d' % (w, c.get(w, 0)))
Answered 2012-05-08T15:11:02.727
2
my_text = 'abc,abc,efr,sdgret,er,ttt,'

my_dict = {'abc':0, 'er': 0}

for word in my_text.split(','):
    if word in my_dict:
        my_dict[word] += 1

Result:

>>> my_dict
{'abc': 2, 'er': 1}

Edit: a more general solution

For an ordinary article, we need to use a regular expression:

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"
my_dict = {'IS': 0, 'TRUE': 0}

words = re.findall(r'\w+', my_string)
cap_words = [word.upper() for word in words]

for word in cap_words:
    if word in my_dict:
        my_dict[word] += 1

Result:

>>> my_dict
{'IS': 2, 'TRUE': 1}
Answered 2012-05-08T15:17:43.423
2

I would do it like this:

tokens = {} 
d= {"a":1,"b":2}
data = "abca"
for k in d.keys():
    tokens[k] = data.count(k)
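
One thing to keep in mind with str.count is that it counts substrings, so a key such as "hand" is also found inside "handy" and "chandler". The sketch below (Python 3, with a made-up sample string) contrasts that with counting whole tokens; the `\w+` tokenizer is an assumption and would not suit keys that contain punctuation.

```python
import re

data = "a hand, a handy chandler"  # illustrative sample, not from the question

# str.count counts non-overlapping substring occurrences.
substring_hits = data.count("hand")

# Counting whole tokens instead: split into words first, then compare.
tokens = re.findall(r"\w+", data)
whole_word_hits = tokens.count("hand")

print(substring_hits, whole_word_hits)  # 3 1
```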
Answered 2012-05-08T15:21:13.907
1

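Try re.findall(re.escape(k), data) to make sure special characters in the "words" don't cause problems.

But my guess is that this isn't your actual problem. findall() returns a list of strings (or tuples, if the pattern contains groups), not match objects, and strings are immutable, so token[form] += 1 cannot work.

You probably meant counts[token] += 1, where counts is a dict whose default value is 0. (With re.finditer(), which does yield match objects, that would be counts[token.group()] += 1.)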
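
A minimal sketch of how the pieces could fit together, assuming Python 3: each key is converted with str() and escaped, and lookarounds keep "hand" from matching inside "handy" or "chandler". The helper name count_keys and the sample text are hypothetical, and the lookaround choice is only one possible definition of a "whole" match.

```python
import re
from collections import defaultdict

def count_keys(keys, data):
    """Count whole-token occurrences of each key in data (illustrative helper)."""
    counts = defaultdict(int)
    for k in keys:
        # re.escape neutralises punctuation in the key; the lookarounds
        # reject matches glued to a neighbouring word character.
        pattern = r'(?<!\w)' + re.escape(str(k)) + r'(?!\w)'
        for m in re.finditer(pattern, data):
            counts[m.group()] += 1
    return dict(counts)

print(count_keys(['hand', 'e.g.'], 'A hand, a handy chandler, e.g. one more hand.'))
# {'hand': 2, 'e.g.': 1}
```

Keys that never match are simply absent from the result. Lookarounds are used rather than \b because \b behaves unexpectedly when a key starts or ends with punctuation.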

Answered 2012-05-08T15:14:12.970
1

Option A

import re

text = """Now is the time for all good men to come to the aid of their country.  A man is only as good as all his thoughts."""
words = dict()

for word in re.findall('[^ .;]+', text):
    if words.get(word.lower(), False):
        words[word.lower()] += 1
    else:
        words[word.lower()] = 1

print(words)

This produces...

{'a': 1, 'all': 2, 'good': 2, 'for': 1, 'their': 1, 'of': 1, 
'is': 2, 'men': 1, 'as': 2, 'country': 1, 'to': 2, 'only': 1, 
'his': 1, 'time': 1, 'aid': 1, 'the': 2, 'now': 1, 'come': 1, 
'thoughts': 1, 'man': 1}

Option B: using defaultdict

import re
from collections import defaultdict

text = """Now is the time for all good men to come to the aid of their country.  A man is only as good as all his thoughts."""
words = defaultdict(int)

for word in re.findall('[^ .;]+', text):
    words[word.lower()] += 1

print(words)

This results in...

defaultdict(<type 'int'>, {'a': 1, 'all': 2, 'good': 2, 'for': 1, 
'their': 1, 'of': 1, 'is': 2, 'men': 1, 'as': 2, 'country': 1, 'to': 2, 
'only': 1, 'his': 1, 'time': 1, 'aid': 1, 'the': 2, 'now': 1, 'come': 1, 
'thoughts': 1, 'man': 1})
Answered 2012-05-08T15:26:01.707
0
article = "I have a dict of words. For each key in the dict, I want to find its frequency in an article"

words = {"dict", "i", "in", "key"} # set of words


wordsFreq = {}

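# Lower-case every space-separated token so the lookup is case-insensitive.
wordsInArticle = tuple(word.lower() for word in article.split(" "))

for word in wordsInArticle:
  # Count only the words we are interested in.
  if word in words:
    wordsFreq[word] = wordsFreq[word] + 1 if word in wordsFreq else 1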
Answered 2012-05-08T15:26:07.310
0

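Since everyone is taking a swing at this...

What differs here is the regular expression that separates the text from the punctuation. I use \b\w+\b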

import re 

article='''Richard II (13671400) was King of England, a member of the House of Plantagenet and the last of its main-line kings. He ruled from 1377 until he was deposed in 1399. Richard was a son of Edward, the Black Prince, and was born during the reign of his grandfather, Edward III. Richard was tall, good-looking and intelligent. Although probably not insane, as earlier historians believed, he may have suffered from one or several personality disorders that may have become more apparent toward the end of his reign. Less of a warrior than either his father or grandfather, he sought to bring an end to the Hundred Years' War that Edward III had started. He was a firm believer in the royal prerogative, which led him to restrain the power of his nobility and rely on a private retinue for military protection instead. He also cultivated a courtly atmosphere where the king was an elevated figure, and art and culture were at the centre, in contrast to the fraternal, martial court of his grandfather. Richard's posthumous reputation has to a large extent been shaped by Shakespeare, whose play Richard II portrays Richard's misrule and Bolingbroke's deposition as responsible for the 15th-century Wars of the Roses. Most authorities agree that the way in which he carried his policies out was unacceptable to the political establishment, and this led to his downfall.'''
words = {}

for word in re.findall(r'\b\w+\b', article):
    word=word.lower()
    if word in words:
        words[word]+=1
    else:
        words[word]=1    

print([(k, v) for v, k in sorted(((v, k) for k, v in words.items()), reverse=True)])

This prints a list of (word, count) tuples sorted by frequency:

[('the', 15), ('of', 11), ('was', 8), ('and', 8), ('to', 7), ('his', 7), ('he', 7), 
 ('a', 7), ('richard', 6), ('in', 4), ('that', 3), ('s', 3), ('grandfather', 3), 
 ('edward', 3), ('which', 2), ('reign', 2), ('or', 2), ('may', 2), ('led', 2), 
 ('king', 2), ('iii', 2), ('ii', 2), ('have', 2), ('from', 2), ('for', 2), ('end', 2), 
 ('as', 2), ('an', 2), ('years', 1), ('whose', 1), ('where', 1), ('were', 1), ('way', 1), ('wars', 1), ('warrior', 1), ('war', 1), ('until', 1), ('unacceptable', 1), ('toward', 1), ('this', 1), ('than', 1), ('tall', 1), ('suffered', 1), ('started', 1), ('sought', 1), ('son', 1), ('shaped', 1), ('shakespeare', 1), ('several', 1), ('ruled', 1), ('royal', 1), ('roses', 1), ('retinue', 1), ('restrain', 1), ('responsible', 1), ('reputation', 1), ('rely', 1), ('protection', 1), ('probably', 1), ('private', 1), ('prince', 1), ('prerogative', 1), ('power', 1), ('posthumous', 1), ('portrays', 1), ('political', 1), ('policies', 1), ('play', 1), ('plantagenet', 1), ('personality', 1), ('out', 1), ('one', 1), ('on', 1), ('not', 1), ('nobility', 1), ('most', 1), ('more', 1), ('misrule', 1), ('military', 1), ('member', 1), ('martial', 1), ('main', 1), ('looking', 1), ('line', 1), ('less', 1), ('last', 1), ('large', 1), ('kings', 1), ('its', 1), ('intelligent', 1), ('instead', 1), ('insane', 1), ('hundred', 1), ('house', 1), ('historians', 1), ('him', 1), ('has', 1), ('had', 1), ('good', 1), ('fraternal', 1), ('firm', 1), ('figure', 1), ('father', 1), ('extent', 1), ('establishment', 1), ('england', 1), ('elevated', 1), ('either', 1), ('earlier', 1), ('during', 1), ('downfall', 1), ('disorders', 1), ('deposition', 1), ('deposed', 1), ('culture', 1), ('cultivated', 1), ('courtly', 1), ('court', 1), ('contrast', 1), ('century', 1), ('centre', 1), ('carried', 1), ('by', 1), ('bring', 1), ('born', 1), ('bolingbroke', 1), ('black', 1), ('believer', 1), ('believed', 1), ('been', 1), ('become', 1), ('authorities', 1), ('atmosphere', 1), ('at', 1), ('art', 1), ('apparent', 1), ('although', 1), ('also', 1), ('agree', 1), ('15th', 1), ('1399', 1), ('1377', 1), ('13671400', 1)]
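
For comparison, the same frequency ordering can be sketched with collections.Counter and its most_common() method, assuming Python 3. The sample string is made up for brevity and is not the article above.

```python
import re
from collections import Counter

sample = "The king and the court and the crown"  # illustrative text only

counts = Counter(w.lower() for w in re.findall(r"\b\w+\b", sample))

# most_common() returns (word, count) pairs, highest count first.
top = counts.most_common(2)
print(top)  # [('the', 3), ('and', 2)]
```

The order of words with equal counts may differ from the sorted() version above, which breaks ties by reverse alphabetical order.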
Answered 2012-05-08T15:55:45.150