python - How do I count single-word frequencies and double-word (bigram) counts from input text in Python?

Question

Hello, I want to count single-word and double-word occurrences in an input text in Python. For example:

"what is your name ? what you want from me ?
 You know best way to earn money is Hardwork 
 what is your aim ?"

Output:

Single W.C. : 
what   3
 is    3
 your  2
you    2

etc.

Double W.C. :
what is 2
is your 2
your name 1
what you 1

etc. Could you please post a way to do this? I used the following code for the single-word count:

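ws = {}
for line in text:
    # assumes each line is already a list of words;
    # use line.split() instead if each line is a plain string
    for wrd in line:
        if wrd not in ws:
            ws[wrd] = 1
        else:
            ws[wrd] += 1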

score 3 · Accepted Answer

from collections import Counter

s = "..."

words = s.split()
pairs = zip(words, words[1:])

single_words, double_words = Counter(words), Counter(pairs)

Output:

# Python 3 print() calls; sort by count, highest first
print("single W.C.")
for word, count in sorted(single_words.items(), key=lambda x: -x[1]):
    print(word, count)

print("double W.C.")
for pair, count in sorted(double_words.items(), key=lambda x: -x[1]):
    print(pair, count)
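
The expected output in the question counts "you" twice, which only happens if "You" and "you" are treated as the same word. A minimal sketch of that variant, assuming lowercasing is acceptable and using `Counter.most_common` in place of the manual sort (the variable `text` is just the sample from the question):

```python
from collections import Counter

text = ("what is your name ? what you want from me ?\n"
        " You know best way to earn money is Hardwork \n"
        " what is your aim ?")

# Lowercase first so that "You" and "you" are counted together.
words = text.lower().split()

single_words = Counter(words)
double_words = Counter(zip(words, words[1:]))

# most_common(n) returns the n entries with the highest counts, largest first.
for word, count in single_words.most_common(4):
    print(word, count)

for pair, count in double_words.most_common(2):
    print(" ".join(pair), count)
```

Note that `split()` still treats "?" as a word here, so it shows up in the counts unless it is filtered out.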

score 2 · Accepted Answer

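import nltk
from nltk import bigrams

# text is the input string; word_tokenize needs NLTK's tokenizer data to be downloaded
tokens = nltk.word_tokenize(text)
# lowercase and drop one-character tokens such as "?"
tokens = [token.lower() for token in tokens if len(token) > 1]
# bigrams() returns a generator in NLTK 3, so materialize it before counting
bi_tokens = list(bigrams(tokens))

print([(item, tokens.count(item)) for item in sorted(set(tokens))])
print([(item, bi_tokens.count(item)) for item in sorted(set(bi_tokens))])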
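
Calling `list.count` once per distinct item rescans the whole list each time, which gets slow on long texts. As a rough sketch without NLTK, the same counts can be collected in a single pass, and the pairing idea extends to longer word sequences such as trigrams. The helper name `ngram_counts` is made up for this illustration:

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # tokens[i:] for i in 0..n-1 are shifted copies of the list;
    # zip stops at the shortest one, yielding each n-token window once.
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = "what is your name ? what is your aim ?".lower().split()

unigrams = ngram_counts(tokens, 1)        # keys are 1-tuples, e.g. ('what',)
bigram_counts = ngram_counts(tokens, 2)   # keys are pairs
trigram_counts = ngram_counts(tokens, 3)  # keys are triples

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```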

score 0 · Accepted Answer

This works. It uses defaultdict. Python 2.6.

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> string = ("what is your name ? what you want from me ?\n"
    "You know best way to earn money is Hardwork\n what is your aim ?")
>>> l = string.split()
>>> for i in l:
    d[i]+=1

>>> d
defaultdict(<type 'int'>, {'me': 1, 'aim': 1, 'what': 3, 'from': 1, 'name': 1, 
    'You': 1, 'money': 1, 'is': 3, 'earn': 1, 'best': 1, 'Hardwork': 1, 'to': 1, 
    'way': 1, 'know': 1, 'want': 1, 'you': 1, 'your': 2, '?': 3})
>>> d2 = defaultdict(int)
>>> for i in zip(l[:-1], l[1:]):
    d2[i]+=1

>>> d2
defaultdict(<type 'int'>, {('You', 'know'): 1, ('earn', 'money'): 1, 
    ('is', 'Hardwork'): 1, ('you', 'want'): 1, ('know', 'best'): 1, 
    ('what', 'is'): 2, ('your', 'name'): 1, ('from', 'me'): 1, 
    ('name', '?'): 1, ('?', 'You'): 1, ('?', 'what'): 1, ('to', 'earn'): 1, 
    ('aim', '?'): 1, ('way', 'to'): 1, ('Hardwork', 'what'): 1, 
    ('money', 'is'): 1, ('me', '?'): 1, ('what', 'you'): 1, ('best', 'way'): 1,
    ('want', 'from'): 1, ('is', 'your'): 2, ('your', 'aim'): 1})
>>>
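
One side effect of splitting the whole string at once is that pairs such as ('Hardwork', 'what') span a line break. If that is not wanted, a possible variation (just a sketch, assuming each line should be treated as its own unit) is to pair words line by line:

```python
from collections import defaultdict

text = ("what is your name ? what you want from me ?\n"
        "You know best way to earn money is Hardwork\n"
        "what is your aim ?")

pair_counts = defaultdict(int)
for line in text.splitlines():
    words = line.split()
    # Pair each word with its successor on the same line only.
    for pair in zip(words, words[1:]):
        pair_counts[pair] += 1

for pair, count in sorted(pair_counts.items(), key=lambda x: -x[1]):
    print(pair, count)
```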

score 0 · Accepted Answer

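I realize this question is a few years old. Today I wrote a small program to count the individual words in a Word document (docx). I used docx2txt to get the text out of the Word document, and my first regular expression to remove every character other than letters, digits, or whitespace and to switch all characters to uppercase. I'm posting this because the question had not been answered.

Here is my little test program, in case it helps anyone.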

import re
from collections import Counter

import docx2txt

mydoc = 'I:/flashdrive/pmw/pmw_py.docx'

words_all = {}

my_text = docx2txt.process(mydoc)
print(my_text)

my_text_org = my_text

# replace every run of non-alphanumeric characters with a space and uppercase the text
my_text = re.sub(r'[\W_]+', ' ', my_text.upper(), flags=re.UNICODE)
print(my_text)

words = my_text.split()

words_org = words  # just in case I may need the original version later

# added this code for the double words (it has to run after words is built)
pairs = zip(words, words[1:])
pair_list = Counter(pairs)

print('before pair listing')

for pair, count in sorted(pair_list.items(), key=lambda x: -x[1]):
    # print(' '.join(pair), '', count)  # also worked
    my_pair = "{} {}".format(pair[0], pair[1])
    print(my_pair, ": ", count)
# end of added code

# single-word counts
for i in words:
    if i not in words_all:
        words_all[i] = words.count(i)

for k, v in sorted(words_all.items()):
    print(k, v)

print("Number of items in word list: {}".format(len(words_all)))
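
To try the cleanup step without a .docx file, here is a small sketch that applies the same regular expression to an in-memory string. The `sample` text is invented for the illustration:

```python
import re
from collections import Counter

sample = "What is your name? What's your aim, really?"

# [\W_]+ matches any run of characters that are not letters or digits
# (the underscore is listed explicitly because \W does not cover it).
cleaned = re.sub(r"[\W_]+", " ", sample.upper())
words = cleaned.split()

counts = Counter(words)
for word, count in sorted(counts.items()):
    print(word, count)
```

One thing to be aware of: the apostrophe is replaced as well, so "What's" becomes the two words WHAT and S.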
