16

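I have a string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only counts characters: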

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except:
            d[word] = 1

    for k in d.keys():
        print "%s: %d" % (k, d[k])
print countWord("Hello I am going to I with Hello am")

I want to learn how to count the words.

4

9 Answers

42

If you want to find the count of an individual word, just use count:

input_string.count("Hello")

Use collections.Counter and split() to tally all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
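
As a rough, self-contained sketch of the two approaches above (written for Python 3; the variable names are only illustrative), it may help to see that str.count is case-sensitive and counts substrings, while Counter over split() tallies whitespace-separated tokens:

```python
from collections import Counter

input_string = "Hello I am going to I with hello am"

# str.count is case-sensitive and counts substrings, not whole words
hello_count = input_string.count("Hello")

# Counter over split() tallies every whitespace-separated token
word_count = Counter(input_string.split())

print(hello_count)          # 1 (the lowercase "hello" is not matched)
print(word_count["I"])      # 2
print(word_count["hello"])  # 1 ("Hello" and "hello" are separate keys)
```

Lowercasing the string before splitting would merge "Hello" and "hello" into one key.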
answered 2012-07-02T20:05:03.927
6

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())
answered 2012-07-02T20:05:06.327
5
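from collections import Counter
import re

def countWords(text):
    # Lowercase the text, then extract runs of word characters and apostrophes
    return Counter(re.findall(r"[\w']+", text.lower()))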

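Using re.findall is more versatile than split, because otherwise you cannot properly account for contractions such as "don't" and "I'll" when punctuation is attached to words.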
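
To illustrate the difference, here is a small sketch (Python 3; the sample sentence is made up for this example) comparing split() with the same regular expression on text that contains punctuation:

```python
import re

text = "Don't stop, don't!"

# split() leaves punctuation glued to the neighbouring word
split_tokens = text.lower().split()

# the regex keeps apostrophes inside words but drops other punctuation
regex_tokens = re.findall(r"[\w']+", text.lower())

print(split_tokens)  # ["don't", 'stop,', "don't!"]
print(regex_tokens)  # ["don't", 'stop', "don't"]
```

With split(), "don't" and "don't!" would be counted as two different words; with the regex they collapse into one.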

Demo (using your example):

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to make many such queries, this does O(N) work only once, rather than O(N*#queries) work.
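
A minimal sketch of that usage pattern (Python 3; the query list is an assumption made for illustration): build the Counter once, then answer each query with a dictionary lookup.

```python
from collections import Counter
import re

text = "Hello I am going to I with hello am"

# One O(N) pass over the text builds the whole table
counts = Counter(re.findall(r"[\w']+", text.lower()))

# Each later lookup is a dictionary access; missing words give 0
queries = ["hello", "am", "goodbye"]
results = {q: counts[q] for q in queries}

print(results)  # {'hello': 2, 'am': 2, 'goodbye': 0}
```

Calling text.count(word) for every query would instead rescan the whole string each time.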

answered 2012-07-02T20:05:02.520
3

The vector of word occurrence counts is called a bag-of-words.

Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# min_df must be an int >= 1 (or a float in [0, 1]) in recent scikit-learn versions
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

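Output (the single-letter word "I" is missing because CountVectorizer's default token pattern only keeps tokens of two or more characters):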

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizerand() to get ngrams that include any punctuation as separate tokens?

answered 2015-08-11T23:40:15.700
2

Here is an alternative, case-insensitive approach:

sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

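It matches by converting both the string and the target to lowercase.

PS: This also takes care of the "am ham".count("am") == 2 problem with str.count() pointed out by @DSM below :)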
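
The substring problem can be reproduced with a short sketch (Python 3; the \b regex variant is an addition for comparison and not part of the answer above):

```python
import re

s = "am ham"

# str.count also matches "am" inside "ham"
substring_hits = s.count("am")

# comparing whole tokens only counts the standalone word
token_hits = sum(1 for w in s.lower().split() if w == "am")

# \b word boundaries give the same whole-word behaviour with a regex
regex_hits = len(re.findall(r"\bam\b", s.lower()))

print(substring_hits, token_hits, regex_hits)  # 2 1 1
```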

answered 2012-07-02T20:05:19.713
2

To treat Hello and hello as the same word, regardless of case:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})
answered 2012-07-02T20:14:35.970
1

You can split the string into words and count them, which gives the total number of words:

count = len(my_string.split())

answered 2020-01-23T10:02:52.910
0

You can use Python's regular expression library re to find all matches of the substring and return them as a list.

import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:

2
answered 2016-09-09T20:06:11.033
0
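def countSub(pat, string):
    # Count occurrences of pat in string (overlapping matches included)
    result = 0
    for i in range(len(string) - len(pat) + 1):
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break
        else:
            # inner loop finished without break: pat matches at position i
            result += 1
    return result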
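
Because the function above checks every start position, it counts overlapping matches, which str.count does not. The sketch below (Python 3) shows the difference using a hypothetical helper, count_overlapping, built on str.find; its treatment of an empty pattern is an assumption, not something the answer specifies. Like the function above, it counts substrings rather than whole words.

```python
def count_overlapping(pat, string):
    """Count occurrences of pat in string, allowing overlaps (hypothetical helper)."""
    if not pat:
        return 0  # assumption: treat an empty pattern as having no matches
    count = 0
    start = string.find(pat)
    while start != -1:
        count += 1
        # advance by one character so overlapping matches are found
        start = string.find(pat, start + 1)
    return count

print(count_overlapping("aa", "aaaa"))  # 3
print("aaaa".count("aa"))               # 2 (non-overlapping only)
```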
answered 2018-11-01T19:45:55.527