TL;DR
Use collections.Counter
To get the counts of unique words in a column of a dataframe (without stopwords)
Given:
$ cat test.csv
Description
crazy mind california medical service data base...
california licensed producer recreational & medic...
silicon valley data clients live beyond status...
mycrazynotes inc. announces $144.6 million expans...
leading provider sustainable energy company prod ...
livefreecompany founded 2005, listed new york stock...
Code:
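from collections import Counter
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from nltk import word_tokenize
# Assumes the NLTK stopwords corpus and tokenizer data were already fetched with nltk.download().
# Stopwords plus punctuation. Note: this set is built but not applied to the counts below,
# which is why punctuation tokens still show up in the output.
stoplist = set(stopwords.words('english') + list(punctuation))
# test.csv holds a single 'Description' column.
df = pd.read_csv("test.csv", sep='\t')
# Lowercase so that counting is case-insensitive.
texts = df['Description'].str.lower()
# Join all rows into one string, tokenize it, and count each token.
word_counts = Counter(word_tokenize('\n'.join(texts)))
word_counts.most_common()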
[out]:
[('...', 6), ('california', 2), ('data', 2), ('crazy', 1), ('mind', 1), ('medical', 1), ('service', 1), ('base', 1), ('licensed', 1), ('producer', 1), ('recreational', 1), ('&', 1), ('medic', 1), ('silicon', 1), ('valley', 1), ('clients', 1), ('live', 1), ('beyond', 1), ('status', 1), ('mycrazynotes', 1), ('inc.', 1), ('announces', 1), ('$', 1), ('144.6', 1), ('million', 1), ('expans', 1), ('leading', 1), ('provider', 1), ('sustainable', 1), ('energy', 1), ('company', 1), ('prod', 1), ('livefreecompany', 1), ('founded', 1), ('2005', 1), (',', 1), ('listed', 1), ('new', 1), ('york', 1), ('stock', 1)]
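The output above still contains punctuation tokens such as '...' and '&', because stoplist is never applied to the tokens. Here is a minimal sketch of the filtering step, written with only the standard library so it runs without any NLTK data. ENGLISH_STOPWORDS is a hypothetical, tiny stand-in for stopwords.words('english'), count_words is an illustrative helper name, and splitting on whitespace then stripping punctuation is a simplification of word_tokenize, so the tokens will not match NLTK's exactly.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}
stoplist = ENGLISH_STOPWORDS | set(punctuation)


def count_words(texts, stoplist):
    """Count lowercased tokens across all texts, skipping anything in stoplist."""
    counts = Counter()
    for text in texts:
        for raw in text.lower().split():
            # Simplified tokenization: trim punctuation from both ends of each chunk.
            tok = raw.strip(punctuation)
            # Drop chunks that were pure punctuation (now empty) and stoplisted words.
            if tok and tok not in stoplist:
                counts[tok] += 1
    return counts


# First three rows of test.csv, truncation kept as in the source.
texts = [
    "crazy mind california medical service data base...",
    "california licensed producer recreational & medic...",
    "silicon valley data clients live beyond status...",
]

word_counts = count_words(texts, stoplist)
print(word_counts.most_common(2))  # [('california', 2), ('data', 2)]
```

With the NLTK setup from the code above, the same idea would be Counter(tok for tok in word_tokenize('\n'.join(texts)) if tok not in stoplist). The '...' token would still need to be added to stoplist by hand, since string.punctuation only contains single characters.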