python - Count distinct words from a Pandas Data Frame

Question

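I have a Pandas data frame in which one column contains text. I would like to get a list of the unique words that appear across the entire column (whitespace being the only split).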

import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

The output should look like this:

['my','nickname','is','ft.jgt','someone','going','to','place']

Getting the counts as well wouldn't hurt, but it is not required.

score 81 · Accepted Answer

Use a set to create the sequence of unique elements.

Do some clean-up on the df to get the strings in lower case and split them:

df['text'].str.lower().str.split()
Out[43]: 
0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to the set.update function to collect the unique values. Use apply to do so:

results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

set(['someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'])
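
A set does not keep the words in any particular order. If the first-appearance order shown in the question matters, one possible sketch (assuming pandas 0.25 or later, where Series.explode is available) is to flatten the lists and call unique:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Lower-case, split on whitespace, then flatten the per-row lists into one Series.
words = df['text'].str.lower().str.split().explode()

# unique() returns values in order of first appearance.
unique_words = words.unique().tolist()
print(unique_words)
```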

Or use it with Counter(), as suggested in the comments:

from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
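
For reference, the same counting can be sketched without pandas. Counter.update adds to the existing tallies on every call, which is why passing it to apply accumulates counts across rows. The loop below is only an illustration of that behaviour:

```python
from collections import Counter

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

results = Counter()
for line in r1:
    # update() increments the count of every word in the list it receives.
    results.update(line.lower().split())

# Words seen in both rows end up with a count of 2.
print(results['my'], results['is'], results['place'])
```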

score 25 · Accepted Answer

Use collections.Counter:

>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
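
Note that this version is case-sensitive ('My' and 'my' are counted separately), and that split(" ") behaves differently from split() when the text contains runs of spaces. A small sketch of the difference, using a made-up input string:

```python
from collections import Counter

text = "my  place is my place"  # note the double space (made-up example)

# split(" ") splits on every single space, so a run of spaces yields empty strings.
by_space = text.split(" ")

# split() with no argument splits on any run of whitespace and drops empty strings.
by_whitespace = text.split()

print(by_space)
print(by_whitespace)
print(Counter(by_whitespace))
```

With split(" "), the empty string would be counted as a word, so split() is usually the safer choice for free text.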

score 24 · Accepted Answer

If you want to do it from the DataFrame construct:

import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

df.text.apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0)

My          1
Someone     1
ft.jgt      1
going       1
is          2
my          1
nickname    1
place       1
to          1
dtype: float64

If you want more flexible tokenization, use nltk and its tokenize.
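
The apply/value_counts version above builds an intermediate frame with one column per distinct word, which is why the result comes back as float64 and why it can grow large on big inputs. A possible alternative, assuming pandas 0.25 or later for Series.explode, flattens the words first and counts once:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# One row per word, then a single value_counts over the flattened Series.
word_counts = df['text'].str.split().explode().value_counts()
print(word_counts)
```

Like the original, this sketch is case-sensitive; adding .str.lower() before .str.split() would merge 'My' and 'my'.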

score 11 · Accepted Answer

Building on @Ofir Israel's answer, specifically for Pandas:

from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
result

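This gives you what you want: it converts the values of the text column Series to a list, joins them, splits on spaces, and counts the instances.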

score 5 · Accepted Answer

uniqueWords = list(set(" ".join(r1).lower().split(" ")))
count = len(uniqueWords)

Answered 2013-09-21T19:59:29.610

score 1 · Accepted Answer

Adding to the discussion, here are the timings for the three proposed solutions (skipping the conversion to list) on a 92816-row dataframe:

from collections import Counter
results = set()

%timeit -n 10 set(" ".join(df['description'].values.tolist()).lower().split(" "))

323 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df['description'].str.lower().str.split(" ").apply(results.update)

316 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 Counter(" ".join(df['description'].str.lower().values.tolist()).split(" "))

365 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

len(list(set(" ".join(df['description'].values.tolist()).lower().split(" "))))

13561

len(results)

13561

len(Counter(" ".join(df['description'].str.lower().values.tolist()).split(" ")).items())

13561

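I also tried the Pandas-only approach, but it took much longer and used > 25GB of RAM, making my 32GB laptop swap.

All the others were quite fast. I would use solution 1 because it is a one-liner, or solution 3 if word counts are needed.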

score 0 · Accepted Answer

TL;DR

Use collections.Counter to get the counts of unique words in a column of a dataframe (without stopwords).

Given:

$ cat test.csv 
Description
crazy mind california medical service data base...
california licensed producer recreational & medic...
silicon valley data clients live beyond status...
mycrazynotes inc. announces $144.6 million expans...
leading provider sustainable energy company prod ...
livefreecompany founded 2005, listed new york stock...

Code:

from collections import Counter
from string import punctuation

import pandas as pd

from nltk.corpus import stopwords
from nltk import word_tokenize

stoplist = set(stopwords.words('english') + list(punctuation))

df = pd.read_csv("test.csv", sep='\t')

texts = df['Description'].str.lower()

word_counts = Counter(word_tokenize('\n'.join(texts)))

word_counts.most_common()

[out]:

[('...', 6), ('california', 2), ('data', 2), ('crazy', 1), ('mind', 1), ('medical', 1), ('service', 1), ('base', 1), ('licensed', 1), ('producer', 1), ('recreational', 1), ('&', 1), ('medic', 1), ('silicon', 1), ('valley', 1), ('clients', 1), ('live', 1), ('beyond', 1), ('status', 1), ('mycrazynotes', 1), ('inc.', 1), ('announces', 1), ('$', 1), ('144.6', 1), ('million', 1), ('expans', 1), ('leading', 1), ('provider', 1), ('sustainable', 1), ('energy', 1), ('company', 1), ('prod', 1), ('livefreecompany', 1), ('founded', 1), ('2005', 1), (',', 1), ('listed', 1), ('new', 1), ('york', 1), ('stock', 1)]
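
The output still contains tokens such as '&', '$' and ',' because the stoplist is built but never applied to the tokens. A minimal sketch of that filtering step follows. To stay self-contained it uses a tiny hand-written stop set and str.split as stand-ins (assumptions) for nltk's stopwords.words('english') and word_tokenize, and the two sample strings are made up:

```python
from collections import Counter
from string import punctuation

# Stand-in for set(stopwords.words('english') + list(punctuation)); the three
# words below are a hypothetical, heavily shortened stop list.
stoplist = {'the', 'a', 'of'} | set(punctuation)

texts = ['the state of california & the valley', 'a valley of data']

# Stand-in for word_tokenize: plain whitespace splitting.
tokens = ' '.join(texts).lower().split()

# Keep only tokens that are not in the stop list, then count them.
word_counts = Counter(tok for tok in tokens if tok not in stoplist)
print(word_counts.most_common())
```

With the real nltk objects, the same generator expression could be applied to the result of word_tokenize. Multi-character tokens such as '...' are not in string.punctuation and would need to be added to the stop list explicitly.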

score -4 · Accepted Answer

If the Dataframe has columns such as 'a', 'b', 'c', etc. and you want to count the distinct words in each column, then you can use:

Counter(dataframe['a']).items()
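
Note that Counter(dataframe['a']) tallies whole cell values rather than individual words. If per-column word counts are wanted, one possible sketch (the frame and its contents are made up for illustration) joins and splits each column first:

```python
from collections import Counter

import pandas as pd

# Hypothetical example frame with two text columns.
dataframe = pd.DataFrame({
    'a': ['red apple', 'red apple', 'green pear'],
    'b': ['one two', 'two three', 'three three'],
})

# Counting the Series directly treats each cell as a single item.
cell_counts = Counter(dataframe['a'])

# Joining and splitting each column counts individual words instead.
word_counts = {
    col: Counter(' '.join(dataframe[col]).lower().split())
    for col in dataframe.columns
}
print(cell_counts)
print(word_counts['a'])
print(word_counts['b'])
```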
