python - Is it possible to get a detailed word frequency list from Pandas Profiling?

Question

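I am currently working with a large number of files in which I need to check the frequency of certain strings. My first idea was to import all the files into a single dataset and use a for loop to search every file for the strings, using the following code.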

import pandas as pd

# Define an empty dataframe to append all imported files to
# (file_list is assumed to already hold the file names)
df = pd.DataFrame()
new_list = []

# If a text file is imported successfully, append the resulting dataframe to df. If an exception occurs, append "None" instead.
# "`" was chosen as the delimiter to ensure that each file is saved to a single row.
for i in file_list:
    path = f"D:/Admin/3. OCR files/OCR_Translations/{i}"
    try:
        df_1 = pd.read_csv(path, delimiter="`")
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df = pd.concat([df, df_1])
        new_list.append(path)
    except Exception:
        df = pd.concat([df, pd.DataFrame(["None"])])
        new_list.append("None")

df = df.T.reset_index()

# Search the dataset for the required keyword
count = 0

for i in df["index"]:
    # str() guards against non-string labels such as the integer column created by the "None" rows
    if "Keyword1" in str(i):
        count += 1

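This ultimately failed because there is absolutely no guarantee that the strings in these files are spelled correctly: the files in question were generated by an OCR program (and they are in Thai).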

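Pandas Profiling produces exactly what I need for the task at hand, but it does not provide the full list, as shown in this link ( https://imgur.com/xxf1Qnx ). Is there a way to get the full word frequency list out of Pandas Profiling? I have checked the pandas_profiling documentation (https://pandas-profiling.github.io/pandas-profiling/docs/master/index.html) to see whether there is anything I can do, but so far I have not found anything relevant to my use case.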

score 2 · Accepted Answer

You ~~probably don't~~ don't need Pandas to count the occurrences of words in files.

import collections

word_counter = collections.Counter()

for i in file_list:
    with open(f"D:/Admin/3. OCR files/OCR_Translations/{i}") as f:
        for line in f:
            words = line.strip().split()  # Split line by whitespaces.
            word_counter.update(words)  # Update counter with occurrences.


print(word_counter)

You may also be interested in the .most_common() method on Counters.
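As a rough illustration of .most_common(), here is a minimal sketch. The sample_words list is made up and only stands in for the words you would read from your files.

```python
import collections

# Hypothetical sample tokens standing in for words read from the files.
sample_words = ["alpha", "beta", "alpha", "gamma", "alpha", "beta"]

word_counter = collections.Counter(sample_words)

# The two most frequent words with their counts, highest count first.
top_two = word_counter.most_common(2)
print(top_two)

# With no argument, most_common() returns every word, ordered by count.
full_list = word_counter.most_common()
print(full_list)
```

Calling it without an argument is what gives you the complete frequency list rather than only the top few entries.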

Also, if you really need to, you can turn that Counter into a dataframe; it is just a dict with some special sauce.
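One possible way to do that conversion is sketched below. The sample words and the column names "word" and "count" are my own assumptions, so rename them as you see fit.

```python
import collections

import pandas as pd

# Hypothetical sample tokens standing in for words read from the files.
word_counter = collections.Counter(
    ["alpha", "beta", "alpha", "gamma", "alpha", "beta"]
)

# most_common() yields (word, count) pairs sorted by count,
# which map directly onto two dataframe columns.
freq_df = pd.DataFrame(word_counter.most_common(), columns=["word", "count"])
print(freq_df)
```

Once the counts are in a dataframe, you can filter or sort them, or export them with the usual pandas tools such as to_csv.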
