python - Computing two values (total word count and unique word count) across multiple text files and writing a CSV in Python

Question

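I work with various collections of text files, and I want to know various things about them, such as word counts. I have code that does this successfully, and now I would like to introduce a script into my workflow that works its way through a directory and compiles statistics for the text files there.

Here is my draft: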

#! /usr/bin/env python

# Get from each text file a total word count and a unique word count.
# Output a CSV with three columns: filename, total, unique.

import glob

with open (file_name) as f, open ('countfile.csv', 'w') as out :
    list_of_files = glob.glob('./*.txt)
    for file_name in list_of_files:

        ???

        out.write('{f},{t},{u}\n'.format(f =file_name, t =word_total, u =uniques)

The question marks above are a placeholder for what I want to do to each file, which is the following code:

# Total No. of Words        
word_list = re.split('\s+', textfile.read().lower())
word_total = len(word_list)

# Unique Words
freq_dic = {}
punctuation = re.compile(r'[.?!,":;]') 
for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try: 
        freq_dic[word] += 1
    except: 
        freq_dic[word] = 1

uniques = len(freq_dic)

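I am not quite sure how to plug all of this code into the code above. I somehow suspect it won't work as is, but I don't know how to proceed. Any help here would be greatly appreciated. If I can figure this out, I think I might really be able to automate a lot of things.

I know the second code block may not be the prettiest, but it is as compact as I could make it while still understanding what it does. Without a doubt, I am only just starting to learn Python.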

Edited to clarify:

What I have is a directory of texts:

text1.txt  
text2.txt  
text3.txt

What I want is to point this script at that directory, have it go through all the texts, and output a CSV file of the following form:

text1, 345, 123
text2, 1025, 318
text3, 765, 245

(Note that the .txt does not need to be removed from the file names.)

score 3 · Accepted Answer

import glob
import re

# Map each file path to its (total words, unique words) pair.
files = {}
for fpath in glob.glob("*.txt"):
    with open(fpath) as f:
        # Replace everything except letters, apostrophes, and hyphens with spaces.
        fixed_text = re.sub("[^a-zA-Z'-]", " ", f.read())
    words = fixed_text.split()
    total_words = len(words)
    total_unique = len(set(words))
    files[fpath] = (total_words, total_unique)
    print("Total words:", total_words)
    print("Total unique:", total_unique)

# Write one row per file: filename,total,unique.
with open("some_csv.csv", "w") as f:
    for fname in files:
        print("%s,%s,%s" % (fname, files[fname][0], files[fname][1]), file=f)

I think this should work...
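For comparison, here is a minimal sketch of the same idea split into two functions and written with the standard `csv` module. The names `count_words` and `write_counts` are made up for illustration and are not part of the original answer. It also assumes you want case-insensitive counting, as in the question's `.lower()` call, so "The" and "the" count as one unique word; drop the `.lower()` if that is not what you want.

```python
import csv
import glob
import os
import re

# Same character class as the answer: anything that is not a letter,
# an apostrophe, or a hyphen is treated as a word separator.
NON_WORD = re.compile(r"[^a-zA-Z'-]")


def count_words(text):
    """Return (total_words, unique_words) for a string, ignoring case."""
    # Assumption: lowercase first so that "The" and "the" are the same word.
    words = NON_WORD.sub(" ", text.lower()).split()
    return len(words), len(set(words))


def write_counts(directory, out_path):
    """Write one CSV row (filename, total, unique) per .txt file in directory."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        # Sort the paths so the rows come out in a predictable order.
        for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
            with open(path) as f:
                total, unique = count_words(f.read())
            writer.writerow([os.path.basename(path), total, unique])
```

Calling `write_counts(".", "countfile.csv")` from the directory that holds the texts should produce one row per file. Using `csv.writer` rather than string formatting means a file name containing a comma is quoted correctly.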
