python - Creating a term-document matrix from a text file

Question

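I am trying to read a text file and create a term-document matrix using the textmining package. I can create the term-document matrix, but I have to add each line one at a time. The problem is that I want to include the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.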

import textmining

def term_document_matrix_roy_1():

    '''-----------------------------------------'''
    with open("data_set.txt") as f:
        reading_file_line = f.readlines() #entire content, return  list 
        print reading_file_line #list
        reading_file_info = [item.rstrip('\n') for item in reading_file_line]
        print reading_file_info
        print reading_file_info [1] #list-1
        print reading_file_info [2] #list-2

        '''-----------------------------------------'''
        tdm = textmining.TermDocumentMatrix()
        #tdm.add_doc(reading_file_info) #Giving error because of readlines 
        tdm.add_doc(reading_file_info[0])       
        tdm.add_doc(reading_file_info[1])
        tdm.add_doc(reading_file_info[2])


        for row in tdm.rows(cutoff=1):
            print row

The sample text file "data_set.txt" contains the following:

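Let's write some python code

Thus far, this book has mainly discussed the process of ad hoc retrieval.

Along the way, we will learn some important machine learning techniques.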

The output will be a term-document matrix, which basically shows how many times each word appears. Output image: http://postimg.org/image/eidddlkld/


score 2 · Accepted Answer

If I'm understanding you correctly, you're currently adding each line of your file as a separate document. To add the whole file, you could just concatenate the lines, and add them all at once.

tdm = textmining.TermDocumentMatrix()
#tdm.add_doc(reading_file_info) #Giving error because of readlines 
tdm.add_doc(' '.join(reading_file_info))
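To see what joining the lines changes, here is a minimal sketch that uses only the Python 3 standard library rather than textmining. The helper name `count_terms` and the simple regex tokenizer are assumptions for illustration, so the counts may not match textmining's own tokenization exactly.

```python
import re
from collections import Counter

def count_terms(text):
    """Count lowercase word tokens in one document (hypothetical helper)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

# The three sample lines from the question's data_set.txt.
lines = [
    "Let's write some python code",
    "Thus far, this book has mainly discussed the process of ad hoc retrieval.",
    "Along the way, we will learn some important machine learning techniques.",
]

# One document per line: three separate rows of counts.
per_line = [count_terms(line) for line in lines]

# Whole file as one document: a single row of counts.
whole_file = count_terms(" ".join(lines))

print(len(per_line))       # number of per-line documents
print(whole_file["some"])  # "some" occurs in the first and third lines
```

With the lines joined, a word that appears in several lines is summed into one count, and the information about which line it came from is lost.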

If you are looking for multiple matrices, you'll end up with only one row in each, since each matrix holds only one document, unless you have another way of splitting a line into separate documents. You may want to rethink whether this is what you actually want. Nevertheless, I think this code will do it for you:

with open("txt_files/input_data_set.txt") as f:
    tdms = []
    for line in f:
        tdm = textmining.TermDocumentMatrix()
        tdm.add_doc(line.strip())
        tdms.append(tdm)

    for tdm in tdms:
        for row in tdm.rows(cutoff=1):
            print row

I haven't really been able to test this code, so the output might not be right. Hopefully it will get you on your way.
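The point above about each per-line matrix having only one row can be illustrated without textmining. This is a rough sketch in plain Python 3. The `build_matrix` helper, its layout (a header row of terms followed by one row of counts per document) and the two sample lines are all assumptions made for illustration, not the library's API.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (simplified, hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_matrix(docs):
    """Return [header, row1, row2, ...] for a list of document strings."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    return [vocab] + [[c[term] for term in vocab] for c in counts]

# Hypothetical sample lines, standing in for lines read from a file.
lines = ["john and bob are brothers", "bob went to the store"]

# One matrix per line: each has a header and exactly one count row.
separate = [build_matrix([line]) for line in lines]

# One matrix for all lines: shared header, one count row per line.
combined = build_matrix(lines)

for row in combined:
    print(row)
```

In the combined matrix the columns line up across lines, so rows can be compared directly. The separate matrices each have their own vocabulary, so their columns do not correspond to one another.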

score 1 · Accepted Answer

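@Fred Thanks for your reply. I want to display what I showed in the image file. Actually, I can produce the same result with the following code, but I want each line as a separate matrix rather than one matrix.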

with open("txt_files/input_data_set.txt") as f:
    reading_file_info = f.read()  # read the entire file as one string
    tdm = textmining.TermDocumentMatrix()
    tdm.add_doc(reading_file_info)

    tdm.write_csv('txt_files/input_data_set_result.txt', cutoff=1)
    for row in tdm.rows(cutoff=1):
        print row

What I am trying to do is read a text file and create a term-document matrix.
