python-3.x - How do I do corpus preprocessing, lemmatization, and vectorization with spaCy NLP?

Translated from: https://stackoverflow.com/questions/55875708 2019-04-26T22:41:49.873

Viewed 319 times

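I am trying to use spaCy to tokenize, lemmatize, and vectorize a folder of .txt files in a Jupyter Notebook (Python 3).

Below is the code I tried to write, but I may have made a mistake. I want the whole folder to be tokenized, lemmatized, and vectorized (not any specific .txt file, but the bulk of them combined).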

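    from pathlib import Path

    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Requires the model: python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

    # 'corpus' is a placeholder folder name; file_list holds one string per .txt file
    file_list = [p.read_text(encoding='utf-8') for p in sorted(Path('corpus').glob('*.txt'))]

    #tokenization
    docs = list(nlp.pipe(file_list))
    for doc in docs:
        for token in doc:
            print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

    #lemmatisation
    def show_lemmas(doc):
        for token in doc:
            print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

    for doc in docs:
        show_lemmas(doc)

    #Vectorization (Using TF-IDF to create a vectorized document term matrix)
    tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
    dtm = tfidf.fit_transform(file_list)
    dtm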

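I would like code that performs text vectorization, lemmatization, and corpus preprocessing on a folder (containing a large number of .txt files). Can you help me write the code needed to achieve this? Also, let me know whether I should do anything more (besides Vec, Tok, and Lemm) before moving on to cluster analysis.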
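
To make the goal concrete, here is a minimal sketch of one way the three steps might be chained so that the TF-IDF matrix is built from the lemmas rather than from the raw text. The helper names (`read_corpus`, `lemmatize`, `vectorize`, `preprocess_folder`), the `en_core_web_sm` model, and the choice to drop stop words and non-alphabetic tokens are all assumptions made for illustration; none of them come from the code above.

```python
from pathlib import Path

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer


def read_corpus(folder):
    """Return one string per .txt file in `folder`, sorted by file name."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]


def lemmatize(texts, nlp):
    """Turn each raw text into a space-joined string of lowercased lemmas."""
    cleaned = []
    for doc in nlp.pipe(texts):  # nlp.pipe processes the texts in batches
        lemmas = [
            # Fall back to the surface form if the pipeline sets no lemma.
            (tok.lemma_ or tok.text).lower()
            for tok in doc
            # Assumption: keep alphabetic tokens only and drop stop words.
            if tok.is_alpha and not tok.is_stop
        ]
        cleaned.append(" ".join(lemmas))
    return cleaned


def vectorize(cleaned_texts, max_df=0.95, min_df=2):
    """Build a TF-IDF document-term matrix from the lemmatized texts."""
    tfidf = TfidfVectorizer(max_df=max_df, min_df=min_df)
    dtm = tfidf.fit_transform(cleaned_texts)
    return tfidf, dtm


def preprocess_folder(folder, model="en_core_web_sm"):
    """Read, lemmatize, and vectorize every .txt file in `folder`."""
    # The parser and NER components are not needed for lemmas, so skip them.
    nlp = spacy.load(model, disable=["parser", "ner"])
    return vectorize(lemmatize(read_corpus(folder), nlp))
```

In a notebook this could be called as `tfidf, dtm = preprocess_folder("corpus")`, where `"corpus"` is a placeholder folder name. The resulting sparse matrix `dtm` has one row per file and is the kind of input a clustering step would typically take.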

