python-3.x - Lemmatizing each sentence with treetaggerwrapper does not work: it returns one list of words instead of a list of words per sentence

Question

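This is my function, which is supposed to lemmatize a list of sentences, but the output is a single list of all the words rather than one list of lemmas per sentence.

Code of the lemmatize function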

tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr') 
def lemmatize(corpus):
    lemmatize_list_of_sentences = []
    lemmatize_list_of_sentences2 = []
    for sentence in corpus:
        tags = tagger.tag_text(sentence)
        tags2 = treetaggerwrapper.make_tags(tags, allow_extra = True)
        lemmatize_list_of_sentences.append(tags2)
        print(lemmatize_list_of_sentences)
        for subl in lemmatize_list_of_sentences: # loop in list of sublists 
            for word in subl:
                if word.__class__.__name__ == "Tag":
                    lemme=word[2] #  I want also to check if lemme[2] is empty and add this 
                    lemmeOption2=lemme.split("|")
                    lemme=lemmeOption1[0]
                    lemmatize_list_of_sentences2.append(lemme)


    return lemmatize_list_of_sentences2 # should return a list of lists where each list contains the lemme retrieve



lemmatize_train= lemmatize(sentences_train_remove_stop_words)
lemmatize_test= lemmatize(sentences_test_remove_stop_words)
print(lemmatize_train)

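In addition, I would like to add code to the lemmatize function that checks whether index 2 (or -1) is empty and, if it is, takes the word at the first index instead.

I came up with this, but how can I combine it with my lemmatize function?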

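for word in subl:
    parts = word.split('\t')  # raw TreeTagger line: word, POS, lemma
    try:
        if parts[2] == '':
            lemmatize_list_of_sentences2.append(parts[0])
        else:
            lemmatize_list_of_sentences2.append(parts[2])
    except IndexError:
        # the line has fewer than three tab-separated fields
        print(parts)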

List of sentences in file_input

La période de rotation de la Lune est la même que sa période orbitale et elle présente donc toujours le même hémisphère. 
Cette rotation synchrone résulte des frottements qu’ont entraînés les marées causées par la Terre.

After tagging the text and printing the sentence_tagging list, I get this:

First sentence:

[[Tag(word='la', pos='DET:ART', lemma='le'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='rotation', pos='NOM', lemma='rotation'), Tag(word='lune', pos='NOM', lemma='lune'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='orbitale', pos='ADJ', lemma='orbital'), Tag(word='présente', pos='VER:pres', lemma='présenter'), Tag(word='donc', pos='ADV', lemma='donc'), Tag(word='toujours', pos='ADV', lemma='toujours')]]

Both sentences:

[[Tag(word='la', pos='DET:ART', lemma='le'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='rotation', pos='NOM', lemma='rotation'), Tag(word='lune', pos='NOM', lemma='lune'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='orbitale', pos='ADJ', lemma='orbital'), Tag(word='présente', pos='VER:pres', lemma='présenter'), Tag(word='donc', pos='ADV', lemma='donc'), Tag(word='toujours', pos='ADV', lemma='toujours')], [Tag(word='cette', pos='PRO:DEM', lemma='ce'), Tag(word='rotation', pos='NOM', lemma='rotation'), Tag(word='synchrone', pos='ADJ', lemma='synchrone'), Tag(word='résulte', pos='VER:pres', lemma='résulter'), Tag(word='frottements', pos='NOM', lemma='frottement'), Tag(word='entraînés', pos='VER:pper', lemma='entraîner'), Tag(word='les', pos='DET:ART', lemma='le'), Tag(word='marées', pos='NOM', lemma='marée'), Tag(word='causées', pos='VER:pper', lemma='causer')]]

After retrieving the lemmas, I get a single list of words, which is not what I expected. I expected one list per sentence.

Output:

['le', 'période', 'rotation', 'lune', 'période', 'orbital', 'présenter', 'donc', 'toujours', 'ce', 'rotation', 'synchrone', 'résulter', 'frottement', 'entraîner', 'le', 'marée', 'causer']

Expected: the words of each sentence joined into one string, separated by spaces.


['le période rotation lune période orbital présenter donc toujours','ce rotation synchrone résulter frottement entraîner le marée causer']

score 1 · Accepted Answer

So you want two lists of lemmas, one per sentence.

You are returning a flat list; you have to make sure you return a list of lists.

import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')

def lemmatize(corpus):
    lemmatize_list_of_sentences2 = []
    for sentence in corpus:
        tags = tagger.tag_text(sentence)
        tags2 = treetaggerwrapper.make_tags(tags, allow_extra=True)
        # Here you create a list to work as an "inner" sentence list.
        # It is rebuilt for every sentence, so lemmas are never mixed or repeated.
        sentence_lemmas = []
        for word in tags2:
            if word.__class__.__name__ == "Tag":
                lemme = word[2]  # the lemma field of the Tag
                lemmeOption2 = lemme.split("|")
                lemme = lemmeOption2[0]  # There was a typo here (lemmeOption1)
                sentence_lemmas.append(lemme)  # Here you append the lemma extracted
        # Here the outer list receives the "inner" list instead of single words.
        lemmatize_list_of_sentences2.append(sentence_lemmas)

    return lemmatize_list_of_sentences2  # a list of lists, one list of lemmas per sentence



lemmatize_train= lemmatize(sentences_train_remove_stop_words)
lemmatize_test= lemmatize(sentences_test_remove_stop_words)
print(lemmatize_train)
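
The difference between the flat result and the nested one can be seen without running TreeTagger at all. The following is only a minimal sketch: `Tag` here is a local stand-in with the same three fields as the library's named tuple, and `tagged_corpus` is hypothetical, hand-written data rather than real tagger output.

```python
from collections import namedtuple

# Stand-in for treetaggerwrapper's Tag named tuple (same three fields),
# so the sketch runs without TreeTagger installed.
Tag = namedtuple("Tag", "word pos lemma")

# Hypothetical, already-tagged corpus: one list of Tags per sentence.
tagged_corpus = [
    [Tag("la", "DET:ART", "le"), Tag("période", "NOM", "période")],
    [Tag("cette", "PRO:DEM", "ce"), Tag("rotation", "NOM", "rotation")],
]

flat = []    # what appending every lemma to one list produces
nested = []  # what appending one inner list per sentence produces
for tagged_sentence in tagged_corpus:
    sentence_lemmas = []  # fresh inner list for each sentence
    for tag in tagged_sentence:
        flat.append(tag.lemma)             # sentence boundaries are lost
        sentence_lemmas.append(tag.lemma)  # boundaries are kept
    nested.append(sentence_lemmas)

print(flat)    # ['le', 'période', 'ce', 'rotation']
print(nested)  # [['le', 'période'], ['ce', 'rotation']]
```

The only structural change is where `append` is called: once per word on the outer list gives a flat list, once per sentence gives a list of lists.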

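Checking whether the lemma is empty

Also, according to the documentation (treetaggerwrapper docs), a "Tag" is a "named tuple".

You can read more about named tuples in this post.

But basically, you can refer to the attributes of a Tag as you would those of an object, using the . (dot) notation.

So, to check whether the lemma is empty, you can do the following: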

if word.lemma != "":
   lemme = word.lemma
else:
   # fall back to the word itself; [0] keeps lemme a string rather than a list
   lemme = word.word.split("/")[0]
Joining the lists

Also, if you want to join each list of lemmas back into a string at the end, do the following:

joined_sentences = []
for lemma_list in lemmatize_train:
   joined_sentences.append(" ".join(lemma_list))

print(joined_sentences)

A function that returns the joined strings:

def lemmatize(corpus):
    lemmatize_list_of_sentences2 = []
    for sentence in corpus:
        tags = tagger.tag_text(sentence)
        tags2 = treetaggerwrapper.make_tags(tags, allow_extra=True)
        # Here you create a list to work as an "inner" sentence list.
        sentence_lemmas = []
        for word in tags2:
            if word.__class__.__name__ == "Tag":
                lemme = word[2]  # the lemma field of the Tag
                lemmeOption2 = lemme.split("|")
                lemme = lemmeOption2[0]  # keep the first lemma candidate
                sentence_lemmas.append(lemme)  # Here you append the lemma extracted
        lemmatize_list_of_sentences2.append(sentence_lemmas)

    # Join each inner list into one space-separated string.
    joined_sentences = []
    for lemma_list in lemmatize_list_of_sentences2:
        joined_sentences.append(" ".join(lemma_list))
    return joined_sentences
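
To check the joining step against the expected output from the question without installing TreeTagger, the tagged data printed in the question can be fed through the same logic. This is a sketch under assumptions: `Tag` is a local stand-in for the library's named tuple, and `join_lemmas` is a hypothetical helper that starts from already-tagged sentences rather than calling the tagger.

```python
from collections import namedtuple

# Local stand-in with the same fields as the library's Tag named tuple.
Tag = namedtuple("Tag", "word pos lemma")

def join_lemmas(tagged_sentences):
    """Turn each list of Tags into one space-separated string of lemmas."""
    return [
        " ".join(tag.lemma.split("|")[0] for tag in sentence if isinstance(tag, Tag))
        for sentence in tagged_sentences
    ]

# The tags shown in the question, written as (word, pos, lemma) triples.
sentence_1 = [Tag(*t) for t in [
    ("la", "DET:ART", "le"), ("période", "NOM", "période"),
    ("rotation", "NOM", "rotation"), ("lune", "NOM", "lune"),
    ("période", "NOM", "période"), ("orbitale", "ADJ", "orbital"),
    ("présente", "VER:pres", "présenter"), ("donc", "ADV", "donc"),
    ("toujours", "ADV", "toujours"),
]]
sentence_2 = [Tag(*t) for t in [
    ("cette", "PRO:DEM", "ce"), ("rotation", "NOM", "rotation"),
    ("synchrone", "ADJ", "synchrone"), ("résulte", "VER:pres", "résulter"),
    ("frottements", "NOM", "frottement"), ("entraînés", "VER:pper", "entraîner"),
    ("les", "DET:ART", "le"), ("marées", "NOM", "marée"),
    ("causées", "VER:pper", "causer"),
]]

result = join_lemmas([sentence_1, sentence_2])
print(result)
```

The result has exactly one string per input sentence, which is the shape the question asks for.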

I hope it is clear now.
