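I am trying to write a function that returns a list of NLTK definitions (synsets) for the tokens of a text document, constrained by each word's part of speech.
I first convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets, and then apply .word_tokenize(), .pos_tag(), and .synsets in turn, as shown in the following code: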
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

# convert the tag to the one used by wordnet.synsets
def convert_tag(tag):
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

# tokenize, tag, and find synsets (give the first match between each 'token' and 'wordnet_tag')
def doc_to_synsets(doc):
    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = wn.synsets(token, wordnet_tag)
    return syns[0]

# test
doc = 'document is a test'
doc_to_synsets(doc)
If programmed correctly, it should return something like
[Synset('document.n.01'), Synset('be.v.01'), Synset('test.n.01')]
However, Python throws the error message:
'list' object has no attribute 'lower'
I also noticed that the error message points to this line:
lemma = lemma.lower()
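Does this mean I also need to "lemmatize" my tokens, as an earlier thread suggested? Or should I apply .lower() to the text document before doing any of this?
I am fairly new to wordnet and really do not know whether .synsets is causing the problem or whether the nltk part is at fault. I would greatly appreciate it if someone could enlighten me on this.
Thank you.
[Edit] Error traceback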
AttributeError Traceback (most recent call last)
<ipython-input-49-5bb011808dce> in <module>()
22 return syns
23
---> 24 doc_to_synsets('document is a test.')
25
26
<ipython-input-49-5bb011808dce> in doc_to_synsets(doc)
18 tag = nltk.pos_tag(token)
19 wordnet_tag = convert_tag(tag)
---> 20 syns = wn.synsets(token, wordnet_tag)
21
22 return syns
/opt/conda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py in synsets(self, lemma, pos, lang, check_exceptions)
1481 of that language will be returned.
1482 """
-> 1483 lemma = lemma.lower()
1484
1485 if lang == 'eng':
AttributeError: 'list' object has no attribute 'lower'
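Reading the traceback, wn.synsets seems to receive the whole token list where it expects a single string, and convert_tag is likewise handed the full list of (word, tag) pairs rather than one tag. Below is a minimal sketch of the per-token structure I believe is needed. It is not necessarily the exact code suggested in the comments. The `tagged` list is a hand-written stand-in for what nltk.pos_tag might return, and `fake_lookup` is a hypothetical stub so the sketch runs without the WordNet corpus.

```python
def convert_tag(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    return tag_dict.get(tag[0])


def synsets_per_token(tagged_tokens, lookup):
    """Call lookup(word, pos) once per (word, tag) pair, not once on the whole list."""
    return [lookup(word, convert_tag(tag)) for word, tag in tagged_tokens]


# Assumed sample of tagger output for 'document is a test' (hand-written, not real NLTK output)
tagged = [('document', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]


# Hypothetical stub standing in for wn.synsets; returns a fake name or an empty list
def fake_lookup(word, pos):
    return [f"{word}.{pos}"] if pos else []


result = synsets_per_token(tagged, fake_lookup)
print(result)
```

With NLTK and its WordNet data installed, passing wn.synsets as `lookup` and nltk.pos_tag(nltk.word_tokenize(doc)) as `tagged_tokens` should give one list of synsets per token.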
So, after using the code suggested by @dugup and @udiboy1209, I get the following output:
[[Synset('document.n.01'),
Synset('document.n.02'),
Synset('document.n.03'),
Synset('text_file.n.01'),
Synset('document.v.01'),
Synset('document.v.02')],
[Synset('be.v.01'),
Synset('be.v.02'),
Synset('be.v.03'),
Synset('exist.v.01'),
Synset('be.v.05'),
Synset('equal.v.01'),
Synset('constitute.v.01'),
Synset('be.v.08'),
Synset('embody.v.02'),
Synset('be.v.10'),
Synset('be.v.11'),
Synset('be.v.12'),
Synset('cost.v.01')],
[Synset('angstrom.n.01'),
Synset('vitamin_a.n.01'),
Synset('deoxyadenosine_monophosphate.n.01'),
Synset('adenine.n.01'),
Synset('ampere.n.02'),
Synset('a.n.06'),
Synset('a.n.07')],
[Synset('trial.n.02'),
Synset('test.n.02'),
Synset('examination.n.02'),
Synset('test.n.04'),
Synset('test.n.05'),
Synset('test.n.06'),
Synset('test.v.01'),
Synset('screen.v.01'),
Synset('quiz.v.01'),
Synset('test.v.04'),
Synset('test.v.05'),
Synset('test.v.06'),
Synset('test.v.07')],
[]]
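The problem now comes down to extracting the first match (that is, the first element) from each inner list of 'syns' and collecting them into a new list. For the trial document 'document is a test', it should return: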
[Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]
That is, a list containing the first match for each token in the text document.
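A minimal sketch of that last step, assuming empty inner lists (such as the one produced for the trailing '.') should simply be skipped. Plain strings stand in for the Synset objects here so the snippet runs without NLTK, and `first_matches` is a name I made up for illustration.

```python
def first_matches(syns):
    """Return the first element of every non-empty inner list."""
    return [matches[0] for matches in syns if matches]


# Shortened stand-in for the nested output above (strings instead of Synset objects)
syns = [
    ['document.n.01', 'document.n.02'],
    ['be.v.01', 'be.v.02'],
    ['angstrom.n.01', 'vitamin_a.n.01'],
    ['trial.n.02', 'test.n.02'],
    [],  # the token '.' has no synsets
]

firsts = first_matches(syns)
print(firsts)
```

The `if matches` filter is what avoids an IndexError on the empty list. If a placeholder per token is preferred, `matches[0] if matches else None` would keep the positions aligned with the tokens.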