
Can CoreNLP determine whether a common noun (as opposed to a proper noun or proper name) refers to a person out-of-the-box? Or if I need to train a model for this task, how do I go about that?

First, I am not looking for coreference resolution, but rather a building block for it. Coreference by definition depends on context, whereas I am trying to decide whether a word in isolation denotes a kind of "person" or "human". For example:

is_human('effort') # False
is_human('dog') # False
is_human('engineer') # True

My naive attempt to use Gensim's and spaCy's pre-trained word vectors failed to rank "engineer" above the other two words.

import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100") 
for word in ('effort', 'dog', 'engineer'):
    print(word, word_vectors.similarity(word, 'person'))

# effort 0.42303842
# dog 0.46886832
# engineer 0.32456854

I found the following lists from CoreNLP promising.

dcoref.demonym                   // The path for a file that includes a list of demonyms 
dcoref.animate                   // The list of animate/inanimate mentions (Ji and Lin, 2009)
dcoref.inanimate 
dcoref.male                      // The list of male/neutral/female mentions (Bergsma and Lin, 2006) 
dcoref.neutral                   // Neutral means a mention that is usually referred by 'it'
dcoref.female 
dcoref.plural                    // The list of plural/singular mentions (Bergsma and Lin, 2006)
dcoref.singular

Would these work for my task? And if so, how would I access them from the Python wrapper? Thank you.


1 Answer


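I would suggest trying WordNet instead and checking:

  1. whether WordNet covers enough of your terms, and
  2. whether the terms you want are hyponyms of person.n.01.

You will have to extend this a bit to cover multiple senses, but the gist is: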

from nltk.corpus import wordnet as wn

# True
wn.synset('person.n.01') in wn.synset('engineer.n.01').lowest_common_hypernyms(wn.synset('person.n.01'))

# False
wn.synset('person.n.01') in wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('person.n.01'))

See the NLTK documentation for lowest_common_hypernyms: http://www.nltk.org/howto/wordnet_lch.html
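One way the extension to multiple senses might look is sketched below. It is not the only approach: instead of lowest_common_hypernyms it walks hypernym_paths(), which lists every chain from the WordNet root down to a synset, and checks whether person.n.01 appears on any of them. The names is_person_sense, is_human, first_sense_only and lookup are hypothetical helpers introduced here for illustration. Looking only at the first sense is an assumption made because WordNet generally lists the most frequent sense first, and some words have a rarer person sense (for instance, "dog" also has informal senses that refer to people).

```python
PERSON = "person.n.01"  # WordNet synset treated as the "human" root (assumption)


def is_person_sense(synset):
    """True if `synset` is person.n.01 or has it among its hypernyms."""
    return any(
        ancestor.name() == PERSON
        for path in synset.hypernym_paths()  # each path runs from the root to `synset`
        for ancestor in path
    )


def is_human(word, first_sense_only=True, lookup=None):
    """Hypothetical helper: does `word`, taken in isolation, name a kind of person?

    `lookup` maps a word to a list of noun synsets. It defaults to WordNet
    through NLTK, which needs the corpus installed: nltk.download('wordnet').
    """
    if lookup is None:
        from nltk.corpus import wordnet as wn
        lookup = lambda w: wn.synsets(w, pos=wn.NOUN)
    senses = lookup(word)
    if first_sense_only:
        senses = senses[:1]  # keep only the first-listed (usually most frequent) sense
    return any(is_person_sense(s) for s in senses)


if __name__ == "__main__":
    for word in ("effort", "dog", "engineer"):
        print(word, is_human(word))
```

Passing first_sense_only=False accepts a word if any of its noun senses is a kind of person, which raises recall but also lets in words whose person reading is rare. A word that WordNet does not cover simply yields False, so coverage is worth checking on your own vocabulary first.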

answered 2019-04-11T19:44:51.400