
I want to extract all the fish types from a Wikipedia page and print them (I copied the page content into a text file). I used POS tagging and then a chunker to extract the fish types, but my output contains other unwanted data. Here is the code I implemented:

import nltk
from nltk.corpus import stopwords
from nltk.chunk.regexp import RegexpParser
# open the file and read its contents
fp = open('C:\\Temp\\fishdata.txt', 'r')
text = fp.read()
fp.close()
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()
sentence_re = r'''(?x)      # set flag to allow verbose regexps
      ([A-Z])(\.[A-Z])+\.?  # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():-_`]      # these are separate tokens
'''
chunker = RegexpParser(r'''
NP: {<NNP><'fish'>}
''')
stpwords = stopwords.words('english')
toks = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(toks)
sent=chunker.parse(postoks)
print(sent)

The output I get:

wikipedia
armored 
fish
ray-finned
fish
jelly
fish 
constucutive
then
oragn

Desired output:

armored 
fish
jelly
fish
bony
fish

The above is only a small part of the output, but I need only the content shown in the second listing. The input is the Wikipedia page - http://en.wikipedia.org/wiki/Fish, which I copied into a text file.
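For reference, `RegexpParser` chunk grammars match POS tags rather than literal words, so a pattern like `<'fish'>` never matches the token "fish". A minimal sketch of one workaround (hand-tagged sample tokens here to keep it self-contained; in practice the pairs would come from `nltk.pos_tag`) chunks adjective/noun pairs and then keeps only the chunks whose last word is "fish":

```python
import nltk

# Hand-tagged sample; in practice these pairs come from nltk.pos_tag().
tagged = [("The", "DT"), ("armored", "JJ"), ("fish", "NN"),
          ("and", "CC"), ("the", "DT"), ("jelly", "NN"),
          ("fish", "NN"), ("swam", "VBD"), ("past", "IN"),
          ("a", "DT"), ("bony", "JJ"), ("fish", "NN")]

# The grammar matches POS tags only; filtering for the literal word
# "fish" has to happen after parsing, on the chunked subtrees.
chunker = nltk.RegexpParser("NP: {<JJ|NN><NN>}")
tree = chunker.parse(tagged)

fish_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")
                if subtree.leaves()[-1][0].lower() == "fish"]
print(fish_phrases)  # -> ['armored fish', 'jelly fish', 'bony fish']
```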


1 Answer

from nltk.corpus import wordnet as wn
fish_words = set()
fish_types = set()
for i in wn.all_synsets():
    # if 'fish' exists anywhere in a lemma name
    x = [j for j in i.lemma_names() if "fish" in j]
    fish_words.update(x)
    # if a lemma name ends with 'fish'
    y = [j for j in i.lemma_names() if j.endswith("fish")]
    fish_types.update(y)

print(fish_types)
print([i.replace("_", " ")[:-4].strip() for i in fish_types])

I'm not sure exactly which fish you're looking for, but as long as you're relying on WordNet, the above should give you all the fish you need.

Answered 2013-09-10T12:09:48.427