command-line - How can I run this Polyglot token/tag extractor in PyCharm?

Question

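I am evaluating various named entity recognition (NER) libraries, and I am currently trying out Polyglot.

Everything seems to be going well, but the instructions tell me to run this line at the command prompt: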

!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en ner | tail -n 20

...which should give (in the example) this output:

,               O
which           O
was             O
equalled        O
five            O
days            O
ago             O
by              O
South           I-LOC
Africa          I-LOC
in              O
their           O
victory         O
over            O
West            I-ORG
Indies          I-ORG
in              O
Sydney          I-LOC
.               O

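This is exactly the kind of output my project needs, and it works exactly the way I need it to. However, I need to run it from PyCharm rather than the command line, and store the results in a pandas data frame. How do I translate that command?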

score 1 · Accepted Answer

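This assumes that polyglot is installed correctly and that the right environment is selected in PyCharm. If it is not, install polyglot in a new conda environment with the necessary requirements, then create a new project and select that existing conda environment in PyCharm. If the language embeddings and NER models have not been downloaded yet, download them first.

Code: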
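If the models are missing, one way to fetch them from inside PyCharm is Polyglot's downloader module. This is only a minimal sketch: `model_packages` and `download_models` are hypothetical helper names, and the package names `embeddings2.en` and `ner2.en` assume Polyglot's usual naming for the English embeddings and NER model.

```python
def model_packages(lang="en"):
    """Names of the embedding and NER packages for one language (assumed naming)."""
    return ["embeddings2." + lang, "ner2." + lang]


def download_models(lang="en"):
    """Download the packages listed by model_packages (needs network access)."""
    # Imported lazily so this module still loads where polyglot is absent.
    from polyglot.downloader import downloader

    for name in model_packages(lang):
        downloader.download(name)
```

Calling `download_models("en")` once per environment should be enough; after that the models are read from Polyglot's local data directory.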

from polyglot.text import Text

blob = """, which was equalled five days ago by South Africa in the victory over West Indies in Sydney."""
text = Text(blob)
text.language = "en"


## All detected entities, as a list
print("As list all detected entities")
print(text.entities)

print()

## Each detected entity shown separately, with its tag
print("Separately shown detected entities")
for entity in text.entities:
    print(entity.tag, entity)

print()

## Tokenized words of sentence
print("Tokenized words of sentence")
print(text.words)

print()

## Run named entity recognition on each token separately.
## This is not very reliable: for single words, language detection may guess a language other than English.
## If the embeddings for that language are not installed, or the language = "en" line is commented out, this may raise an error.
print("For each token try named entity recognition")
for word in text.words:
    # Use a separate name so the full-sentence `text` object is not overwritten.
    token_text = Text(word)
    token_text.language = "en"

    ## Print each entity found in this single token
    for entity in token_text.entities:
        print(entity.tag, entity)

Output:

As list all detected entities
[I-LOC(['South', 'Africa']), I-ORG(['West', 'Indies']), I-LOC(['Sydney'])]

Separately shown detected entities
I-LOC ['South', 'Africa']
I-ORG ['West', 'Indies']
I-LOC ['Sydney']

Tokenized words of sentence
[',', 'which', 'was', 'equalled', 'five', 'days', 'ago', 'by', 'South', 'Africa', 'in', 'the', 'victory', 'over', 'West', 'Indies', 'in', 'Sydney', '.']

For each token try named entity recognition
I-LOC ['Africa']
I-PER ['Sydney']
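
The question also asks for a pandas data frame with one row per token, like the command-line output. A minimal sketch follows, assuming each entity's `start` and `end` attributes index into `text.words` with an exclusive end. `tag_tokens`, `polyglot_rows`, and `to_dataframe` are hypothetical helper names, not part of Polyglot.

```python
def tag_tokens(tokens, spans):
    """Return (token, tag) rows; tokens outside every span are tagged "O".

    spans is a list of (tag, start, end) tuples with an exclusive end index.
    """
    tags = ["O"] * len(tokens)
    for tag, start, end in spans:
        for i in range(start, end):
            tags[i] = tag
    return list(zip(tokens, tags))


def polyglot_rows(blob, lang="en"):
    """Tokenize and tag blob with Polyglot, one (token, tag) row per token."""
    # Imported lazily so tag_tokens can be used without polyglot installed.
    from polyglot.text import Text

    text = Text(blob)
    text.language = lang
    # Assumption: entity.start / entity.end are positions in text.words.
    spans = [(entity.tag, entity.start, entity.end) for entity in text.entities]
    return tag_tokens([str(word) for word in text.words], spans)


def to_dataframe(rows):
    """Build a two-column data frame from (token, tag) rows."""
    import pandas as pd

    return pd.DataFrame(rows, columns=["token", "tag"])
```

With the models installed, `to_dataframe(polyglot_rows(blob))` should give a frame with `token` and `tag` columns that mirrors the two-column layout of the command-line output.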
