python - Python Polyglot: how to prevent hyphens from splitting words that belong together

Question

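I am trying to clean up sentences to build a better word cloud, but I am running into a problem where hyphens split words that belong together.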

Here is an extreme case, where I strip all numbers. 2-Mics should show up in the image, not just Mics:

{
  "text": "ReSpeaker 2-Mics Pi HAT - Seeed Wiki",
  "lang": "English",
  "confidence": 97.0,
  "tags": [
    [
      "Mics",
      "NUM"
    ],
    [
      "Pi",
      "NOUN"
    ],
    [
      "HAT",
      "PROPN"
    ],
    [
      "Seeed",
      "NUM"
    ],
    [
      "Wiki",
      "NOUN"
    ]
  ]
},

Likewise, K2-18b would make more sense in the word cloud than K2 in one place and 18b somewhere else.

{
  "text": "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE",
  "lang": "German",
  "confidence": 98.0,
  "tags": [
    [
      "Supererde",
      "PROPN"
    ],
    [
      "Forscher",
      "NOUN"
    ],
    [
      "finden",
      "VERB"
    ],
    [
      "Wasser",
      "NOUN"
    ],
    [
      "K2",
      "PROPN"
    ],
    [
      "18b",
      "PROPN"
    ],
    [
      "SPIEGEL",
      "PROPN"
    ],
    [
      "ONLINE",
      "PROPN"
    ]
  ]
},

It is perfectly fine for dashes to be removed, for example the one between K2-18b and SPIEGEL in the segment K2-18b - SPIEGEL.
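To make the distinction concrete, here is a minimal sketch using only the standard `re` module. It is not something polyglot provides. The names `COMPOUND`, `SEPARATOR_DASH`, `hyphenated_compounds` and `drop_separator_dashes` are made up for illustration, and the rule itself (a hyphen with word characters directly on both sides joins a compound, a dash surrounded by whitespace is a separator) is my assumption about what "belongs together" means:

```python
import re

# Assumption: a hyphen with word characters directly on both sides joins a
# compound (K2-18b), while a dash with whitespace on both sides is a separator.
COMPOUND = re.compile(r"\w+(?:-\w+)+")
SEPARATOR_DASH = re.compile(r"\s+-\s+")


def hyphenated_compounds(title):
    """Return every hyphen-joined compound found in the title, in order."""
    return COMPOUND.findall(title)


def drop_separator_dashes(title):
    """Replace free-standing dashes with one space, leaving compounds intact."""
    return SEPARATOR_DASH.sub(" ", title)
```

With this rule, the free-standing dash in K2-18b - SPIEGEL is dropped, while K2-18b itself is reported as a single compound.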

Here is another case where respecting the hyphens makes sense:

{
  "text": "docker-spacy-alpine/Dockerfile at master \u00b7 cluttered-code/docker-spacy-alpine",
  "lang": "English",
  "confidence": 98.0,
  "tags": [
    [
      "docker",
      "NUM"
    ],
    [
      "spacy",
      "NUM"
    ],
    [
      "Dockerfile",
      "NUM"
    ],
    [
      "master",
      "NOUN"
    ],
    [
      "cluttered",
      "VERB"
    ],
    [
      "code",
      "NOUN"
    ],
    [
      "docker",
      "NUM"
    ],
    [
      "spacy",
      "NUM"
    ],
    [
      "alpine",
      "ADJ"
    ]
  ]
},

That way the image would end up showing docker-spacy-alpine Dockerfile cluttered-code, with docker-spacy-alpine standing out more.

This is the code I am using:

import traceback

from polyglot.text import Text

#...

for item in result:
  if 'title' in item:
    text = Text(item['title'])
    if text.language.code in ['en', 'de']:
      tags = []
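      try:
        unfiltered_tags = text.pos_tags
        for tag in unfiltered_tags:
          try:
            float(tag[0])  # succeeds only for purely numeric tokens, which are dropped
          except ValueError:
            # not a number: keep the token if its POS tag is one of interest
            if tag[1] in ['NUM', 'ADJ', 'VERB', 'PROPN', 'INTJ', 'NOUN']:
              tags.append(tag)
      except Exception:
        traceback.print_exc()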
      titles.append({
        'text': item['title'],
        'lang': text.language.code,
        'confidence': text.language.confidence,
        'tags': tags,
      })

Is there a way to configure polyglot so that it does not do this splitting, or do I need to do some manual post-processing on the sentences?
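For the manual post-processing route, one possible sketch is below. It assumes that the tag list is a sequence of (word, tag) pairs in the same order as the words appear in the title, as in the output shown above. I am not certain whether polyglot emits the hyphen as a token of its own, so the sketch tolerates both cases by skipping bare "-" tokens. The function name `merge_hyphenated` is hypothetical, and keeping the first part's tag for the merged token is an arbitrary choice:

```python
def merge_hyphenated(text, tags):
    """Re-join consecutive (word, tag) pairs linked by a bare hyphen in text.

    Hypothetical helper, not part of polyglot. Assumes the pairs are in
    the same order as the words occur in text.
    """
    merged = []
    pos = 0          # where to resume searching in the original text
    prev_end = None  # end offset of the previously located token
    for word, tag in tags:
        if word == "-":
            # Hyphen emitted as its own token: drop it and let the gap
            # check below decide whether its neighbours belong together.
            continue
        start = text.find(word, pos)
        if start == -1:
            # Token not found verbatim: keep it and forget adjacency.
            merged.append((word, tag))
            prev_end = None
            continue
        if merged and prev_end is not None and text[prev_end:start] == "-":
            # Exactly one hyphen and nothing else between the two tokens.
            prev_word, prev_tag = merged[-1]
            merged[-1] = (prev_word + "-" + word, prev_tag)  # first tag wins
        else:
            merged.append((word, tag))
        prev_end = start + len(word)
        pos = prev_end
    return merged
```

In the loop above this would be called as merge_hyphenated(item['title'], text.pos_tags) before the numeric filter, so that 2 is still present to be joined with Mics. The merged token 2-Mics then fails the float() check and is kept. Because the sketch locates tokens with str.find, it can mis-align if a token is not present verbatim in the title.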
