
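I'm trying to clean up sentences to build better word clouds, but I've run into a problem: words that belong together get split at their hyphens.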

Here is an extreme case, in which I'm removing all the numbers. 2-Mics should show up in the image, not just Mics.

{
  "text": "ReSpeaker 2-Mics Pi HAT - Seeed Wiki",
  "lang": "English",
  "confidence": 97.0,
  "tags": [
    [
      "Mics",
      "NUM"
    ],
    [
      "Pi",
      "NOUN"
    ],
    [
      "HAT",
      "PROPN"
    ],
    [
      "Seeed",
      "NUM"
    ],
    [
      "Wiki",
      "NOUN"
    ]
  ]
},

Likewise, K2-18b makes more sense than K2 with 18b showing up somewhere else in the word cloud.

{
  "text": "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE",
  "lang": "German",
  "confidence": 98.0,
  "tags": [
    [
      "Supererde",
      "PROPN"
    ],
    [
      "Forscher",
      "NOUN"
    ],
    [
      "finden",
      "VERB"
    ],
    [
      "Wasser",
      "NOUN"
    ],
    [
      "K2",
      "PROPN"
    ],
    [
      "18b",
      "PROPN"
    ],
    [
      "SPIEGEL",
      "PROPN"
    ],
    [
      "ONLINE",
      "PROPN"
    ]
  ]
},

It's perfectly fine for the dash to be removed, for example the one between K2-18b and SPIEGEL in the segment "K2-18b - SPIEGEL".
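
To make the distinction concrete, here is a minimal sketch of how the two kinds of dash could be told apart on the raw title, independent of polyglot. The helper name hyphenated_words and the regular expression are my own assumptions for illustration: a hyphen is treated as part of a word only when word characters touch it on both sides, so a dash surrounded by spaces is ignored.

```python
import re

# Assumption: a hyphen belongs to a word only if word characters sit
# directly on both sides of it ("K2-18b"), not when it is surrounded
# by whitespace ("K2-18b - SPIEGEL").
HYPHENATED = re.compile(r"\w+(?:-\w+)+")


def hyphenated_words(title):
    """Return every hyphen-joined word in the title, in order of appearance."""
    return HYPHENATED.findall(title)


title = "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE"
print(hyphenated_words(title))  # ['K2-18b']
```

The standalone dash before SPIEGEL is not matched because it has whitespace on both sides.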

Here is another case where respecting the hyphens would make sense:

{
  "text": "docker-spacy-alpine/Dockerfile at master \u00b7 cluttered-code/docker-spacy-alpine",
  "lang": "English",
  "confidence": 98.0,
  "tags": [
    [
      "docker",
      "NUM"
    ],
    [
      "spacy",
      "NUM"
    ],
    [
      "Dockerfile",
      "NUM"
    ],
    [
      "master",
      "NOUN"
    ],
    [
      "cluttered",
      "VERB"
    ],
    [
      "code",
      "NOUN"
    ],
    [
      "docker",
      "NUM"
    ],
    [
      "spacy",
      "NUM"
    ],
    [
      "alpine",
      "ADJ"
    ]
  ]
},

This should end up as "docker-spacy-alpine Dockerfile cluttered-code docker-spacy-alpine", so that docker-spacy-alpine stands out more in the image.
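
The effect on the word cloud can be sketched with plain frequency counts, assuming the cloud sizes each word by how often it occurs. The split list is copied from the tags above; the joined list is the hypothetical result I would like to get instead.

```python
from collections import Counter

# Tokens exactly as they appear in the tags above (hyphens split apart).
split_tokens = ["docker", "spacy", "Dockerfile", "master", "cluttered",
                "code", "docker", "spacy", "alpine"]

# Hypothetical desired tokens, with hyphenated words kept together.
joined_tokens = ["docker-spacy-alpine", "Dockerfile", "master",
                 "cluttered-code", "docker-spacy-alpine"]

split_counts = Counter(split_tokens)
joined_counts = Counter(joined_tokens)

# With splitting, the weight is spread over generic words like "docker".
print(split_counts.most_common(2))
# Without splitting, the full name carries the weight.
print(joined_counts.most_common(1))
```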

This is the code I'm using:

import traceback

from polyglot.text import Text

#...

for item in result:
  if 'title' in item:
    text = Text(item['title'])
    if text.language.code in ['en', 'de']:
      tags = []
      try:
        unfiltered_tags = text.pos_tags
        for tag in unfiltered_tags:
          # Skip tokens that parse as plain numbers; keep everything else.
          try:
            float(tag[0])
          except ValueError:
            if tag[1] in ['NUM', 'ADJ', 'VERB', 'PROPN', 'INTJ', 'NOUN']:
              tags.append(tag)
      except Exception:
        traceback.print_exc()
      titles.append({
        'text': item['title'],
        'lang': text.language.code,
        'confidence': text.language.confidence,
        'tags': tags,
      })

Is there a way to tune polyglot so that it doesn't do this splitting, or do I need to do some manual post-processing on the sentences?
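
For the manual post-processing route, this is a rough sketch of what I have in mind, not something I have verified against polyglot's tokenizer. The function merge_hyphenated is hypothetical. It assumes the tags are (word, tag) pairs as in my loop above, that the hyphen may or may not show up as its own "-" token, and that the merged word can simply take the tag of its last part (an arbitrary choice). It would have to run on the unfiltered tags, before numbers such as the 2 in 2-Mics are dropped.

```python
import re

# Assumption: a hyphenated word is word characters joined by hyphens
# with no whitespace around the hyphens.
HYPHENATED = re.compile(r"\w+(?:-\w+)+")


def merge_hyphenated(text, tags):
    """Re-join (word, tag) pairs that form a hyphenated word in `text`."""
    compounds = set(HYPHENATED.findall(text))
    merged = []
    i = 0
    while i < len(tags):
        match = None  # (index after the last merged token, merged word)
        if tags[i][0] != "-":  # never start a merge on a bare dash
            parts = []
            for j in range(i, len(tags)):
                word = tags[j][0]
                if word == "-":
                    continue  # tolerate the hyphen as a separate token
                parts.append(word)
                candidate = "-".join(parts)
                if candidate in compounds:
                    match = (j + 1, candidate)  # keep looking for a longer match
                if not any(c.startswith(candidate + "-") for c in compounds):
                    break
        if match is None:
            merged.append(tags[i])
            i += 1
        else:
            end, word = match
            # Arbitrary choice: the merged word inherits the last part's tag.
            merged.append((word, tags[end - 1][1]))
            i = end
    return merged


# Hypothetical unfiltered tags; the "-" tokens and their PUNCT tag are assumed.
text = "Wasser auf K2-18b - SPIEGEL"
tags = [("Wasser", "NOUN"), ("auf", "ADP"), ("K2", "PROPN"), ("-", "PUNCT"),
        ("18b", "PROPN"), ("-", "PUNCT"), ("SPIEGEL", "PROPN")]
print(merge_hyphenated(text, tags))
```

One known weakness: because missing "-" tokens are tolerated, the parts of a hyphenated word would also be merged where they appear next to each other without a hyphen elsewhere in the same title.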
