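I am trying to clean up sentences to create better word clouds, but I am running into the problem that hyphens split words that belong together.
The following is an extreme case, in which I remove all numbers. 2-Mics should be found in the image, not just Mics:
{
  "text": "ReSpeaker 2-Mics Pi HAT - Seeed Wiki",
  "lang": "English",
  "confidence": 97.0,
  "tags": [
    ["Mics", "NUM"],
    ["Pi", "NOUN"],
    ["HAT", "PROPN"],
    ["Seeed", "NUM"],
    ["Wiki", "NOUN"]
  ]
},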
Similarly, K2-18b would make more sense than K2 in one place and 18b somewhere else in the word cloud.
{
  "text": "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE",
  "lang": "German",
  "confidence": 98.0,
  "tags": [
    ["Supererde", "PROPN"],
    ["Forscher", "NOUN"],
    ["finden", "VERB"],
    ["Wasser", "NOUN"],
    ["K2", "PROPN"],
    ["18b", "PROPN"],
    ["SPIEGEL", "PROPN"],
    ["ONLINE", "PROPN"]
  ]
},
It is perfectly fine for a free-standing dash to be removed, for example the one between K2-18b and SPIEGEL in the segment "K2-18b - SPIEGEL".
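The distinction drawn here, a hyphen inside a term versus a free-standing dash, can be expressed on the raw title with a regular expression. The following is only a minimal sketch that does not involve polyglot at all; the names COMPOUND and hyphenated_compounds are hypothetical, and it assumes that a hyphen belongs to a term exactly when word characters sit directly on both sides of it.

```python
import re

# Assumption: a hyphen is part of a term only when it is directly
# surrounded by word characters (no spaces), as in "K2-18b".
COMPOUND = re.compile(r"\w+(?:-\w+)+")


def hyphenated_compounds(title):
    """Return the hyphen-joined terms in title, ignoring spaced dashes."""
    return COMPOUND.findall(title)


title = "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE"
print(hyphenated_compounds(title))  # ['K2-18b']
```

The dash in " - SPIEGEL" is not matched because it has spaces on both sides, so it can still be dropped.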
Here is another case where respecting the hyphens would make sense:
{
  "text": "docker-spacy-alpine/Dockerfile at master \u00b7 cluttered-code/docker-spacy-alpine",
  "lang": "English",
  "confidence": 98.0,
  "tags": [
    ["docker", "NUM"],
    ["spacy", "NUM"],
    ["Dockerfile", "NUM"],
    ["master", "NOUN"],
    ["cluttered", "VERB"],
    ["code", "NOUN"],
    ["docker", "NUM"],
    ["spacy", "NUM"],
    ["alpine", "ADJ"]
  ]
},
This would then end up as docker-spacy-alpine, Dockerfile and cluttered-code in the image, with docker-spacy-alpine being more prominent.
This is the code I am using:
import traceback

from polyglot.text import Text

#...
for item in result:
    if 'title' in item:
        text = Text(item['title'])
        # Only English and German titles are processed.
        if text.language.code in ['en', 'de']:
            tags = []
            try:
                unfiltered_tags = text.pos_tags
                for tag in unfiltered_tags:
                    try:
                        # Tokens that parse as a number are dropped.
                        float(tag[0])
                    except ValueError:
                        # Not a number: keep it if its POS tag is of interest.
                        if tag[1] in ['NUM', 'ADJ', 'VERB', 'PROPN', 'INTJ', 'NOUN']:
                            tags.append(tag)
            except Exception:
                traceback.print_exc()
            titles.append({
                'text': item['title'],
                'lang': text.language.code,
                'confidence': text.language.confidence,
                'tags': tags,
            })
Is there a way to tweak polyglot so that it does not do this splitting, or do I need to do some manual post-processing on the sentences?
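For the manual post-processing route, one possible shape is to re-join tagged tokens that were separated by nothing but a hyphen in the original title. This is only a sketch under stated assumptions, not a polyglot feature: merge_hyphenated is a hypothetical helper, it assumes the (word, tag) pairs arrive in title order (as the unfiltered text.pos_tags list appears to), and it works whether the tagger emits the hyphen as its own token or drops it. The PUNCT tags in the usage example are made up for illustration.

```python
def merge_hyphenated(title, tags):
    """Re-join (word, tag) pairs that a single hyphen separates in title.

    Hypothetical helper. A merged term takes the tag of its last part;
    hyphen tokens themselves are dropped from the result.
    """
    merged = []
    cursor = 0        # where to resume searching in title
    prev_end = None   # end offset of the last word placed in merged
    for word, pos in tags:
        start = title.find(word, cursor)
        if start == -1:
            # Token does not occur verbatim in the title: keep it untouched.
            merged.append((word, pos))
            prev_end = None
            continue
        if word == "-":
            # Skip the hyphen token but remember where the previous word ended.
            cursor = start + 1
            continue
        if prev_end is not None and title[prev_end:start] == "-":
            # Exactly one hyphen and no spaces between the two words.
            prev_word, _ = merged[-1]
            merged[-1] = (prev_word + "-" + word, pos)
        else:
            merged.append((word, pos))
        prev_end = start + len(word)
        cursor = prev_end
    return merged


title = "ReSpeaker 2-Mics Pi HAT - Seeed Wiki"
tags = [("2", "NUM"), ("-", "PUNCT"), ("Mics", "NUM"), ("Pi", "NOUN"),
        ("HAT", "PROPN"), ("-", "PUNCT"), ("Seeed", "NUM"), ("Wiki", "NOUN")]
print(merge_hyphenated(title, tags))
```

Such a step would have to run on the unfiltered tags, before the number filter, so that "2" is still present to be joined with "Mics"; the merged "2-Mics" no longer parses as a float and therefore survives the filter.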