python - French, Spanish and German characters missing from a word list generated with findall()

Question

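I am using collections.Counter() and findall() to build a word list from a .txt file (containing 65,000 words). It works for English, but it ignores special characters from other languages, such as â, á, ü, ö, etc. In addition, I would like compound forms such as "t'appele" and "signifie-t-elle" to be added as a single distinct word. I have tried various regex combinations without success. Does anyone know how to make it include the special characters? My code is below.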

with open(text_to_load) as f:
    words_from_text = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line, re.UNICODE))

score 0 · Accepted Answer

Thanks a lot, you really helped me greatly with the encoding. I had a further problem with \W in regex which doesn't seem to allow French characters. But I solved it this way instead:

words_from_text = {}  # assumed to start empty; the original answer does not show its definition
with open(text_to_load, "r", encoding='utf-8') as f:
    for line in f:
        line = line.replace(".", " ")
        line = line.replace("—", " ")
        line = line.replace(",", " ")
        line = line.lower()
        for word in line.split():
            if word in words_from_text:
                words_from_text[word] += 1
            else:
                words_from_text[word] = 1
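
For anyone who would rather keep the findall() approach from the question, here is a minimal sketch of one way it might be done. It is not what I ended up using, and the names WORD_RE and count_words are made up for this example. It assumes the file is read as UTF-8, that an apostrophe or hyphen between letters should keep the parts together as one word, and that normalizing to NFC is acceptable so that accents stored as separate combining marks do not split a word.

```python
import collections
import re
import unicodedata

# One or more letters (any script), optionally followed by groups of
# apostrophe/hyphen + letters, so "signifie-t-elle" stays one token.
# [^\W\d_] means "a word character that is not a digit or underscore".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")


def count_words(lines):
    """Count lowercased words across an iterable of text lines."""
    return collections.Counter(
        word.lower()
        for line in lines
        # NFC merges "e" + combining accent into a single letter,
        # which the letter class above can then match.
        for word in WORD_RE.findall(unicodedata.normalize("NFC", line))
    )


sample = [
    "Comment t'appelles-tu ? Que signifie-t-elle, là-bas ?",
    "Über die Straße; el niño está aquí.",
]
counts = count_words(sample)
```

With a real file, the same function could be called as count_words(f) inside a with open(text_to_load, encoding='utf-8') as f: block, since a file object yields its lines one at a time. Punctuation such as ".", "," and "—" is skipped by the pattern itself, so the replace() calls are not needed in this variant.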
