
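I'm building a word list from a .txt file (containing 65,000 words) using collections.Counter and re.findall(). It works for English, but it ignores special characters from other languages, such as â, á, ü, and ö. In addition, I'd like combined forms such as "t'appele" and "signifie-t-elle" to be counted as single, distinct words. I've tried various regex combinations without success. Does anyone know how to make it include the special characters? My code is below.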

with open(text_to_load) as f:
    words_from_text = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line, re.UNICODE))

1 Answer


Thanks a lot, you really helped me greatly with the encoding. I had a further problem with \W in regex which doesn't seem to allow French characters. But I solved it this way instead:

words_from_text = {}  # maps each word to its number of occurrences
with open(text_to_load, "r", encoding="utf-8") as f:
    for line in f:
        # Treat periods, em dashes, and commas as word separators.
        for separator in (".", "—", ","):
            line = line.replace(separator, " ")
        for word in line.lower().split():
            # Start at 0 for unseen words, then add one occurrence.
            words_from_text[word] = words_from_text.get(word, 0) + 1
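For comparison, here is a minimal regex-based sketch that keeps the Counter from the question while treating apostrophes and hyphens between letters as part of the word. It assumes Python 3 (where str patterns are Unicode-aware by default) and a UTF-8 file; the names WORD_RE and count_words are hypothetical, and the rule for what counts as a "word" is an assumption that may need adjusting for your text.

```python
import collections
import re
import unicodedata

# Assumption: a "word" is a run of Unicode letters, optionally joined by
# single apostrophes or hyphens (e.g. "t'appele", "signifie-t-elle").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")


def count_words(lines):
    """Count lowercased word tokens across an iterable of text lines."""
    return collections.Counter(
        match.group(0).lower()
        # NFC normalization merges "e" + combining accent into one letter,
        # so decomposed input does not split words at the accent.
        for line in lines
        for match in WORD_RE.finditer(unicodedata.normalize("NFC", line))
    )


# Usage with a file (text_to_load is the path variable from the question):
# with open(text_to_load, encoding="utf-8") as f:
#     words_from_text = count_words(f)

sample = ["Comment t'appele-tu? Que signifie-t-elle, déjà?", "Déjà vu."]
counts = count_words(sample)
```

Unlike the split-based loop, this drops punctuation such as "?" and "!" without listing each character, at the cost of also discarding tokens that contain digits.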
Answered 2020-08-15T11:32:11.950