python - Text mining UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1671718: character maps to <undefined>

Question

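I have written code to create a frequency table, but it breaks at the line text_string = document_text.read().lower(). I even tried a try/except to catch the error, but it did not help.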

import re
import string
frequency = {}
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    try:
        count = frequency.get(word,0)
        frequency[word] = count + 1
    except UnicodeDecodeError:
        pass

frequency_list = frequency.keys()

for words in frequency_list:
    print (words, frequency[words])

score 2 · Accepted Answer

You open the file twice, and the second time you do not specify the encoding:

file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')

You should open the file as follows:

frequencies = {}
with open('EVG_text mining.txt', encoding="utf8", mode='r') as f:
    text = f.read().lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
...

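The second time you open the file, you do not specify an encoding, which is probably why it fails: without an explicit encoding, open() uses the platform's default, and on Windows that is often cp1252, the 'charmap' codec named in the traceback, which has no character for byte 0x81. The with statement also closes the file for you when the block ends, even if an error occurs. You can read more about it here: https://www.pythonforbeginners.com/files/with-statement-in-python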
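To make the failure concrete, here is a minimal sketch. It assumes the file is UTF-8 and that the platform default was cp1252; the sample string is made up for illustration and is not taken from the asker's file.

```python
# Hypothetical sample: "Á" is stored in UTF-8 as the two bytes 0xC3 0x81.
raw = "Á".encode("utf-8")

# Decoding with the correct codec round-trips cleanly.
decoded_utf8 = raw.decode("utf-8")

# cp1252 assigns no character to byte 0x81, so decoding raises.
try:
    raw.decode("cp1252")
    cp1252_failed = False
    message = ""
except UnicodeDecodeError as exc:
    cp1252_failed = True
    message = str(exc)

print(decoded_utf8, cp1252_failed)
print(message)
```

The printed message names the 'charmap' codec and byte 0x81, as in the question's traceback, which suggests the same mismatch is at work there.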

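You should probably also look into error handling, because your try/except does not enclose the line that actually raises the error: https://www.pythonforbeginners.com/error-handling/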
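As a rough sketch of where the try/except would have to go: the read itself must sit inside the try block. The temporary file and the forced cp1252 encoding are assumptions used only to reproduce the error on any platform.

```python
import os
import tempfile

# Hypothetical stand-in for the real file: UTF-8 text that contains byte 0x81.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("Álgebra and text mining".encode("utf-8"))
    path = tmp.name

try:
    # The read happens inside the try block, so a decoding failure is caught.
    # cp1252 is forced here to mimic a Windows default encoding.
    with open(path, mode="r", encoding="cp1252") as f:
        text = f.read().lower()
    caught = False
except UnicodeDecodeError as exc:
    caught = True
    print(f"Could not decode {path}: {exc}")
finally:
    os.remove(path)  # clean up the temporary file
```

Catching the exception only reports the problem; opening the file with the right encoding is still the actual fix.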

Code that ignores all decoding problems:

import re
import string  # Unused in this snippet; do you need it?

# Mode 'rb' makes open() return raw bytes, so nothing is decoded while reading.
with open('EVG_text mining.txt', mode='rb') as f:
    raw = f.read()

# 'ignore' drops undecodable bytes; use 'replace' to insert U+FFFD (the replacement character) instead.
# lower() is needed so that the [a-z] pattern also matches capitalised words.
text = raw.decode('utf-8', 'ignore').lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)

frequencies = {}
# The try/except in the question had no effect here: the error was raised while reading the file, not in this loop.
for word in match_pattern:
    frequencies[word] = frequencies.get(word, 0) + 1

for word, freq in frequencies.items():
    print(word, freq)
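The choice between 'ignore' and 'replace' can change the word counts. A small sketch with made-up bytes (not from the asker's file) shows the difference:

```python
import re

# Hypothetical input: ASCII text with one stray byte (0x81) that is invalid UTF-8 on its own.
raw = b"text\x81mining"

ignored = raw.decode("utf-8", errors="ignore")    # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD

# Same pattern as in the answer above.
words_ignored = re.findall(r"\b[a-z]{3,15}\b", ignored)
words_replaced = re.findall(r"\b[a-z]{3,15}\b", replaced)

print(repr(ignored), words_ignored)
print(repr(replaced), words_replaced)
```

With 'ignore' the two neighbouring words fuse into one token, while 'replace' leaves a non-word character between them, so 'replace' may be the safer option for a frequency table.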

score -1 · Accepted Answer

To read a file that contains some special characters, use the encoding 'latin1' or 'unicode_escape'.
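One thing worth checking before relying on 'latin1': it maps every possible byte to a character, so it never raises, but if the file is really UTF-8 the non-ASCII characters come out garbled. A minimal sketch, assuming UTF-8 content and using a made-up sample word:

```python
# Hypothetical UTF-8 content containing a non-ASCII character.
raw = "Álgebra".encode("utf-8")

as_latin1 = raw.decode("latin1")  # never fails: each byte becomes one code point
as_utf8 = raw.decode("utf-8")     # the intended text

print(repr(as_latin1))
print(repr(as_utf8))
```

For a pattern that only keeps [a-z] words this garbling may not matter much, but it would for any analysis of accented words.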

Answered 2020-11-15T16:59:36.600
