
I am trying to tokenize Thai text in Python using deepcut, but I am getting a UnicodeDecodeError.

Here is what I tried:

import deepcut

thai = 'ตัดคำได้ดีมาก'
result = deepcut.tokenize(thai)

Expected output:

['ตัดคำ', 'ได้', 'ดี', 'มาก']

I then tried:

for i in result:
  print(i.decode('utf-8'))

Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data

print([i for i in result])

Output: ['\xe0', '\xb8', '\x95', '\xe0', '\xb8', '\xb1', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\x84', '\xe0', '\xb8', '\xb3', '\xe0', '\xb9', '\x84', '\xe0', '\xb8', '\x94', '\xe0', '\xb9', '\x89', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\xb5', '\xe0', '\xb8', '\xa1', '\xe0', '\xb8', '\xb2', '\xe0', '\xb8', '\x81']
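The byte-per-element output above suggests the Thai string was handled as raw UTF-8 bytes (typical of Python 2 `str` literals) and split byte by byte. As a sketch, the original text can be stitched back together from those pieces; this uses only the first nine byte values from the output above, which form the first token:

```python
# Recovery sketch: each list element above is a single byte of the UTF-8
# encoding, so collecting the byte values and decoding them as UTF-8
# reconstructs the original Thai text.
parts = ['\xe0', '\xb8', '\x95', '\xe0', '\xb8', '\xb1', '\xe0', '\xb8', '\x94']
raw = bytes(ord(c) for c in parts)  # b'\xe0\xb8\x95\xe0\xb8\xb1\xe0\xb8\x94'
print(raw.decode('utf-8'))          # ตัด
```

This only recovers the text, not the token boundaries; the real fix is to pass a proper Unicode string to `deepcut.tokenize` (e.g. by running under Python 3), as the answer below does.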

How can I get it to display the correct tokenization result, or is there a better way to tokenize Thai text?


1 Answer


You don't need to convert it back to UTF-8:

Just try this:

import deepcut

thai = 'ตัดคำได้ดีมาก'
result = deepcut.tokenize(thai)

print([i for i in result])

Output:

['ตัด', 'คำ', 'ได้', 'ดี', 'มาก']
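The reason no decoding is needed: in Python 3, string literals are already Unicode `str`, and `.decode()` exists only on `bytes`. A minimal stdlib-only illustration:

```python
token = 'ตัด'                        # Python 3 str: already Unicode text
print(isinstance(token, str))        # True -- nothing to decode
raw = token.encode('utf-8')          # bytes: b'\xe0\xb8\x95\xe0\xb8\xb1\xe0\xb8\x94'
print(raw.decode('utf-8') == token)  # True -- decode applies to bytes, not str
```

So when the tokenizer returns `str` tokens, calling `.decode('utf-8')` on them is both unnecessary and, in Python 3, an `AttributeError`.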

Apart from that, you could also try this Thai NLP module.

Answered 2018-03-20T13:09:16.100