python - Microsoft Azure Text Analytics Cognitive Services encoding problem

Question

To use their Text Analytics, Azure requires a JSON document that looks like this:

document = {
  "documents" :[
    {"id": "1", "language": "en", "text": "I had a wonderful experience! The rooms were wonderful and the staff was helpful."},
    {"id": "2", "language": "en", "text": "I had a terrible time at the hotel. The staff was rude and the food was awful."},
    {'id': '3', 'language': 'es', 'text': 'Los caminos que llevan hasta Monte Rainier son espectaculares y hermosos.'},  
    {'id': '4', 'language': 'es', 'text': 'La carretera estaba atascada. Había mucho tráfico el día de ayer.'}]}

The problem I'm currently running into is that the last record, id: 4, causes this error:

b'{"code":"BadRequest","message":"Invalid request","innerError":{"code":"InvalidRequestBodyFormat","message":"Request body format is wrong. 
Make sure the json request is serialized correctly and there are no null members."}}'

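The JSON is correctly formatted: it comes straight from their site, and it runs perfectly well without the last record. I did some more testing and found that í and á are the characters that trigger the error. To make sure, I even tested it with English words such as résumé or fiancé, and still got the same error. This makes no sense to me, because Spanish is one of the languages Text Analytics supports, and the text's language is even declared as Spanish before processing.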

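So my question is: am I missing something before passing my data to Azure? Am I supposed to convert the text, change the encoding, or remove these characters, or should Azure's API be able to handle them?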

Edit: for more background, I followed the instructions on their site to set this up for use with Python. Apart from what I mentioned, it works fine.

score 0 · Accepted Answer

Thanks to @ADyson, I figured it out.

You have to make sure the input is encoded as UTF-8 or UTF-16 for it to work correctly.
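The code that sent the request is not shown in the question, so the following is only a minimal sketch of one way to apply this in Python. It assumes the request body is built with the standard `json` module; the helper name `build_request_body` and the subscription key placeholder are illustrative, not part of the original post. The idea is to serialize the dict and encode it explicitly as UTF-8 bytes, rather than relying on whatever default encoding the HTTP client would otherwise pick.

```python
import json

# The record from the question that triggered the error (contains "í" and "á").
documents = {
    "documents": [
        {
            "id": "4",
            "language": "es",
            "text": "La carretera estaba atascada. Había mucho tráfico el día de ayer.",
        }
    ]
}


def build_request_body(payload: dict) -> bytes:
    """Serialize the payload to JSON and encode it explicitly as UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    # ensure_ascii=False keeps the accented characters as real characters,
    # so the explicit .encode("utf-8") decides the bytes that go on the wire.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


body = build_request_body(documents)

# Headers one might send with the body. The key value is a placeholder;
# declaring the charset makes the encoding explicit to the server.
headers = {
    "Content-Type": "application/json; charset=utf-8",
    "Ocp-Apim-Subscription-Key": "<your-subscription-key>",
}

# Sanity check: the bytes decode back to the original payload.
assert json.loads(body.decode("utf-8")) == documents
```

The resulting `body` (a `bytes` object) and `headers` can then be passed to whichever HTTP client is in use. Leaving `ensure_ascii` at its default of `True` should also work, since `json.dumps` then escapes every non-ASCII character as `\uXXXX` and the body is plain ASCII.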
