django - Error "'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)" when rebuilding the haystack index

Question

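I have an application in which I have to store people's names and make them searchable. The stack is Python (v2.7.6), Django (v1.9.5), and Django REST framework. The DBMS is PostgreSQL (v9.2). Since user names can be in Arabic, we use UTF-8 as the database encoding. For search we use Haystack (v2.4.1) with Amazon Elasticsearch for indexing. The index built fine a few days ago, but now when I try to rebuild it with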

python manage.py rebuild_index

it fails with the following error:

'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)

The full traceback is:

  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 188, in handle_label
    self.update_backend(label, using)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 233, in update_backend
    do_update(backend, index, qs, start, end, total, verbosity=self.verbosity, commit=self.commit)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 96, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python2.7/dist-packages/haystack/backends/elasticsearch_backend.py", line 193, in update
    bulk(self.conn, prepped_docs, index=self.index_name, doc_type='modelresult')
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 188, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 160, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 85, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 795, in bulk
    doc_type, '_bulk'), params=params, body=self._bulk_body(body))
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 329, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 68, in perform_request
    response = self.session.request(method, url, data=body, timeout=timeout or self.timeout)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 330, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 558, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 353, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python2.7/httplib.py", line 979, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1013, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 833, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)

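My guess is that the index built fine while there were no Arabic characters in our database, and that it now fails because users have entered Arabic characters.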

score 1 · Accepted Answer

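If you are using the requests-aws4auth package, you can use the following wrapper class in place of the AWS4Auth class. It encodes the headers created by AWS4Auth to byte strings, which avoids the UnicodeDecodeError downstream.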

from requests_aws4auth import AWS4Auth

class AWS4AuthEncodingFix(AWS4Auth):
    """AWS4Auth variant that returns byte-string (str) headers on Python 2."""

    def __call__(self, request):
        request = super(AWS4AuthEncodingFix, self).__call__(request)

        # Iterate over a snapshot of the names, since the helper below
        # deletes and re-adds entries.
        for header_name in list(request.headers):
            self._encode_header_to_utf8(request, header_name)

        return request

    def _encode_header_to_utf8(self, request, header_name):
        value = request.headers[header_name]

        # `unicode` exists only on Python 2, which is what this fix targets.
        if isinstance(value, unicode):
            value = value.encode('utf-8')

        if isinstance(header_name, unicode):
            del request.headers[header_name]
            header_name = header_name.encode('utf-8')

        request.headers[header_name] = value

score 0 · Accepted Answer

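I suspect you are right about the Arabic characters that now appear in the database.

It may also be related to this issue. The first link seems to describe some kind of workaround, but without many details. I suspect that what the author meant by

The correct workaround is to use the unicode type instead of str, or to set the default encoding correctly to (I assume) utf-8.

is that you need to check that the machine it runs on has LANG=en_US.UTF-8, or at least some UTF-8 LANG.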
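As a quick way to run that check from Python rather than the shell, here is a minimal sketch. The helper names `describe_text_encodings` and `lang_is_utf8` are made up for this example, and the UTF-8 test is a simple suffix match that does not handle locale modifiers such as `@euro`. Note that on Python 2 `sys.getdefaultencoding()` normally reports `ascii` whatever LANG is set to, so a UTF-8 LANG on its own may not be enough.

```python
import locale
import os
import sys


def describe_text_encodings(environ=None):
    """Collect the settings that usually matter for text encoding."""
    environ = os.environ if environ is None else environ
    return {
        'LANG': environ.get('LANG'),
        'LC_ALL': environ.get('LC_ALL'),
        # 'ascii' on a stock Python 2, 'utf-8' on Python 3.
        'default_encoding': sys.getdefaultencoding(),
        'preferred_encoding': locale.getpreferredencoding(),
    }


def lang_is_utf8(environ=None):
    """Return True if LC_ALL (or, failing that, LANG) names a UTF-8 locale."""
    environ = os.environ if environ is None else environ
    # LC_ALL takes precedence over LANG when it is set and non-empty.
    value = environ.get('LC_ALL') or environ.get('LANG') or ''
    return value.lower().replace('-', '').endswith('utf8')


if __name__ == '__main__':
    print(describe_text_encodings())
    print(lang_is_utf8())
```

Run it in the same environment (same user, same shell or service configuration) that runs `python manage.py rebuild_index`, since LANG can differ between an interactive shell and a cron job or daemon.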

score 0 · Accepted Answer

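Elasticsearch supports different encodings, so having Arabic characters should not be a problem.

Since you are using AWS, I assume you also use some authorization library such as requests-aws4auth. If that is the case, note that some unicode headers, such as u'x-amz-date', are added during authorization. This is a problem because Python's httplib does the following in _send_output(): msg = "\r\n".join(self._buffer), where _buffer is a list of the HTTP headers. Having unicode headers makes msg be of type <type 'unicode'> when it really should be of type str (here is a similar issue with a different auth library).

The line that raises the exception is msg += message_body: Python needs to decode message_body to unicode so that it matches the type of msg. Because py-elasticsearch has already taken care of the encoding, the body is already an encoded byte string, so the data is effectively converted twice, and the implicit ASCII decode fails on the first non-ASCII byte (as explained here).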
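To make the failure mode concrete, here is a minimal sketch that spells out the ASCII decode Python 2 performs implicitly when a unicode header block is concatenated with a byte-string body. The function name and the sample header and body values are made up for illustration; this is not httplib's actual code. Writing the decode explicitly also makes the snippet behave the same on Python 3.

```python
def concat_like_py2_httplib(header_block, body_bytes):
    """Mimic `msg += message_body` when msg is unicode and the body is bytes.

    Python 2 coerces the byte string with the default 'ascii' codec before
    concatenating; the explicit decode below stands in for that coercion.
    """
    return header_block + body_bytes.decode('ascii')


# Hypothetical unicode header block, as produced when an auth library
# adds a header such as u'x-amz-date'.
headers = u'x-amz-date: 20160101T000000Z\r\n\r\n'

# A pure-ASCII body survives the implicit decode.
ok = concat_like_py2_httplib(headers, b'{"name": "Jose"}')

# A UTF-8 encoded body containing a non-ASCII character does not.
body = u'Jos\u00e9'.encode('utf-8')  # the e-acute becomes bytes 0xc3 0xa9
try:
    concat_like_py2_httplib(headers, body)
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(repr(error))
```

If the headers are byte strings instead, msg stays a str on Python 2 and the body is appended without any decoding, which is why encoding the headers or switching the auth library can avoid the error.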

You may want to try replacing the auth library (for example with DavidMuller/aws-requests-auth) and see whether that fixes the problem.
