python - Textract UnsupportedDocumentException

Question

I am trying to use boto3 to run a Textract detect_document_text request.

I am using the following code:

import boto3

client = boto3.client('textract')
response = client.detect_document_text(
    Document={
        'Bytes': image_b64['document_b64']
    }
)

Here image_b64['document_b64'] is a base64 image string that I generated with, for example, the https://base64.guru/converter/encode/image website.

But I am getting the following error:

UnsupportedDocumentException

What am I doing wrong?

score 0 · Accepted Answer

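Per the documentation:

If you're using an AWS SDK to call Amazon Textract, you might not need to base64-encode image bytes passed using the Bytes field.

Base64 encoding is only required when calling the REST API directly. When using the Python or NodeJS SDK, pass native bytes (raw binary bytes).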
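
To make the difference concrete, here is a minimal sketch of turning a base64 string back into raw bytes and checking that the result looks like an image. The helper names b64_to_raw_bytes and looks_like_png_or_jpeg are made up for illustration, and the check only covers PNG and JPEG signatures, which is an assumption about the input rather than the full list of formats Textract accepts.

```python
import base64
import binascii

# Leading "magic" bytes of two image formats, used here as a sanity check.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def b64_to_raw_bytes(b64_text):
    """Decode base64 text (optionally a data URI) into raw binary bytes."""
    if isinstance(b64_text, bytes):
        b64_text = b64_text.decode("ascii")
    # Drop a "data:image/png;base64," style prefix if one is present.
    if b64_text.startswith("data:") and "," in b64_text:
        b64_text = b64_text.split(",", 1)[1]
    # Remove line breaks and spaces that some encoders insert.
    b64_text = "".join(b64_text.split())
    try:
        return base64.b64decode(b64_text, validate=True)
    except binascii.Error as exc:
        raise ValueError("input is not valid base64") from exc


def looks_like_png_or_jpeg(raw):
    """Return True if the bytes start with a PNG or JPEG signature."""
    return raw.startswith(PNG_MAGIC) or raw.startswith(JPEG_MAGIC)
```

The decoded bytes are what would go into Document={'Bytes': ...}. If looks_like_png_or_jpeg returns False for them, the input was probably still encoded, or was never a PNG or JPEG in the first place.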

score 0 · Accepted Answer

For future reference, I solved this problem as follows:

import base64
import boto3

client = boto3.client('textract')
# Decode the base64 string back into raw binary bytes before sending it
image_bytes = base64.b64decode(image_b64['document_b64'])
response = client.detect_document_text(
    Document={
        'Bytes': image_bytes
    }
)

score 0 · Accepted Answer

If you are working with images (.jpg or .png) in a Jupyter Notebook with Boto3, you can use:

import boto3

# images_path is the path to a local .jpg or .png file
with open(images_path, "rb") as img_file:
    img_bytes = img_file.read()  # raw binary bytes, no base64 encoding

textract = boto3.client('textract')
response = textract.detect_document_text(Document={'Bytes': img_bytes})
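
Once the call succeeds, the detected text is in the Blocks list of the response. Below is a rough sketch of collecting the LINE blocks. The helper name extract_lines and the sample_response dict are hypothetical and hand-written for illustration. A real response contains more fields per block.

```python
def extract_lines(response):
    """Collect the text of every LINE block from a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hypothetical, hand-written response fragment used only for illustration.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
    ]
}
print(extract_lines(sample_response))
```

Filtering on LINE avoids getting each piece of text twice, since the same words also appear as separate WORD blocks.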
