python - Unsupported document format when using Amazon Textract

Question

I am using Amazon Textract with boto3. When I try to parse a PDF file stored in Amazon S3, I get the error "Request has unsupported document format". I am fairly new to this, and the Textract documentation says that PDF files are supported.

This is the code I am using.

import boto3
textractClient = boto3.client('textract',region_name='us-east-1')
response = textractClient.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucketName', 'Name': 'filename.pdf'}})
blocks = response['Blocks']

This gives me the error "Request has unsupported document format".

score 23 · Accepted Answer

detect_document_text() is a synchronous API that only supports PNG or JPG images.

If you want to process PDF files, you should use the asynchronous API called start_document_text_detection().

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection
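
For illustration, here is a minimal sketch of the asynchronous flow: start the job, poll get_document_text_detection() until it finishes, then follow NextToken through the paginated results. The helper name extract_pdf_lines, the polling interval, and the poll limit are my own assumptions, not part of boto3. The bucket and file names are the placeholders from the question. Polling is used only to keep the example short; an SNS NotificationChannel is the alternative for longer jobs.

```python
import time


def extract_pdf_lines(client, bucket, name, poll_seconds=5.0, max_polls=60):
    """Run asynchronous text detection on a PDF in S3 and return its LINE texts.

    Hypothetical helper: poll_seconds and max_polls are illustrative defaults.
    """
    # Start the asynchronous job; the call returns immediately with a JobId.
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Anything other than SUCCEEDED (FAILED, PARTIAL_SUCCESS) is treated as an error here.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(
            f"Textract job {job_id} ended with status {result['JobStatus']}: "
            f"{result.get('StatusMessage', '')}"
        )

    # Results are paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in result.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            break
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)
    return lines


def main():
    import boto3  # imported here so the helper above can be tried with a stub client

    client = boto3.client("textract", region_name="us-east-1")
    for line in extract_pdf_lines(client, "bucketName", "filename.pdf"):
        print(line)
```

Calling main() requires AWS credentials with permission to use Textract and read the bucket.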
