node.js - Reading a PDF asynchronously with Textract

Question

From the Textract documentation: Documents for synchronous operations can be in PNG or JPEG format. Documents for asynchronous operations can also be in PDF format.

I have a Node.js application in which I use Textract asynchronously to read a PDF file. My code looks like this:

import * as AWS from 'aws-sdk';

const textract = new AWS.Textract({ region: '<REGION>' });

export const callTextract = (file: File, uuid: string): Promise<any> => {
  return new Promise<any>((resolve, reject) => {
    const params = {
      Document: {
        Bytes: file,
      },
    };
    textract.detectDocumentText(params, (err, data) => {
      ....
      resolve(data);
    });
  })
}

The file here has been read from the operating system and is a Buffer. From its first 4 bytes I can confirm that it is a PDF file (see "Detecting file type from buffer in node js?"):

 <Buffer 25 50 44 46 ... >
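The check described above can be sketched as a small function. This is only an illustration: `isPdf` and `PDF_MAGIC` are hypothetical names, not part of the question's code, and the helper takes a `Uint8Array` (which a Node.js `Buffer` also is) so that it needs no imports.

```typescript
// "%PDF" in ASCII (25 50 44 46): the signature a PDF file normally starts with.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46];

// Hypothetical helper: returns true when the first four bytes match "%PDF".
// It only inspects the header; it does not validate the rest of the file.
function isPdf(bytes: Uint8Array): boolean {
  return (
    bytes.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((expected, i) => bytes[i] === expected)
  );
}
```

Calling `isPdf(file)` on the Buffer shown above would return `true`, which matches what the question reports.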

The error I get is UnsupportedDocumentException.

score 0 · Accepted Answer

You can supply the Bytes field in both the synchronous and the asynchronous API, but the definition of the Bytes field is the same in both:

A blob of base64-encoded document bytes. The maximum size of a document that's provided in a blob of bytes is 5 MB. The document bytes must be in PNG or JPEG format.

So you cannot pass a PDF as the value of the Bytes field.

From the documentation: https://docs.aws.amazon.com/textract/latest/dg/API_Document.html#API_Document_Contents
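The constraints quoted above could be checked before calling the API, roughly as follows. This is a sketch under assumptions: `bytesFieldProblem` and the constants are hypothetical names, "5 MB" is read as 5 * 1024 * 1024 bytes (the quoted text does not say which definition applies), and the format check looks only at the standard PNG and JPEG file signatures.

```typescript
// Standard file signatures for PNG and JPEG.
const PNG_MAGIC = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const JPEG_MAGIC = [0xff, 0xd8, 0xff];

// Assumption: "5 MB" interpreted as 5 * 1024 * 1024 bytes.
const MAX_BYTES = 5 * 1024 * 1024;

// True when `bytes` begins with every byte of `prefix`.
function hasPrefix(bytes: Uint8Array, prefix: number[]): boolean {
  return bytes.length >= prefix.length && prefix.every((b, i) => bytes[i] === b);
}

// Hypothetical pre-flight check: returns a reason the data would not satisfy
// the quoted Bytes constraints, or null when no problem is found.
function bytesFieldProblem(bytes: Uint8Array): string | null {
  if (bytes.length > MAX_BYTES) return 'larger than 5 MB';
  if (!hasPrefix(bytes, PNG_MAGIC) && !hasPrefix(bytes, JPEG_MAGIC)) {
    return 'not PNG or JPEG';
  }
  return null;
}
```

For the PDF Buffer in the question, this check would report `'not PNG or JPEG'`, which is consistent with the UnsupportedDocumentException.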

score 0 · Accepted Answer

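detectDocumentText() is synchronous. The asynchronous version is startDocumentTextDetection().

See the documentation:

Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format.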

...

DetectDocumentText is a synchronous operation. To analyze documents asynchronously, use StartDocumentTextDetection.

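Note that a language's asynchronous mechanism is not the same thing as an asynchronous API call. With an asynchronous API there will always be at least two calls. In this case, the other one is getDocumentTextDetection().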
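The two-call pattern might look roughly like the sketch below. It makes several assumptions: the PDF has already been uploaded to S3 (StartDocumentTextDetection takes a DocumentLocation pointing at an S3 object rather than raw bytes), the job is polled with GetDocumentTextDetection instead of using a completion notification, and anything other than SUCCEEDED is treated as a failure. `readPdfLines`, `TextractLike` and `AwsRequest` are hypothetical names; the interface only mirrors the two AWS SDK v2 calls used, so the real `AWS.Textract` client is assumed to fit it.

```typescript
// Local description of the two SDK v2 calls used, so the sketch type-checks
// without the SDK installed. Assumed to match AWS.Textract's `.promise()` style.
interface AwsRequest<T> { promise(): Promise<T>; }

interface TextractLike {
  startDocumentTextDetection(params: {
    DocumentLocation: { S3Object: { Bucket: string; Name: string } };
  }): AwsRequest<{ JobId?: string }>;
  getDocumentTextDetection(params: {
    JobId: string;
    NextToken?: string;
  }): AwsRequest<{
    JobStatus?: string;
    StatusMessage?: string;
    NextToken?: string;
    Blocks?: { BlockType?: string; Text?: string }[];
  }>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: starts a job for a PDF stored in S3, polls until the
// job leaves IN_PROGRESS, then collects the text of every LINE block.
async function readPdfLines(
  textract: TextractLike,
  bucket: string,
  key: string,
  pollIntervalMs = 5000,
): Promise<string[]> {
  // Call 1: start the job; only a JobId comes back.
  const { JobId } = await textract
    .startDocumentTextDetection({
      DocumentLocation: { S3Object: { Bucket: bucket, Name: key } },
    })
    .promise();
  if (!JobId) throw new Error('Textract did not return a JobId');

  const lines: string[] = [];
  let nextToken: string | undefined;
  for (;;) {
    // Call 2 (repeated): fetch the job status and, once done, result pages.
    const page = await textract
      .getDocumentTextDetection({ JobId, NextToken: nextToken })
      .promise();
    if (page.JobStatus === 'IN_PROGRESS') {
      await sleep(pollIntervalMs);
      continue;
    }
    if (page.JobStatus !== 'SUCCEEDED') {
      throw new Error(`Textract job ${page.JobStatus}: ${page.StatusMessage ?? 'no message'}`);
    }
    for (const block of page.Blocks ?? []) {
      if (block.BlockType === 'LINE' && block.Text) lines.push(block.Text);
    }
    if (!page.NextToken) return lines; // last page of results
    nextToken = page.NextToken;
  }
}
```

With the question's client this would be called as `readPdfLines(textract, '<BUCKET>', '<KEY>')`, where both placeholders stand for wherever the PDF was uploaded.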

...although I think this is yet another example of poor AWS documentation.
