
I have modified this sample to read PDFs that contain tabular data. I would like to keep the tabular structure of the original PDF during the human review process. I notice that the custom worker task template uses the crowd-entity-annotation element, which seems to handle only plain text. I am aware that the human review process reads from an S3 key containing the raw text written by the Textract step.

I have been considering writing the text to S3 with tabulate, but I don't think that is the best solution. I would like to keep the structure and still be able to annotate custom entities.


1 Answer


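Comprehend now natively supports custom entity detection on PDF documents. To use it, you can try the following steps:

  1. Follow this GitHub README to start the annotation process for PDF documents.
  2. Once the annotations have been produced, you can use the Comprehend CreateEntityRecognizer API to train a custom entity model for semi-structured documents.
  3. Once the entity recognizer is trained, you can use the StartEntitiesDetectionJob API to run inference on PDF documents.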
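
Steps 2 and 3 could be sketched with boto3 roughly as follows. This is a minimal sketch, not a definitive implementation: every name, ARN, S3 URI, the entity types, and the attribute names are hypothetical placeholders that would come from your own annotation job, and the `"en"` language code is an assumption. The request dictionaries are built separately so they can be inspected before anything is sent to AWS.

```python
import time


def build_training_request(name, role_arn, manifest_uri, annotations_uri,
                           documents_uri, entity_types, attribute_names):
    """Keyword arguments for Comprehend's create_entity_recognizer call."""
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",  # assumption: English documents
        "InputDataConfig": {
            "DataFormat": "AUGMENTED_MANIFEST",
            "EntityTypes": [{"Type": t} for t in entity_types],
            "AugmentedManifests": [{
                "S3Uri": manifest_uri,  # manifest produced by the annotation job
                "AttributeNames": attribute_names,
                "AnnotationDataS3Uri": annotations_uri,
                "SourceDocumentsS3Uri": documents_uri,  # the original PDFs
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
            }],
        },
    }


def build_inference_request(job_name, role_arn, recognizer_arn,
                            input_uri, output_uri):
    """Keyword arguments for Comprehend's start_entities_detection_job call."""
    return {
        "JobName": job_name,
        "DataAccessRoleArn": role_arn,
        "EntityRecognizerArn": recognizer_arn,
        "LanguageCode": "en",  # assumption: English documents
        "InputDataConfig": {"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_uri},
    }


def train_then_detect(training_request, job_name, role_arn, input_uri, output_uri):
    """Train the recognizer, wait for it to finish, then start an inference job."""
    import boto3  # imported here so the builders above work without AWS access

    comprehend = boto3.client("comprehend")
    arn = comprehend.create_entity_recognizer(**training_request)["EntityRecognizerArn"]

    # Training is asynchronous, so poll until it leaves the in-progress states.
    status = "SUBMITTED"
    while status in ("SUBMITTED", "TRAINING"):
        time.sleep(60)
        props = comprehend.describe_entity_recognizer(EntityRecognizerArn=arn)
        status = props["EntityRecognizerProperties"]["Status"]
    if status != "TRAINED":
        raise RuntimeError(f"Recognizer training ended with status {status}")

    request = build_inference_request(job_name, role_arn, arn, input_uri, output_uri)
    return comprehend.start_entities_detection_job(**request)["JobId"]
```

Because the recognizer is trained on the PDFs themselves rather than on text extracted from them, there should be no need to flatten the tables into plain text first.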
answered 2021-11-01T17:21:14.410