.net - How to export a table as CSV from a document (PDF/image) using the AWS Textract service and .NET

Question

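I am trying to extract tables and data from a PDF file with the AWS Textract service from C#/.NET, using DetectDocument (asynchronous).

The plain data extraction works, but I can't figure out how to use AnalyzeDocument to extract the tables in the PDF and export them to a CSV file.

Reading the AWS documentation, I found that the CSV export example is written in Python, not .NET. Reference link: https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html

I tried to follow the Python code and port it to .NET, but without success.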

score 0 · Accepted Answer

We can use the code below to loop over the relationships of a block returned by Textract's GetDocumentTextAnalysis() and collect all the child nodes linked to it.

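// Assumption: ri and row are declared earlier in the enclosing Table
// constructor of the linked sample; they are repeated here for context.
var ri = 1;
var row = new Row();

var relationships = block.Relationships;
if (relationships != null && relationships.Count > 0) {
    relationships.ForEach(r => {
        // The CHILD relationship of a TABLE block lists the IDs of its CELL blocks.
        if (r.Type == "CHILD") {
            r.Ids.ForEach(id => {
                var cell = new Cell(blocks.Find(b => b.Id == id), blocks);
                // A higher row index means the previous row is complete.
                if (cell.RowIndex > ri) {
                    this.Rows.Add(row);
                    row = new Row();
                    ri = cell.RowIndex;
                }
                row.Cells.Add(cell);
            });
            // Add the last row, which the loop above never flushes.
            if (row != null && row.Cells.Count > 0)
                this.Rows.Add(row);
        }
    });
}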

For reference, the full code is at the link below:

https://github.com/aws-samples/amazon-textract-code-samples/blob/master/src-csharp/TextractExtensions/Table.cs
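Once the cells are grouped into rows, the remaining step is writing them out as CSV. Here is a minimal sketch of that step, not taken from the linked sample. `CellText` and `CsvExport` are hypothetical names introduced for illustration. It assumes each cell's text has already been resolved from its child WORD blocks, and that row and column indexes are 1-based as Textract reports them. It needs C# 9 or later for the `record` type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a Textract CELL block whose text is already resolved.
// RowIndex and ColumnIndex are assumed to be 1-based.
public record CellText(int RowIndex, int ColumnIndex, string Text);

public static class CsvExport
{
    // Quote a field when it contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes (RFC 4180 style).
    public static string Escape(string field)
    {
        if (field == null) field = string.Empty;
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Group cells by row, place each cell at its column position,
    // and emit one CSV line per row. Missing cells become empty fields.
    public static string ToCsv(IEnumerable<CellText> cells)
    {
        var list = cells.ToList();
        if (list.Count == 0) return string.Empty;

        int columns = list.Max(c => c.ColumnIndex);
        var sb = new StringBuilder();
        foreach (var row in list.GroupBy(c => c.RowIndex).OrderBy(g => g.Key))
        {
            var fields = new string[columns];
            foreach (var c in row)
                fields[c.ColumnIndex - 1] = Escape(c.Text);
            // string.Join treats null elements as empty strings.
            sb.Append(string.Join(",", fields)).Append('\n');
        }
        return sb.ToString();
    }
}
```

With something like this in place, `File.WriteAllText("table.csv", CsvExport.ToCsv(cells))` would write one table to disk. Placing each cell by its column index, rather than by arrival order, keeps columns aligned even if a row is missing a cell. Merged cells (row or column spans) are not handled here.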
