c# - Payable Invoice Capturing OR extracting automation

Question

I am creating a desktop (WinForms) application that reads TIF/PDF payable invoices and extracts all the invoice information to store in a database.

I can read the standard barcodes (QR Code, Code 39, etc.) and some of the standard payable-invoice fields (Invoice Date, Company Name, Address) with OCR (by OCR-ing a specific region of the image), but I am unable to capture line items and amounts correctly.

I extract information in two phases:
1. Read specific regions based on the template (user-mapped regions for specific fields)
2. OCR the whole page and search for standard payable-invoice field names and values

I am aware of the following 3 approaches:
1. Create a template for one type of invoice and process all invoices with it.
2. A neural-network-based engine, which needs to be trained with sample data so that it works from patterns.
3. Form processing, a kind of OMR: the OCR looks at the exact coordinates where the fields were placed on the form (during form design).

Question:
How can I extract payable invoice data using OCR or some intelligent reader?
Primarily I am looking for an algorithm (C# + OCR engine) or a general philosophy of payable invoice capturing, but a reference to an SDK with this feature or a solid commercial product would be helpful too.

I googled and found Abbyy FlexiCapture Engine and IRIS Capture & Extract, which look somewhat promising, but they are mostly based on templates or training. They claim that no template or training is required, but nothing looks like 100% automatic capture.

Kindly refer me to a product (at least with a free trial), an SDK, or an example/sample.

score 9 · Accepted Answer

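For sure, by 2018 the situation has improved somewhat. Let me review the main approaches today:

Still raw OCR engines (tesseract, Abbyy, Google OCR, ...) and regular expressions (for some very limited use cases, this may still work fine)
Abbyy FlexiCapture Engine - still going strong, but still template-based, which is fine if you are willing to define a new template for each specific invoice format
Rossum Elis (invoices), TagGun (receipts), ... - APIs based on pre-trained machine learning models, i.e. usable and working right away, with a free monthly volume
LucidTech, Itemize, ... - less accessible APIs with similar functionality (you need to go through a demo and a sales process)
Datamolino, CloudFactory, ... - APIs where humans manually transcribe the data behind the scenes (different latency, pricing and accuracy structure)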
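To make the first item in the list concrete (raw OCR output plus regular expressions), here is a minimal sketch. It is in Python for brevity, and the same patterns port to `System.Text.RegularExpressions` in C#. The field labels, the patterns and the sample text are all assumptions for illustration; real invoices would need their own patterns, and this does not attempt line items at all.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns: the labels and formats below are assumptions and
# would need tuning for real invoices.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.I),
    "invoice_date": re.compile(
        r"(?:Invoice\s*)?Date\s*:?\s*(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})", re.I),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(
        r"\bTotal(?:\s+(?:Due|Amount))?\s*:?\s*[$€£]?\s*(\d[\d,]*\.\d{2})", re.I),
}


def extract_fields(ocr_text: str) -> Dict[str, Optional[str]]:
    """Return the first match per field, or None when the label is not found."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        result[name] = match.group(1) if match else None
    return result


# Made-up OCR output for a single page.
sample = """ACME Ltd.
Invoice No: INV-2018-0042
Date: 12/03/2018
Subtotal: 1,000.00
Total: 1,200.00
"""
fields = extract_fields(sample)
```

This only works while the labels stay predictable, which is why it is limited to narrow use cases.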

score 5 · Accepted Answer

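I did some R&D and came to the conclusion that there is no dedicated invoice capture SDK that achieves 95-100% automation. There are only OCR/ICR and imaging SDKs, which help convert images to text/readable documents; the rest of the capture/data extraction relies entirely on custom search algorithms (as ilya-evdokimov mentioned, you need a hybrid of steps: zonal OCR, full-text OCR, and then intelligent data extraction). I looked at some very popular products; they claim automatic capture, but in the end they only extract the standard invoice fields automatically, and the rest of the work is the same, whether by zonal OCR or by hand. This is what I suggest, although more improvements are possible depending on the nature of the application: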

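Store the key fields (e.g. the customer's VAT number) in a database/XML file.
Run full-page OCR, find the key fields, match them against the customer list, and identify/classify the type of document/image.
Once the document type is identified (payable invoice, receivable invoice, etc.), look for the standard fields.
Allow the user to create predefined templates for each type of document for each company (the sender of the invoice).
Compare the results of both algorithms (full-text OCR and zonal) and keep the more accurate result.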
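Two of the steps above can be sketched in a few lines: matching a stored key field (the VAT number) against the full-page OCR text to identify the sender, and keeping the better of the zonal and full-text reads per field. This is Python for brevity and only a sketch under assumptions: `KNOWN_SENDERS`, `FieldRead`, `identify_sender` and `merge_reads` are hypothetical names, the VAT numbers are made up, and "more accurate" is approximated by a per-field confidence score that your OCR engine would have to supply.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical stand-in for the database/XML store of known senders,
# keyed by VAT number (values are made up).
KNOWN_SENDERS = {
    "GB123456789": "ACME Ltd.",
    "DE811907980": "Beispiel GmbH",
}


@dataclass
class FieldRead:
    value: str
    confidence: float  # assumed 0.0-1.0 score reported by the OCR engine


def identify_sender(full_page_text: str) -> Optional[str]:
    """Return the VAT number of the first known sender found in the OCR text."""
    # Drop whitespace so "GB 123 456 789" still matches the stored key.
    compact = re.sub(r"\s+", "", full_page_text).upper()
    for vat in KNOWN_SENDERS:
        if vat in compact:
            return vat
    return None


def merge_reads(zonal: Dict[str, FieldRead],
                full_text: Dict[str, FieldRead]) -> Dict[str, str]:
    """For each field, keep whichever read has the higher confidence."""
    merged: Dict[str, str] = {}
    for name in zonal.keys() | full_text.keys():
        candidates = [r for r in (zonal.get(name), full_text.get(name))
                      if r is not None]
        merged[name] = max(candidates, key=lambda r: r.confidence).value
    return merged


sender = identify_sender("ACME Ltd.\nVAT No: GB 123 456 789\nInvoice ...")
merged = merge_reads(
    {"total": FieldRead("1,200.00", 0.91),
     "invoice_date": FieldRead("12/03/2018", 0.60)},
    {"total": FieldRead("1,2OO.00", 0.55),
     "invoice_date": FieldRead("12/03/2018", 0.88),
     "invoice_number": FieldRead("INV-42", 0.80)},
)
```

The identified sender would then select the user-defined template for that company; the template lookup itself is left out here.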

score 2 · Accepted Answer

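After some more R&D (*), there are now in fact dedicated SDKs with APIs:

First of all, for starters, there is a demo at https://rossum.ai/developers

The whole extraction process can now be automated using the API ( https://docs.api.rossum.ai/ ), like this:

Upload the invoice: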

invoice_file=$1  # path to the invoice PDF
endpoint='https://all.rir.rossum.ai'
curl -H "Authorization: secret_key $ELIS_API_KEY" -X POST -F file="@$invoice_file;type=application/pdf" "$endpoint/document"

Download the results:

invoice_id=$1  # document id of the uploaded invoice
endpoint='https://all.rir.rossum.ai'
curl -H "Authorization: secret_key $ELIS_API_KEY" "$endpoint/document/$invoice_id"

These bash examples come from https://github.com/rossumai/elis-client-examples/
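If you would rather call this from application code than from curl, the download step maps onto a plain HTTP GET with the same Authorization header. Below is a rough sketch using only the Python standard library (in C# the equivalent would go through `HttpClient`). The endpoint and header format are copied from the bash snippets above and may no longer be live; `build_result_request`, the id `abc123` and the key `my-key` are placeholders of mine. The response format is not shown in this answer, so the sketch stops at building the request.

```python
import urllib.request

ENDPOINT = "https://all.rir.rossum.ai"  # endpoint copied from the bash examples


def build_result_request(invoice_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request that fetches extraction results."""
    return urllib.request.Request(
        f"{ENDPOINT}/document/{invoice_id}",
        headers={"Authorization": f"secret_key {api_key}"},
    )


# Placeholder id and key, for illustration only.
request = build_result_request("abc123", "my-key")
# Sending it would look like: urllib.request.urlopen(request).read()
```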

(* Full disclosure: the API is a direct result of my own R&D work at the company ;)
