ruby-on-rails - Extracting products and prices from an invoice

Question

I want to extract information from a PDF.

Below is an excerpt from an insurance policy, converted from PDF to a text document using https://github.com/yob/pdf-reader/.

Vehicle Description          2007, PORSCHE, CAYMAN 3.2

Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


MY COVER DETAILS

Cover                                                                                 USD37.45

I want to extract, for example, the vehicle description and the cost of cover:

vehicle.description => "2007, PORSCHE, CAYMAN 3.2"
vehicle.registration => "USD-2394"
vehicle.cost_of_cover => "37.45"

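Can anyone suggest an appropriate approach? The problem is that the layout of the policy may change, but the data is mostly the same; only the values differ.

If regular expressions are the way to go, could anyone provide sample code?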

score 1 · Accepted Answer

Finding the description

/Vehicle Description((?!Registration$).*)Registration/m

Finding the registration number

/Registration Number((?!Vin$).*)Vin/m

Finding the cost of cover

/Cover(.*)/m

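These are all very lazy regular expressions, in the quick-and-rough sense: you did not provide many different samples, so they are only fitted to the one excerpt you posted. Each capture also includes the surrounding whitespace, so strip the result. They should be enough to get you started.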
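One thing to watch if the layout changes: .* is greedy, so with the /m flag it runs to the last occurrence of the closing word, not the first. The sketch below uses a made-up text (not from the question) in which "Registration" appears twice, and compares the greedy pattern with a non-greedy .*? variant. The non-greedy form is a suggested alternative, not part of the patterns above.

```ruby
# Hypothetical text in which the word "Registration" appears twice.
text = "Vehicle Description  2007, PORSCHE\n" \
       "Registration Number  USD-2394\n" \
       "Registration Date  2013"

# String#[] with a regex and a group index returns that capture group.
greedy = text[/Vehicle Description(.*)Registration/m, 1].strip
lazy   = text[/Vehicle Description(.*?)Registration/m, 1].strip

puts greedy  # runs on past the first "Registration" to the last one
puts lazy    # stops at the first "Registration": 2007, PORSCHE
```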

Example usage:

match = /Vehicle Description((?!Registration$).*)Registration/m.match(PDFTEXT)

http://www.ruby-doc.org/core-2.0/Regexp.html
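To make the usage above concrete, here is a minimal sketch of reading the captured value out of the MatchData object. The variable name pdf_text is an assumption standing in for PDFTEXT, and the sample is the excerpt from the question rather than real pdf-reader output.

```ruby
# Sample text copied from the question; in practice it would come from the PDF.
pdf_text = <<~TEXT
  Vehicle Description          2007, PORSCHE, CAYMAN 3.2

  Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


  MY COVER DETAILS

  Cover                                                                                 USD37.45
TEXT

match = /Vehicle Description((?!Registration$).*)Registration/m.match(pdf_text)

# Regexp#match returns nil when nothing matches, so guard before reading group 1.
description = match ? match[1].strip : nil

puts description  # 2007, PORSCHE, CAYMAN 3.2
```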

score 0 · Accepted Answer

You can do this easily with regular expressions (regexp). Assuming your PDF text is stored in the variable text:

description = text.scan(/Vehicle Description(.*)Registration/m).flatten[0].strip
registration = text.scan(/Registration Number(.*)Vin/m).flatten[0].strip
cover = text.scan(/Cover(.*)/m).flatten[0].strip
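The scan calls above raise NoMethodError when a label is missing, because flatten[0] is then nil, and the cover value keeps its "USD" prefix. As a hedged sketch of one way to get closer to the output asked for in the question, the code below uses per-line patterns and returns nil for missing fields. The names Vehicle, first_capture and parse_policy are hypothetical, and the assumption that the cost is preceded by an optional three-letter currency code comes only from the single sample.

```ruby
Vehicle = Struct.new(:description, :registration, :cost_of_cover)

# Hypothetical helper: first capture group, stripped, or nil if no match.
def first_capture(text, pattern)
  match = pattern.match(text)
  match && match[1].strip
end

def parse_policy(text)
  Vehicle.new(
    # Everything after the label up to the end of that line.
    first_capture(text, /^[ \t]*Vehicle Description[ \t]+(.+)$/),
    # A run of non-space characters, so "Vin Number" is not swallowed.
    first_capture(text, /Registration Number[ \t]+(\S+)/),
    # The number after an optional three-letter currency code (assumed format).
    first_capture(text, /^[ \t]*Cover[ \t]+(?:[A-Z]{3})?[ \t]*(\d+(?:\.\d+)?)/)
  )
end

# Sample text copied from the question.
sample = <<~TEXT
  Vehicle Description          2007, PORSCHE, CAYMAN 3.2

  Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


  MY COVER DETAILS

  Cover                                                                                 USD37.45
TEXT

vehicle = parse_policy(sample)
puts vehicle.description    # 2007, PORSCHE, CAYMAN 3.2
puts vehicle.registration   # USD-2394
puts vehicle.cost_of_cover  # 37.45
```

Because none of these patterns use the /m flag, each one stays on a single line, which makes them less sensitive to fields being reordered. They still assume that a label and its value share a line.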
