ruby - Ruby regex to match a four-item line

Question

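I'm using pdf-reader to read my monthly financial statements. Every line I'm interested in starts with a description, followed by a date ##/##/####, and then two dollar amounts $#.## $#.##.

Like this: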

Gas Station            12/12/2012         $68.00             $485.00

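Sometimes the amounts have parentheses, e.g. $(4.50), to indicate a return or a negative amount. I want every line that fits this "pattern" returned as a 4-item list per line. So I need to match the whole line, allowing for an undetermined amount of whitespace and the occasional parentheses around a price.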

require 'pdf-reader'

reader = PDF::Reader.new("month.pdf")
reader.pages.each do |page|
  # page.text returns the extracted text of the page as one string
  page.text.split("\n").each do |line|
    if line # MATCHING REGEX HERE
      # HANDLE 4 VALUES FROM REGEX
    end
  end
end

For anyone who wants to see how I ended up using the code, the source is at https://github.com/danielpclark/INGdirect_pdf_processor. Feel free to use it in your own projects for processing bank data.

score 2 · Accepted Answer

Try this regex:

(.*)\s+(\d{2}\/\d{2}\/\d{4})\s*(\(?\$\d+\.\d+\)?)\s+(\(?\$\d+\.\d+\)?)

It will give you 4 capture groups:

Description
Date
First amount
Second amount

Here it is on Rubular: http://rubular.com/r/2mcrGZiAOe
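To show how the four groups come back in Ruby, here is a minimal sketch that applies the pattern above to a single line. The helper name `parse_statement_line` is made up for illustration and is not part of the original answer. Note that the greedy `(.*)` keeps the padding before the date, so the sketch strips the description.

```ruby
# Sketch only: parse_statement_line is a hypothetical helper name.
def parse_statement_line(line)
  pattern = /(.*)\s+(\d{2}\/\d{2}\/\d{4})\s*(\(?\$\d+\.\d+\)?)\s+(\(?\$\d+\.\d+\)?)/
  match = line.match(pattern)
  return nil unless match

  description, date, amount_1, amount_2 = match.captures
  # The greedy (.*) keeps the trailing padding, so strip the description.
  [description.strip, date, amount_1, amount_2]
end

parse_statement_line("Gas Station            12/12/2012         $68.00             $485.00")
# => ["Gas Station", "12/12/2012", "$68.00", "$485.00"]
parse_statement_line("Statement period: December")
# => nil
```

Returning `nil` for lines that do not match lets the caller skip them with a plain `if`.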

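You can also use named captures, which are more elegant (together with the x modifier, which lets you spread the regex over multiple lines):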

if line_match = line.match(/
    (?<description>.*)\s+
    (?<date>\d{2}\/\d{2}\/\d{4})\s*
    (?<amount_1>\(\$\d+\.\d+\)|\$\d+\.\d+)\s+
    (?<amount_2>\(\$\d+\.\d+\)|\$\d+\.\d+)/x)
  # now you can use: line_match[:date], line_match[:amount_1], etc.
end
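One thing to watch: the patterns in this answer accept parentheses around the whole amount, as in `($4.50)`, while the question writes negatives as `$(4.50)`. The following is a rough sketch, not the answerer's code, of a named-capture variant that accepts both spellings and turns each amount into signed integer cents. The helper names `amount_to_cents` and `parse_named_line` are hypothetical, and it assumes amounts always have exactly two decimal places.

```ruby
# Sketch only: both helper names are hypothetical.
# Assumes every amount has exactly two decimal places.
def amount_to_cents(text)
  negative = text.include?("(")            # parentheses mark a negative amount
  dollars, cents = text.delete("$()").split(".")
  value = dollars.to_i * 100 + cents.to_i
  negative ? -value : value
end

def parse_named_line(line)
  # Accepts ($4.50), $(4.50) and $4.50
  amount = /\(\$\d+\.\d{2}\)|\$\(\d+\.\d{2}\)|\$\d+\.\d{2}/
  match = line.match(/
    ^(?<description>.*?)\s+
    (?<date>\d{2}\/\d{2}\/\d{4})\s+
    (?<amount_1>#{amount})\s+
    (?<amount_2>#{amount})
  /x)
  return nil unless match

  {
    description: match[:description],
    date:        match[:date],
    amount_1:    amount_to_cents(match[:amount_1]),
    amount_2:    amount_to_cents(match[:amount_2])
  }
end

parse_named_line("Reimbursement 01/01/2012 $(68.00) $(485.00)")
# => {description: "Reimbursement", date: "01/01/2012", amount_1: -6800, amount_2: -48500}
```

Integer cents are used here only to avoid floating-point rounding; keeping the raw strings works just as well if you only need to display them.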

score 1 · Accepted Answer

String#scan is a good way to go after data like this:

string = 'This is some text
Gas Station   12/12/2012 $68.00   $485.00
This some more text
Reimbursement 01/01/2012 $(68.00) $(485.00)
'

string.scan(%r{^(.+?) \s+ (\d{1,2}/\d{1,2}/\d{4}) \s+ ([$()\d.]+) \s+ ([$()\d.]+) }x)
[
    [0] [
        [0] "Gas Station",
        [1] "12/12/2012",
        [2] "$68.00",
        [3] "$485.00"
    ],
    [1] [
        [0] "Reimbursement",
        [1] "01/01/2012",
        [2] "$(68.00)",
        [3] "$(485.00)"
    ]
]
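If you would rather work with labelled fields than nested arrays, the scan result can be zipped with a list of keys. This is only a sketch built on the same pattern as above; `extract_rows` and the key names are invented for illustration.

```ruby
# Sketch only: extract_rows and the key names are hypothetical.
def extract_rows(text)
  pattern = %r{^(.+?) \s+ (\d{1,2}/\d{1,2}/\d{4}) \s+ ([$()\d.]+) \s+ ([$()\d.]+)}x
  keys = %i[description date amount_1 amount_2]
  # scan returns one array of captures per matching line
  text.scan(pattern).map { |fields| keys.zip(fields).to_h }
end

statement = <<~'TEXT'
  This is some text
  Gas Station   12/12/2012 $68.00   $485.00
  This some more text
  Reimbursement 01/01/2012 $(68.00) $(485.00)
TEXT

rows = extract_rows(statement)
rows.length                # => 2
rows.first[:description]   # => "Gas Station"
rows.last[:amount_1]       # => "$(68.00)"
```

Because `scan` runs over the whole string, there is no need to split the page text into lines first.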
