
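I'm using the prawn gem to read a computer-generated, 60-page PDF report that contains financial and demographic data for dozens of people. My challenge is that, as I scan each line, I want to capture the name and special ID (which sit on the same line) together with the subsequent lines that belong to that person. Using Ruby's String#scan method, I've been able to capture the financial data, with each matching line returned like this: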

[<invoice no.>, <service type>, <modifier (if any)>, <service_date>, <units>, <amount>]

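I've tried attaching the ID to the financial lines that follow it and switching to the next ID when it changes, but nothing has worked. Am I going about this backwards? I have very little experience with regular expressions (or with programming in general).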

Here is the code that works for the financial data only:

require 'pdf-reader'

PDF::Reader.new(file).pages.each do |page|
  page.raw_content.scan(/^\(\s(\d{6})\s+\d\s+(\w\d{4})\s+(0580|TT|1C|1C\s+1F)?\s+(\d+\/\d+\/\d+)\s+\d+\/\d+\/\d+\s+(\d+\.\d+)\s+(\d+\.\d+)/) do |line|
    # Trim each captured field in place; the optional modifier group may be nil.
    line.each { |x| x.strip! unless x.nil? }
    puts line.join(' ')
    Cycle.check_details(line)
  end
end

Here is a sample of what puts page.raw_content produces (the lines contain a lot of whitespace):

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JAIME         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  887.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
   434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JOFFREY         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  259.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
   434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

2 Answers


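Not everything can be parsed with regular expressions alone. Sometimes they only become useful after you've broken the data into manageable chunks. Your data is an example of that second case: once it has been broken down a bit, the individual lines are easy to parse.

Your data is obfuscated, but the sample shows enough of its structure. Once the leading ( and trailing )' are stripped, the code splits the text into individual lines with split, then uses slice_before to break those lines into logical blocks. Once the blocks are gathered, each one can be processed in a sane way: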

require 'pp'

data = "(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JAIME         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  887.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv_num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv_num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
  434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JOFFREY         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  259.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv_num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv_num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
  434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'
"

# Strip the "(" and ")'" wrappers, drop blank lines, then start a new chunk at every "REG" header.
lines = data.gsub(/^\(|\)'$/, '')
            .split("\n")
            .map(&:strip)
            .reject(&:empty?)
            .slice_before(/^REG\b/)

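At this point, lines is an Enumerator that yields one array per record (call to_a on it and you get an array of arrays). Each sub-array holds the block of lines that starts with "REG". Every time slice_before sees a new line matching /^REG\b/, it starts a new sub-array/chunk. An Enumerator is a preliminary object, similar to what you get before pulling an array, or individual key/value pairs, out of a hash. You can iterate over an Enumerator, which is what we want to do: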
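To make the chunking step concrete, here is a minimal sketch of slice_before on a tiny, made-up array (the strings in rows are invented for illustration and are not taken from the report):

```ruby
rows = ["REG a", "x 1", "x 2", "REG b", "y 1"]

# slice_before returns an Enumerator; the grouping happens when it is iterated.
chunks = rows.slice_before(/^REG\b/)

# to_a forces the enumeration: one sub-array per line that matches the pattern.
grouped = chunks.to_a
p grouped
```

Each element that matches the pattern opens a new chunk, so the header line always ends up at index 0 of its sub-array.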

patient_data = lines.map { |sub_ary|
  # The second line of each block holds the patient's name and special ID.
  sub_ary[1][/(?:\S+ \s+ ){4} (\S+, \s+ \S+) \s+ (?:\S+ \s+){2} (.+)$/x]
  patient_name, special_id = $1, $2

  # Invoice rows start at index 4 (the blank "( )'" line was rejected earlier)
  # and end before the two CLAIM TOTAL lines.
  invoice_info = sub_ary[4..-3].map{ |line|
    line[/^(\S+) \s+ \S+ \s+ (\S+) \s+ (\S+)/x]
    [$1, $2, $3]
  }

  {
    patient_name: patient_name,
    special_id:   special_id,
    invoice_info: invoice_info
  }
}

pp patient_data

Which outputs:

[{:patient_name=>"LANNISTER, JAIME",
  :special_id=>"<special ID>",
  :invoice_info=>
  [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]},
{:patient_name=>"LANNISTER, JOFFREY",
  :special_id=>"<special ID>",
  :invoice_info=>
  [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]}]

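This gets you close, but it doesn't completely solve the problem. I've deliberately left it to you to work out how to modify the code to pull all the fields you want out of each record.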
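As a hint rather than a finished solution, here is one possible sketch of pulling all six fields and tagging each invoice row with its patient's ID. It makes several assumptions: the wrappers have already been stripped, every value in data is invented, the row layout follows the column headers in the report, and whatever sits between the procedure code and the first date is lumped together as the "modifier". It uses split with a splat instead of a long regular expression.

```ruby
# Made-up records in the already-cleaned format; none of these values are real.
data = <<~REPORT
  REG  LOC   CLIENT   SERVICE   NAME     BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #
  01   002  00003     0000004    LANNISTER, JAIME         01/02/1970   0005 AB12345
  DIAGNOSIS CODES:  887.0
  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT
  123456       1    T1019  TT                    07/01/13  07/07/13     4.00     65.60
  123456       2    T1019                        07/08/13  07/14/13     2.50     41.00
  CLAIM TOTAL
  106.60   CLAIM ACCOUNT REF.  000000000000000SUP

  REG  LOC   CLIENT   SERVICE   NAME     BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #
  01   002  00007     0000008    LANNISTER, JOFFREY         03/04/1999   0009 CD67890
  DIAGNOSIS CODES:  259.0
  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT
  654321       1    T1019  1C 1F                 07/01/13  07/07/13     4.00     65.60
  CLAIM TOTAL
  65.60   CLAIM ACCOUNT REF.  000000000000000SUP
REPORT

chunks = data.lines.map(&:strip).reject(&:empty?).slice_before(/^REG\b/)

records = chunks.flat_map do |chunk|
  # The name and special ID sit on the second line of each chunk.
  m = chunk[1].match(/(?:\S+\s+){4}(\S+,\s+\S+)\s+(?:\S+\s+){2}(.+)$/)
  name, special_id = m[1], m[2]

  # Invoice rows are everything between the "INV #" header and "CLAIM TOTAL".
  rows = chunk.drop_while { |l| l !~ /^INV\b/ }.drop(1)
              .take_while { |l| l !~ /^CLAIM TOTAL/ }

  rows.map do |row|
    # The splat soaks up zero, one or two modifier tokens in the middle.
    invoice, _line_no, service, *modifier, from, _thru, units, amount = row.split
    { special_id: special_id, name: name, invoice: invoice, service: service,
      modifier: modifier.empty? ? nil : modifier.join(' '),
      service_date: from, units: units.to_f, amount: amount.to_f }
  end
end

records.each { |r| p r }
```

Because every hash carries special_id, the rows stay tied to their patient even after they are flattened into a single list.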

answered 2013-08-05T21:18:37.470

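If you want to test your regular expressions, have a look at http://rubular.com/

It's a very useful tool, and the bottom of the page lists most of the basics of regular expression syntax.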
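The same kind of check can also be run locally. This is only a sketch: the sample row is invented (the question shows placeholders, not real values), and the pattern is the one from the question with named groups added so each capture is labeled.

```ruby
# Hypothetical row in the raw_content format; the values are made up.
sample = "( 123456       1    T1019  TT                    07/01/13  07/07/13     4.00     65.60)'"

pattern = /^\(\s(?<invoice>\d{6})\s+\d\s+(?<service>\w\d{4})\s+(?<modifier>0580|TT|1C\s+1F|1C)?\s+(?<from>\d+\/\d+\/\d+)\s+\d+\/\d+\/\d+\s+(?<units>\d+\.\d+)\s+(?<amount>\d+\.\d+)/

match = pattern.match(sample)
if match
  # Print every named group next to the text it captured.
  match.names.each { |name| puts "#{name}: #{match[name].inspect}" }
else
  puts "no match"
end
```

Seeing which group captured what makes it much easier to spot an alternative or a \s+ that is swallowing the wrong column.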

answered 2013-08-05T19:50:15.253