
I need to parse some legal documents to find the addresses in them. Here is an example:

quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

tmp = test.scan(/(\d{3,6})(.*?)(\d{5})/)
tmp.each do |t|
  puts t.join()
end

Usually an address starts with a number and ends with a zip code, but that is not always the case in these documents.

The problem is that I miss some addresses and get some unwanted results, for example:

9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum 11111

What I want is an array of the following 4 items:

123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK

As for the last item, I am fairly sure that every address formatted like this will end with "NEW YORK" or "NY".

I think the pattern I am after is:

/(ANY DIGITS BETWEEN 3 AND 6)(AT LEAST 3 WORDS BUT NOT MORE THAN 10)((TRY FIRST ZIPCODE)|(IF NO ZIP CODE THEN TRY "NEW YORK" OR "NY"))/i

Any help would be greatly appreciated.


2 Answers


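Here is how I parse information out of legal text:

  1. Break the complex task into simpler ones. Write a regular expression (or a function that uses one) for each address variant you want to capture.

  2. Write test cases for each variant. As an example, here are a couple of tests I wrote for a number parser.

    test '554' do
      assert_equal 554, number_parser.parse('five hundred fifty-four')
    end

    test '1301' do
      assert_equal 1301, number_parser.parse('one thousand three hundred one')
    end

  3. Since you know the range of possible values for some fields (state names and state abbreviations, for example), you can build that knowledge into your functions to handle the variations.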
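
To make the three steps concrete, here is a minimal sketch of what "one regex per variant" could look like for the two address shapes in the question. All constant and method names are hypothetical, the state list is deliberately partial, and the limits (3 to 6 digits, at most 10 tokens) are taken from the pseudo-pattern in the question rather than from any rule about real addresses.

```ruby
# Hypothetical names throughout; the state list is deliberately partial.
STATES = ['NEW YORK', 'NY'].freeze
STATE_RE = Regexp.union(STATES)

# Variant 1: house number, up to 10 tokens, then a 5-digit zip code.
# The lazy quantifier stops at the first zip code it can reach.
ZIP_VARIANT = /\b\d{3,6}\b(?:\s+\S+){1,10}?\s+\d{5}\b/

# Variant 2: house number, up to 10 tokens, then a known state name.
STATE_VARIANT = /\b\d{3,6}\b(?:\s+\S+){1,10}?\s+#{STATE_RE}\b/

# With no capture groups, scan returns the matched strings directly.
def zip_addresses(text)
  text.scan(ZIP_VARIANT)
end

def state_addresses(text)
  text.scan(STATE_VARIANT)
end

p zip_addresses('laboris 123 some ave 12 st, some city, NY, 10005')
p state_addresses('126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore')
```

Keeping each variant in its own function means each one can get its own test cases, and a new address shape becomes a new function rather than another branch in one large pattern.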
answered 2013-10-06T05:28:12.357

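As michaelmichael and stackoverflow.com/questions/9397485/regex-street-address-match point out, there is really no way to scan for addresses reliably, least of all when the documents contain a lot of typos, as in the original example.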

So I split it into two parts.

First, a function that scans for patterns that look like addresses.

# First scan for possible addresses
def look_for_address_patterns(txt)
  resp = []
  # Look for a number 2-6 digits long (similar to a house number),
  # then grab the next 1-15 whitespace-separated tokens,
  # ending at either 5 digits (zip code) or the state name / abbreviation.
  # The repeated token group is non-capturing so that join does not
  # append the last token a second time.
  scan = txt.scan(/(\d{2,6})(\s*(?:\S+\s+){1,15})((?:\d{5})|(?:NEW YORK|NY))/)
  scan.each do |s|
    resp.push s.join()
  end
  # Go to step 2 to verify the addresses before returning anything
  verify_address(resp)
end
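
One detail worth knowing when joining the groups returned by `scan`: a capturing group that sits inside a repetition keeps only its last iteration, and `join` then appends that iteration again. The sketch below shows the difference on a sample string from the question; the variable names are illustrative only, and the pattern is shortened to the zip-code case.

```ruby
text = '123 some ave 12 st, some city, NY, 10005'

# Same pattern twice; only the repeated token group differs.
capturing     = /(\d{2,6})(\s*(\S+\s+){1,15})(\d{5})/
non_capturing = /(\d{2,6})(\s*(?:\S+\s+){1,15})(\d{5})/

# scan yields one array of groups per match; join glues them together.
with_dup    = text.scan(capturing).map(&:join)
without_dup = text.scan(non_capturing).map(&:join)

puts with_dup     # the last token ("NY, ") appears twice
puts without_dup  # the address as it appears in the text
```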

Now we use a service such as Google, MapQuest, or Yahoo to verify the addresses.

require 'json'
require 'open-uri'

def verify_address(arry)
  verified = []
  arry.each do |addr|
    # Escape the address so spaces and commas are safe in the query string
    url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + URI.encode_www_form_component(addr)
    response = JSON.parse(URI.open(url).read)
    # 'results' is an array of candidates; use the first one, skip if there is none
    result = (response['results'] || []).first
    next if result.nil?
    formatted = result['formatted_address']
    # Check that the response starts like the address we sent. Dropping "SW" or
    # changing "Lane" to "Ln" is ok, but anything else is probably a different address.
    matched = addr.downcase[0..8] == formatted.downcase[0..8]
    # should be storing more info like lat / lng but that is for a later project
    verified.push(formatted) if matched
  end
  verified
end
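
The prefix comparison is strict about differences like a dropped "SW" or "Lane" becoming "Ln". One possible refinement, sketched here as an assumption rather than something I have in production, is to normalize both strings before comparing. The helper names, the abbreviation table, and the example addresses are all made up for illustration, and the direction list will also strip legitimate single-letter tokens.

```ruby
# Hypothetical helpers; the tables below are small illustrative samples.
ABBREVIATIONS = { 'lane' => 'ln', 'street' => 'st', 'avenue' => 'ave' }.freeze
DIRECTIONS = %w[n s e w ne nw se sw].freeze

# Lowercase, drop punctuation and compass directions, shorten street types.
def normalize_address(addr)
  tokens = addr.downcase.gsub(/[^a-z0-9\s]/, ' ').split
  tokens = tokens.reject { |t| DIRECTIONS.include?(t) }
  tokens.map { |t| ABBREVIATIONS.fetch(t, t) }.join(' ')
end

# Compare only the first few characters, as verify_address does.
def same_address_start?(candidate, formatted, length = 9)
  normalize_address(candidate)[0, length] == normalize_address(formatted)[0, length]
end

puts same_address_start?('123 SW Maple Lane, some city, NY',
                         '123 Maple Ln, Some City, NY 10005, USA')
puts same_address_start?('123 Maple Lane, some city, NY',
                         '456 Oak St, Some City, NY 10005, USA')
```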

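That is what I have so far. The first part works reasonably well, but it produces both false positives and false negatives (in some cases it misses an address entirely). The second part helps weed out the false positives and returns better-formatted addresses (the addresses in legal documents are not always well formatted).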

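The result captures about 85% of all the addresses in the documents, which is acceptable for my project. I am sure that with some fine-tuning I could raise that number, so Regex Masters, please feel free to chime in.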

answered 2013-10-07T12:34:06.590