I need to parse some legal documents to find the addresses in them. Here is an example:
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
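# Scan for: a 3-6 digit number, the shortest run of text, then a 5-digit zip code.
tmp = test.scan(/(\d{3,6})(.*?)(\d{5})/)
tmp.each do |t|
  # Each match is an array of the three captured groups; join them back together.
  puts t.join
end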
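Usually an address starts with a number and ends with a zip code, but that is not always the case in these documents.
The problem is that I miss some addresses and get some unwanted results, for example: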
9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum 11111
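As far as I can tell, the unwanted results come from how the match is anchored: the regex starts at the leftmost number it can find, and the lazy `.*?` only keeps the match short from that starting point onward, so it runs on until the next 5-digit number. A small sketch of that behavior, using made-up sample strings rather than my real documents:

```ruby
pattern = /(\d{3,6})(.*?)(\d{5})/

# Hypothetical sample: a stray number appears before the real address.
# The match starts at "9999", not at "123".
runaway_start = "9999 foo 123 bar st, NY, 10005".scan(pattern)
p runaway_start

# Hypothetical sample: the address has no zip code, so the match keeps
# going until it reaches the next 5-digit number in the text.
runaway_end = "126 main st, NEW YORK and more text 11111".scan(pattern)
p runaway_end
```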
What I want is an array of the following 4 items:
123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK
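As for the last item, I'm fairly sure that all addresses formatted like this will end with "NEW YORK" or "NY".
I think the pattern I am aiming for is: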
/(ANY DIGITS BETWEEN 3 AND 6)(AT LEAST 3 WORDS BUT NOT MORE THAN 10)((TRY FIRST ZIPCODE)|(IF NO ZIP CODE THEN TRY "NEW YORK" OR "NY"))/i
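A rough Ruby sketch of that pattern is below. It is only an approximation under a few assumptions of mine: a "word" is a run of letters, digits, underscores, or dots, words are separated by whitespace with an optional comma, and when a zip code directly follows the state it is included in the match. The names `address_re` and `text` and the sample text (stitched together from the output above) are just for illustration.

```ruby
address_re = /
  \b\d{3,6}\b                 # street number: 3 to 6 digits
  (?:,?\s+[\w.]+){3,10}?      # 3 to 10 words, as few as possible
  ,?\s+
  (?:
    (?:NEW\s+YORK|NY)\b       # state name or abbreviation
    (?:,?\s+\d{5}\b)?         # plus the zip code when one follows
    |
    \d{5}\b                   # or a bare zip code
  )
/xi

# Hypothetical sample text assembled from the lines shown earlier.
text = <<~TEXT
  9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
  tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
  quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005
  nisi ut aliquip 124 some ave 12 st, some city, NY, 10005 ex ea commodo
  125 some ave 12 st, some city, NY, 10005 consequat
  126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad
  minim veniam, quis nostrud exercitation ullamco laboris 11111
TEXT

# With no capture groups, scan returns the full matched strings.
addresses = text.scan(address_re)
puts addresses
```

Capping the word count at 10 is what stops a stray number such as 9999 from swallowing a whole paragraph, and making the word group lazy lets the match stop at the first "NY" or "NEW YORK" it reaches. I have only tried this on the sample above, so real documents may well need a looser definition of a word.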
Any help would be greatly appreciated.