
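I know it can't be perfect, but I'm not very good with regular expressions, and I'm having a hard time getting a better match percentage.

I have a file with over 9 million rows, and the addresses are very inconsistent. I was wondering if I could get some help from people here who are better at this than I am. Any help would be greatly appreciated.

This is what I have so far. I think the best way to approach this is to match the pattern from the end of the string, since APT, BX, PO BOX, etc. may appear at the beginning of the string.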

/(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$/

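Here are a few patterns I can see. The runs of spaces are exactly as they appear in the file. I have tried splitting on 2 or more spaces, as well as the regex I have so far.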

F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS ZIP         CITY STATE

ADDRESS        CITY STATE

ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY STATE

APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY STATE

P O BOX #             ADDRESS        CITY STATE

APT DIGIT#         ADDRESS CITY STATE 

SPACE DIGIT    ADDRESS      CITY STATE

UNIT #         ADDRESS     CITY STATE

SP DIGIT          ADDRESS      CITY STATE

DIGITS-DIGITS ADDRESS       CITY STATE

BX DIGIT       ADDRESS         CITY STATE

ADDRESS     APT #      CITY STATE

ADDRESS       UNIT #     CITY STATE

ADDRESS   P O BOX   DIGIT     CITY STATE

P O B O X    DIGIT      CITY STATE

P O BOX DIGIT    CITY      STATE

ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY STATE
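As a rough illustration of the "split on 2 or more spaces, then work from the end" idea, here is a minimal Python sketch. The function name `split_from_end`, the two helper patterns, and the sample lines are hypothetical and not taken from the real file; it also assumes the state is always exactly two letters, which may not hold for every row.

```python
import re

# Hypothetical helpers: a bare two-letter state, and "CITY ST" in one column.
STATE_RE = re.compile(r"^[A-Za-z]{2}$")
CITY_STATE_RE = re.compile(r"^(?P<city>.*\S)\s+(?P<state>[A-Za-z]{2})$")


def split_from_end(line):
    """Split a line on runs of 2+ spaces and read city/state from the end.

    Returns a dict with the leftover leading columns under "rest",
    or None when the tail does not look like CITY STATE.
    """
    cols = re.split(r"\s{2,}", line.strip())
    if cols == [""]:
        return None  # blank line

    # Common case: the last column holds "CITY ST" separated by one space.
    m = CITY_STATE_RE.match(cols[-1])
    if m:
        return {"rest": cols[:-1], "city": m.group("city"), "state": m.group("state")}

    # Fallback: city and state were themselves separated by 2+ spaces.
    if STATE_RE.match(cols[-1]) and len(cols) >= 2:
        return {"rest": cols[:-2], "city": cols[-2], "state": cols[-1]}

    return None


if __name__ == "__main__":
    # Made-up sample rows that follow the patterns listed above.
    print(split_from_end("APT 4               12 MAIN ST W       SPRINGFIELD IL"))
    print(split_from_end("P O BOX 345    CITY      ST"))
```

Whatever ends up in `rest` (unit, PO box, street address, in either order) would still need its own classification step; the sketch only peels off the part of the line that is most regular.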

2 Answers


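This is a fairly complex problem, and unfortunately there is no simple solution.

That said, you could try the following regular expression, which is admittedly far from perfect: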

^.*?(?<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?(?<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)\s{2,}(?<city>\b(?:\w|\s(?=\S))+\b)\s{1,}(?<state>\b\w{2,3}\b)(?:$|\r|\n)


The capture groups are: group 1 = address; group 2 = zip; group 3 = city; group 4 = state.

Input. Note that I changed STATE to st, ZIP to 12345, and some of the PO box DIGIT placeholders to actual numbers:

F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st
ADDRESS        CITY st
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
P O BOX # 1234            ADDRESS        CITY st
APT DIGIT#         ADDRESS CITY st
SPACE DIGIT    ADDRESS      CITY st
UNIT #         ADDRESS     CITY st
SP DIGIT          ADDRESS      CITY st
DIGITS-DIGITS ADDRESS       CITY st
BX DIGIT       ADDRESS         CITY st
ADDRESS     APT #      CITY st
ADDRESS       UNIT #     CITY st
ADDRESS   P O BOX   3245     CITY st
P O B O X    123      CITY st
P O BOX 345    CITY      st
ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY st

Matches

[0] => Array
(
    [0] => F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st
    [1] => ADDRESS        CITY st
    [2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
    [3] => APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
    [4] => P O BOX # 1234            ADDRESS        CITY st
    [5] => APT DIGIT#         ADDRESS CITY st
    [6] => SPACE DIGIT    ADDRESS      CITY st
    [7] => UNIT #         ADDRESS     CITY st
    [8] => SP DIGIT          ADDRESS      CITY st
    [9] => DIGITS-DIGITS ADDRESS       CITY st
    [10] => BX DIGIT       ADDRESS         CITY st
    [11] => ADDRESS     APT #      CITY st
    [12] => ADDRESS       UNIT #     CITY st
    [13] => ADDRESS   P O BOX   DIGIT     CITY st
    [14] => P O B O X    123      CITY st
    [15] => P O BOX 345    CITY      st
    [16] => ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY st
)

[address] => Array
(
    [0] => ADDRESS 12345
    [1] => ADDRESS
    [2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
    [3] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
    [4] => ADDRESS
    [5] => APT DIGIT#
    [6] => ADDRESS
    [7] => ADDRESS
    [8] => ADDRESS
    [9] => DIGITS-DIGITS ADDRESS
    [10] => ADDRESS
    [11] => APT #
    [12] => UNIT #
    [13] => DIGIT
    [14] => 123
    [15] => P O BOX 345
    [16] => SPACE/SP/SPC/UNIT DIGIT
)

[zip] => Array
    (
        [0] => 12345
        [1] => 
        [2] => 
        [3] => 
        [4] => 
        [5] => 
        [6] => 
        [7] => 
        [8] => 
        [9] => 
        [10] => 
        [11] => 
        [12] => 
        [13] => 
        [14] => 
        [15] => 
        [16] => 
    )

[city] => Array
(
    [0] => CITY
    [1] => CITY
    [2] => CITY
    [3] => CITY
    [4] => CITY
    [5] => ADDRESS CITY
    [6] => CITY
    [7] => CITY
    [8] => CITY
    [9] => CITY
    [10] => CITY
    [11] => CITY
    [12] => CITY
    [13] => CITY
    [14] => CITY
    [15] => CITY
    [16] => CITY
)


[state] => Array
(
    [0] => st
    [1] => st
    [2] => st
    [3] => st
    [4] => st
    [5] => st
    [6] => st
    [7] => st
    [8] => st
    [9] => st
    [10] => st
    [11] => st
    [12] => st
    [13] => st
    [14] => st
    [15] => st
    [16] => st
)
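For readers working in Python, the pattern can be ported roughly as follows. This is only a sketch: Python's `re` module spells named groups `(?P<name>...)`, so the group syntax is changed accordingly, and the names `ADDRESS_RE` and `parse_line` are illustrative. Unmatched optional groups come back as `None` rather than an empty string, and results for some of the sample lines may differ from the listing above, so it is worth checking the output against your own data.

```python
import re

# Port of the pattern above; only the named-group syntax is changed.
ADDRESS_RE = re.compile(
    r"^.*?(?P<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?"
    r"(?P<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)"
    r"\s{2,}(?P<city>\b(?:\w|\s(?=\S))+\b)"
    r"\s{1,}(?P<state>\b\w{2,3}\b)(?:$|\r|\n)",
    re.MULTILINE,
)


def parse_line(line):
    """Return the named groups for one input line, or None if nothing matches."""
    m = ADDRESS_RE.search(line)
    return m.groupdict() if m else None


if __name__ == "__main__":
    samples = [
        "F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st",
        "ADDRESS        CITY st",
    ]
    for sample in samples:
        print(parse_line(sample))
```

Compiling the pattern once and calling `parse_line` per row keeps the cost per line low, which matters for a file with millions of rows.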

I recommend taking a look at question 11160192.

answered 2013-06-14T20:40:32.933

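I think Denomales' answer is sufficient for what you need, but I'm going to expand my comment above into an answer because I think some parts of it are relevant to your specific problem.

Are they US addresses? You could try using an API or a tool to extract the addresses in bulk. Another recent Stack Overflow answer shows an example of such a tool, run against a small sample of addresses to match.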


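For disclosure, I work at SmartyStreets and helped develop this. While it isn't designed specifically for spreadsheet or tabular address data, it is designed for non-uniform input such as free-form text. You can even feed millions of rows through the service in chunks.

Maybe this will help, since it also validates the addresses after finding them in the text. As you've discovered, addresses are really messy, and a specialized tool can sometimes be the best way to deal with them. I'm not saying this is the right answer for your case, but hopefully it is still informative.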
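Feeding a very large file to any extraction tool in chunks could look something like the sketch below. It only shows the batching step; the name `batched_lines` and the batch size are illustrative, and no particular service call is shown because that depends on the tool you pick.

```python
from itertools import islice


def batched_lines(lines, batch_size):
    """Yield lists of at most batch_size items from any iterable of lines."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # A file object can be passed directly, so the whole file is never
    # held in memory; a small list stands in for it here.
    for batch in batched_lines(["row 1", "row 2", "row 3"], batch_size=2):
        print(len(batch), "rows ready to submit")
```

Because the generator pulls lines lazily, the same loop works for a handful of rows or for the full 9 million.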

answered 2013-06-15T16:21:50.577