php - Extracting data from an unstructured string with regular expressions in PHP

Question

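I need to extract data from an unstructured string that comes from an SMS.

The data I need to extract is the following:

Code: a 5-character alphanumeric string that must contain at least one digit.

Identity document: a numeric string of 5 to 8 characters. Valid formats are: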

V55555555
E55555555
55555
55 555
E55 555 555
55 555 555
5 555 555
555 555

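The data I need to extract can be anywhere in the string. I have already normalized the string: repeated spaces are replaced with a single one, and anything that is not a space, a digit, or a letter is removed.

Samples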

1. resuelvete 15C20 Pdero Perez c.i. V55.555.555,
2. Pedro Perez resuelvete 15c20 55 555 555,
3. 15c20 Resuelvete 555555 Pedro Perez,
4. Resuelvete 555555 Pedro Perez 15c20
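
The normalization described above could look roughly like the following. This is only a sketch: the function name normalizeSms is hypothetical, and it assumes plain ASCII letters, so accented characters would be stripped as well.

```php
<?php
// Hypothetical helper, not part of the original post.
function normalizeSms(string $raw): string
{
    // Remove everything that is not an ASCII letter, a digit, or a space.
    $clean = preg_replace('/[^a-zA-Z0-9 ]+/', '', $raw);
    // Collapse runs of spaces into a single space and trim both ends.
    return trim(preg_replace('/ {2,}/', ' ', $clean));
}

echo normalizeSms('resuelvete 15C20 Pdero Perez c.i. V55.555.555,'), PHP_EOL;
// prints: resuelvete 15C20 Pdero Perez ci V55555555
```

Note that after this step the dotted form V55.555.555 from sample 1 becomes V55555555, which is one of the valid formats listed above.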

For the code part, I tried this regex:

$regex = '/([a-zA-Z0-9]{5})/i';

I also tried the following one, which I saw here (I must admit I don't fully understand this regex):

$regex = '(?=.{5})(?=.*[A-Z])(?=.*[a-z])(?=.*\d)[a-zA-Z\d]';

But it doesn't work: it returns the first five characters of the string, whereas I need it to return 15c20 in these examples.

For the identity document part, I tried the following:

// This does not work with spaces
$regex = "/(V|E)?(\d{5,8})/i";

// This does not work without spaces
// It fails in the first case, returning only 7 digits instead of 8
// It also fails in cases 3 and 4, where it does not match anything
$regex = "/(V|E)?(\d{1,2}? ?\d{3} ?\d{3})/i";

score 1 · Accepted Answer

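This should work for the code part (note that I am assuming the code must also contain at least one letter, otherwise you could not tell a code from an identity document in the ##### case):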

// Each lookahead is limited to alphanumerics, so it only inspects the current word
$code_pattern = '/\b(?=[a-zA-Z\d]*\d)(?=[a-zA-Z\d]*[a-zA-Z])[a-zA-Z\d]{5}\b/';

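Note that the (?=...) syntax is what is called a positive lookahead. It asserts that the text coming up at that point matches the enclosed pattern, without counting those characters as part of the match.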
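
As a rough illustration of how such a lookahead-based pattern might be applied with preg_match, here is a small sketch. The helper name extractCode is hypothetical, and the lookaheads are restricted to alphanumeric characters so that they cannot look past the current word.

```php
<?php
// Hypothetical helper: returns the first 5-character code found, or null.
function extractCode(string $sms): ?string
{
    // Two lookaheads: the word must contain a digit and a letter;
    // the main part then requires exactly 5 alphanumerics between boundaries.
    $pattern = '/\b(?=[a-zA-Z\d]*\d)(?=[a-zA-Z\d]*[a-zA-Z])[a-zA-Z\d]{5}\b/';
    return preg_match($pattern, $sms, $m) === 1 ? $m[0] : null;
}

var_dump(extractCode('Pedro Perez resuelvete 15c20 55 555 555')); // string(5) "15c20"
var_dump(extractCode('Pedro Perez 55555'));                       // NULL
```

The second call returns NULL because, under the assumption stated above, a code needs at least one letter, so a bare 55555 is left for the identity patterns.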

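For the identity part, I would keep it simple (that is, not look for a single regex that fits every case) and pass a series of patterns to your preg_* functions.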

$identity_patterns = array(
    '/\b(V|E)[0-9]{8}\b/', // V########, E########
    '/\b[\d]{5}\b/', // #####
    '/\bE[\d]{2}\s[\d]{3}\s[\d]{3}\b/', // E## ### ### (\s matches the separating space)
    '/\b[\d]{1,3}\s[\d]{3}(\s[\d]{3})?\b/' // #{1,3} ### (###)?
);

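All of these could of course be merged into a single regular expression, but that would make it much harder to read and to modify later if needed.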
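
One way to use a series of patterns is to try them in order and stop at the first hit. The sketch below assumes that approach; the function name findIdentity is hypothetical, and the patterns are adapted from the list above with \s as the separator between digit groups.

```php
<?php
// Hypothetical helper: tries each pattern in order, returns the first match or null.
function findIdentity(string $text): ?string
{
    $patterns = [
        '/\b[VE]\d{8}\b/',                  // V########, E########
        '/\b\d{5}\b/',                      // #####
        '/\bE\d{2}\s\d{3}\s\d{3}\b/',       // E## ### ###
        '/\b\d{1,3}\s\d{3}(?:\s\d{3})?\b/', // #{1,3} ### (###)?
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $text, $m) === 1) {
            return $m[0];
        }
    }
    return null;
}

var_dump(findIdentity('Pedro Perez resuelvete 15c20 55 555 555')); // string(10) "55 555 555"
var_dump(findIdentity('15c20 Resuelvete 555555 Pedro Perez'));     // NULL
```

The order matters: the E## ### ### pattern has to be tried before the generic grouped one, which would otherwise pick up only the trailing 555 555. The second call shows that an unspaced 6-digit ID such as 555555 matches none of these patterns; widening the second pattern to \d{5,8} might be one way to cover it, if that fits the data.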

score 0 · Accepted Answer

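[Edit] I changed every \w in the regex patterns below to [a-zA-Z0-9], because \w also matches the underscore ( _ ) character, which I had forgotten when writing this answer.

For the code, you can use something like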

~\b(?=[a-zA-Z0-9]{5}\b)[a-zA-Z0-9]*\d[a-zA-Z0-9]*~

which can be broken down into

\b                  # a word boundary (beginning of word in this case)
(?=                 # from here on...
    [a-zA-Z0-9]{5}  # 5 alphanumeric characters (a-z,A-Z,0-9)
    \b              # followed by a word boundary (end of word)
)
[a-zA-Z0-9]*        # 0 or more alphanumeric characters (a-z,A-Z,0-9)
\d                  # a digit
[a-zA-Z0-9]*        # 0 or more alphanumeric characters (a-z,A-Z,0-9)
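
To see how this pattern behaves, it can be run over the four sample messages from the question. The variable names below are only illustrative.

```php
<?php
// The code pattern from this answer, unchanged.
$codePattern = '~\b(?=[a-zA-Z0-9]{5}\b)[a-zA-Z0-9]*\d[a-zA-Z0-9]*~';

$samples = [
    'resuelvete 15C20 Pdero Perez c.i. V55.555.555,',
    'Pedro Perez resuelvete 15c20 55 555 555,',
    '15c20 Resuelvete 555555 Pedro Perez,',
    'Resuelvete 555555 Pedro Perez 15c20',
];

$codes = [];
foreach ($samples as $sms) {
    // Keep the first match per message, or null when nothing matches.
    $codes[] = preg_match($codePattern, $sms, $m) === 1 ? $m[0] : null;
}

print_r($codes); // 15C20, 15c20, 15c20, 15c20
```

Since the pattern only requires a digit, a 5-digit identity document such as 55555 would also satisfy it, so a message where the ID comes first may need an extra check.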

For the ID document, if there are no other possible groupings besides the examples you provided, you can use something like

'~(?<=\b|\bE|\bV)(?=[\d\ .]{5,10}\b)(
    \d{8}|
    \d{5}|
    \d{3}[\ .]\d{3}|
    \d{2}([\ .])\d{3}(\2\d{3})?|
    \d([\ .])\d{3}\4\d{3}
)~x'

This reads as

(?<=                        # preceded by
    \b|                     # a word boundary (beginning of word) or
    \bE|                    # a word boundary (beginning of word) and E or
    \bV                     # a word boundary (beginning of word) and V
)
(?=                         # from here on...
    [\d\ .]{5,10}           # 5 to 10 characters, each a digit, a space or a dot
    \b                      # followed by a word boundary (end of word)
)
(                           # either:
    \d{8}|                  # an 8 digit number
    \d{5}|                  # a 5 digit number
    \d{3}[\ .]\d{3}|        # a 3 digit number, space or dot, a 3 digit number
    \d{2}([\ .])\d{3}       # a 2 digit number, space or dot, a 3 digit number
    (                       # optionally...
        \2                  # previous sign (space or dot)
        \d{3}               # a 3 digit number
    )?| 
    \d([\ .])\d{3}\4\d{3}   # a 1 digit number, space or dot, a 3 digit number, previous sign, a 3 digit number
)
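
A minimal sketch of applying this pattern with preg_match follows; the helper name extractId is hypothetical, and the pattern itself is the one from this answer, unchanged.

```php
<?php
// Hypothetical helper: returns the digits of the first ID found, or null.
function extractId(string $sms): ?string
{
    // x modifier: whitespace and line breaks inside the pattern are ignored.
    $idPattern = '~(?<=\b|\bE|\bV)(?=[\d\ .]{5,10}\b)(
        \d{8}|
        \d{5}|
        \d{3}[\ .]\d{3}|
        \d{2}([\ .])\d{3}(\2\d{3})?|
        \d([\ .])\d{3}\4\d{3}
    )~x';
    return preg_match($idPattern, $sms, $m) === 1 ? $m[1] : null;
}

var_dump(extractId('resuelvete 15C20 Pdero Perez c.i. V55.555.555,')); // string(10) "55.555.555"
var_dump(extractId('Pedro Perez resuelvete 15c20 55 555 555,'));       // string(10) "55 555 555"
```

Because the V or E prefix is checked by the lookbehind, it is not part of the returned match. Also note that an unspaced 6- or 7-digit ID such as 555555 is not among the listed groupings, so with this pattern only its first five digits would be matched by the \d{5} branch.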
