
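I've been trying to "parse" some data using a regular expression, and I feel like I'm close, but I just can't seem to bring it all home.
The data that needs parsing generally looks like this: <param>: <value>\n. The number of params can vary, as can the values. Still, here's an example: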

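FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of \n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_ using `\`), and even basic markup!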

To push this text into an object, I've put together this little expression:

// Note: a literal backslash inside a single-quoted PHP pattern needs four backslashes
if (preg_match_all('/^([^:\n\\\\]+):\s*(.+)/m', $this->structuredMessage, $data))
{
    $data = array_combine($data[1], $data[2]);
    //$data is assoc array FooID => 123456, Name => Chuck, ...
    $report = new Report($data);
}

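Now, this works fine in most cases, except for the User Message bit: the dot (.) doesn't match newlines, and if I were to use the s flag, the second group would match everything after FooID: up to the very end of the string.
I've had to use a dirty workaround for this: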

$msg = explode(end($data[1]), $string);
$data[2][count($data[2])-1] = array_pop($msg);

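After some testing, I came to understand that sometimes one or two parameters aren't filled in (for example, InternalID can be empty). In that case my expression doesn't fail, but instead results in: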

    [1] => Array
        (
            [0] => FooID
            [1] => Name
            [2] => When
            [3] => InternalID
        )

    [2] => Array
        (
            [0] => 123456
            [1] => Chuck
            [2] => 01/02/2013 01:23:45
            [3] => User Message: Hello,
        )
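
A likely cause, shown in the minimal sketch below: \s* also matches newlines, so when InternalID: has no value, \s* consumes the line break and (.+) captures the whole next line. The sample string and the [^\S\n]* variant (whitespace except newline, combined with (.*) so a value may be empty) are illustrative assumptions, not part of the original code.

```php
<?php
// Hypothetical sample input: InternalID has no value.
$input = "FooID: 123456\nInternalID:\nUser Message: Hello,";

// Original pattern: \s* can swallow the newline after "InternalID:".
preg_match_all('/^([^:\n\\\\]+):\s*(.+)/m', $input, $greedy);
$greedyMap = array_combine($greedy[1], $greedy[2]);

// Variant: [^\S\n]* stays on the current line, and (.*) accepts an empty value.
preg_match_all('/^([^:\n\\\\]+):[^\S\n]*(.*)/m', $input, $strict);
$strictMap = array_combine($strict[1], $strict[2]);

print_r($greedyMap); // InternalID => "User Message: Hello,"
print_r($strictMap); // InternalID => "", User Message => "Hello,"
```

This only addresses the empty-value case; multi-line values still need separate handling.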

I've been trying various other expressions, and came up with this:

/^([^:\n\\]++)\s{0,}:(.*+)(?!^[^:\n\\]++\s{0,}:)/m
//or:
/^([^:\n\\]+)\s{0,}:(.*)(?!^[^:\\\n]+\s{0,}:)/m

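The second version is slightly slower.
This solves the problem I had with InternalID: <void>, but still leaves me with one last obstacle: User Message: <multi-line>. Using the s flag doesn't work with my expression at the moment.
All I can think of is this: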

^([^:\n\\]++)\s{0,}:((\n(?![^\n:\\]++\s{0,}:)|.)*+)

Which, to my eye at least, is too complicated to be the only option. Ideas, suggestions, links, ... anything would be greatly appreciated.

4 Answers

I'm pretty new to PHP, so maybe this is completely off the mark, but perhaps you could use something like this:

$data = <<<EOT
FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many     lines
And can start with any number of \n's. It can be empty, too
EOT;

if ($key = preg_match_all('~^[^:\n]+?:~m', $data, $match)) {
    $val = explode('¬', preg_filter('~^[^:\n]+?:~m', '¬', $data));

    array_shift($val);

    $res = array_combine($match[0], $val);
}

print_r($res);

yields

Array
(
    [FooID:] =>  123456
    [Name:] =>  Chuck
    [When:] =>  01/02/2013 01:23:45
    [InternalID:] =>  789654
    [User Message:] =>  Hello,
this is nillable, but can be quite long. Text can be spread out over many     lines
And can start with any number of 
's. It can be empty, too
)
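
Note that the keys above keep their trailing colon and the values their leading space. A minimal sketch of cleaning both up, assuming an array shaped like the $res printed above (the sample here is abbreviated and made up for the example):

```php
<?php
// Abbreviated stand-in for the $res array printed above.
$res = array('FooID:' => ' 123456', 'User Message:' => " Hello,\nmore text");

$clean = array();
foreach ($res as $key => $value) {
    // Drop the trailing colon from the key and surrounding whitespace from the value.
    $clean[rtrim($key, ':')] = trim($value);
}

print_r($clean);
```

Be aware that trim() also strips leading and trailing newlines, which may not be wanted if the user message is allowed to start with blank lines.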
answered 2013-07-10T13:42:53.427

The following regex should work, but I'm not sure it's the right tool for the job:

preg_match_all(
    '%^            # Start of line
    ([^:]*)        # Match anything until a colon, capture in group 1
    :\s*           # Match a colon plus optional whitespace
    (              # Match and capture in group 2:
     (?:           # Start of non-capturing group (used for alternation)
      .*$          #  Either match the rest of the line
      (?=          #  only if one of the following follows here:
       \Z          #  The end of the string
      |            #  or
       \r?\n       #  a newline
       [^:\n\\\\]* #  followed by anything except colon, backslash or newline
       :           #  then a colon
      )            #  End of lookahead
     |             # or match
      (?:          #  Start of non-capturing group (used for alternation/repetition)
       [^:\\\\]    #  Either match a character except colon or backslash
      |            #  or
       \\\\.       #  match any escaped character
      )*           #  Repeat as needed (end of inner non-capturing group)
     )             # End of outer non-capturing group
    )              # End of capturing group 2
    $              # Match the end of the line%mx', 
    $subject, $result, PREG_PATTERN_ORDER);

See it on regex101.

answered 2013-07-10T13:51:31.140

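I think I'd avoid using a regex for this task and split it into sub-tasks instead.

Basic algorithm outline

  1. Split the string on \n using explode
  2. Loop over the resulting array
    1. Split the resulting string on : (also using explode), with a limit of 2.
    2. If the length of the resulting array is less than 2, append the whole of the data to the previous key's value
    3. Otherwise, use the first array index as the key and the second as the value, unless the split colon is escaped (in which case append key + split + value to the previous key's value instead)

This algorithm does assume that there are no keys with escaped colons. Escaped colons in values (i.e. the user input) are handled just fine.

Code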

$str = <<<EOT
FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
This\: works too. And can start with any number of \\n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_


using `\`) like so `\:`, and even basic markup!
EOT;

$arr = explode("\n", $str);

$prevKey = '';
$split = ': ';
$output = array();
for ($i = 0, $arrlen = sizeof($arr); $i < $arrlen; $i++) {
  $keyValuePair = explode($split, $arr[$i], 2);
  // ?: Is this a valid key/value pair
  if (sizeof($keyValuePair) < 2 && $i > 0) {
    // -> Nope, append the value to the previous key's value
    $output[$prevKey] .= "\n" . $keyValuePair[0];
  }
  else {
    // -> Maybe
    // ?: Did we miss an escaped colon
    if (substr($keyValuePair[0], -1) === '\\') {
      // -> Yep, this means this is a value, not a key/value pair append both key and
      // value (including the split between) to the previous key's value ignoring
      // any colons in the rest of the string (allowing dates to pass through)
      $output[$prevKey] .= "\n" . $keyValuePair[0] . $split . $keyValuePair[1];
    }
    else {
      // -> Nope, create a new key with a value
      $output[$keyValuePair[0]] = $keyValuePair[1];
      $prevKey = $keyValuePair[0];
    }
  }
}

var_dump($output);

Output

array(5) {
  ["FooID"]=>
  string(6) "123456"
  ["Name"]=>
  string(5) "Chuck"
  ["When"]=>
  string(19) "01/02/2013 01:23:45"
  ["InternalID"]=>
  string(0) ""
  ["User Message"]=>
  string(293) "Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
This\: works too. And can start with any number of \n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_


using `\`) like so `\:`, and even basic markup!"
}

Online demo
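
The parsed values still carry the backslash that escapes each colon. If the consumer wants plain text, the escapes can be removed afterwards. A small sketch, assuming \: is the only escape sequence in use (the question does not say whether others exist); the sample value is a stand-in:

```php
<?php
// Stand-in for one parsed value; the real one would come from $output.
$value = 'This\: works too, like so `\:`';

// Replace every escaped colon with a plain colon.
$plain = str_replace('\:', ':', $value);

echo $plain, "\n";
```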

answered 2013-07-10T12:48:50.887

So here's what I came up with, using a somewhat tricky preg_replace_callback():

$string ='FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of \n\'s. It can be empty, too
Yellow:cool';

$array = array();
preg_replace_callback('#^(.*?):(.*)|.*$#m', function($m)use(&$array){
    static $last_key = ''; // We are going to use this as a reference
    if(isset($m[1])){// If there is a normal match (key : value)
        $array[$m[1]] = $m[2]; // Then add to array
        $last_key = $m[1]; // define the new last key
    }else{ // else
        $array[$last_key] .= PHP_EOL . $m[0]; // add the whole line to the last entry
    }
}, $string); // Anonymous function used thus PHP 5.3+ is required
print_r($array); // print

Online demo

Drawback: I'm using PHP_EOL to add the newlines, which is OS-dependent.
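
One way around that, sketched under the assumption that the input may mix \r\n and \n line endings: normalize the line endings first, then join continuation lines with a literal "\n" instead of PHP_EOL. The helper name normalizeNewlines is made up for this example.

```php
<?php
// Hypothetical helper: convert Windows (\r\n) and old Mac (\r) line endings to \n.
function normalizeNewlines($text)
{
    // "\r\n" is replaced first, so a CRLF pair does not become two newlines.
    return str_replace(array("\r\n", "\r"), "\n", $text);
}

$raw = "Name: Chuck\r\nUser Message: Hello,\r\nsecond line";
$normalized = normalizeNewlines($raw);

var_dump($normalized === "Name: Chuck\nUser Message: Hello,\nsecond line");
```

With the input normalized up front, the parsed values no longer depend on the platform the script runs on.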

answered 2013-07-10T12:51:43.050