3

想象一个这样的字符串:

field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2))

我想得到一个这样的数组:

array(
    field1 => array(),
    field2 => array(subfield1),
    field3 => array(
        subfield2,
        subfield3
    ),
    field4 => array(),
    field5 => array(
        subfield4 => array(
            subsubfield => array(),
            subsubfield => array()
        )
    )
)

我有这个正则表达式[a-zA-Z0-9]*\([^()]*(?:(?R)[^()]*)*\),它可以输出一些工作:

array(
    field1,
    field2(subfield1),
    field3(subfield2,subfield3),
    field4(),
    field5(subfield4(subsubfield,subsubfield2))
)

虽然这不是我想要的。我现在有点卡住了,但到目前为止我提出的选择是:

  1. 用 preg_replace_callback 做点什么
  2. 为这些类型的分层逗号分隔字符串编写某种自定义解析器
  3. 退出,喝醉,明天破案(好吧,不是一个选择,现在必须这样做)

对于它的价值,无论如何我都必须遍历字段和子字段。我有一些代码使用给定的正则表达式,并在以后需要时对其值运行相同的匹配。我想一次解析整个字符串,包括它的嵌套子字符串。

有谁知道我是如何开始的?哪个选项是最好的(或更好的)方法?(可读性 vs 资源使用 vs 复杂性 vs 等等)

4

3 回答 3

4

介绍

您描述的问题不能用常规语言表示,因为常规语言无法平衡括号。然而,多年来,大多数正则表达式实现都添加了一些特性,允许解析比正则语言更复杂的语言。特别是,这个问题可以通过 .NET 的平衡匹配或PCRE 的递归表达式来解决(感谢@Gumbo,在评论中指出了这一点)。

但是,仅仅因为您可以做某事并不意味着您应该做某事。将正则表达式用于此类任务的问题在于,随着您扩展元语言,修改正则表达式将变得更加困难。而解析器往往更具可塑性且易于扩展。

因此,您可能可以构建一系列正则表达式来涵盖输入的非病态情况,但是当您可以编写一个解析器时,为什么还要尝试呢?它们易于维护,速度极快(比正则表达式快),易于扩展,而且启动起来很有趣。

本来我错过了这个问题是在寻找PHP的解决方案,所以我用JavaScript写了它。我将它翻译成 PHP,并在文章末尾留下了原始的 JavaScript 解决方案。

PHP解决方案

function parse( $s ) {
    // we will always have a "current context".  the current context is the array we're
    // currently operating in.  when we start, this is simply an empty array.  as new
    // arrays are created, this context will change.
    $context = array();
    // since we have to keep track of how deep our context is, we keep a context stack
    $contextStack = array(&$context);
    // this accumulates the name of the current array
    $name = '';
    for( $i=0; $i<strlen($s); $i++ ) {
        switch( $s[$i] ) {
            case ',':
                // if the last array hasn't been added to the current context
                // (as will be the case for arrays lacking parens), we add it now
                if( $name!='' && !array_key_exists( $name, $context ) )
                    $context[$name] = array();
                // reset name accumulator
                $name = '';
                break;
            case '(':
                // we are entering a subcontext

                // save a new array in the current context; this will become our new context
                $context[$name] = array();
                // switch context and add to context stack
                $context = &$context[$name];
                $contextStack[] = &$context;
                // reset name accumulator
                $name = '';
                break;
            case ')':
                // we are exiting a context

                // if we haven't saved this array in the current context, do so now
                if( $name!='' && !array_key_exists( $name, $context ) )
                    $context[$name] = array();
                // we can't just assign $context the return value of array_pop because
                // that does not return a reference
                array_pop($contextStack);
                if( count($contextStack) == 0 ) throw new Exception( 'Unmatched parenthesis' );
                $context = &$contextStack[count($contextStack)-1];
                // reset name accumulator
                $name = '';
                break;
            default:
                // this is part of the field name
                $name .= $s[$i];
        }
    }
    // add any trailing arrays to the context (this will cover the case
    // where our input ends in an array without parents)
    if( $name!='' && !array_key_exists( $name, $context ) )
        $context[$name] = array();
    if( count( $contextStack ) != 1 ) throw new Exception( 'Unmatched parenthesis' );
    return array_pop( $contextStack );
}

原创 JavaScript 解决方案

function parse(s) {
    var root = { parent: null, children: [] };
    var field = { parent: root, name: '', start_idx: 0, children: [] };
    root.children.push( field );
    for( var i=0; i<s.length; i++ ) {
        switch( s[i] ) {
            case ',':
                // if this field didn't have any children, we have to set its text
                if( !field.children.length )
                    field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
                // start a new field; create new field and change context
                var newfield = { parent: field.parent, name: '', start_idx: i, children:[] };
                field.parent.children.push(newfield);
                field = newfield;
                break;
            case '(':
                // start of a subfield; create subfield and change context
                var subfield = { parent: field, name: '', start_idx: i, children:[] };
                field.children.push(subfield);
                field = subfield;
                break;
            case ')':
                // end of a subfield; fill out subfield details and change context
                if( !field.parent ) throw new Error( 'Unmatched parenthesis!' );
                field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
                if( field.text==='()' ) {
                    // empty subfield; pop this subfield so it doesn't clutter the parent
                    field.parent.children.pop();
                }
                field = field.parent;
                break;
            default:
                // this is part of the field name
                field.name += s[i];
                field.name = field.name.trim();
        }
    }
    return root;
}

现在我们已经有了你的语言的解析树,我们可以很容易地创建一些递归代码来发出你的 PHP:

function formatphp_namedarray(arr,indent,lastchild) {
    var php = indent + arr.name + ' => array(';
    if( arr.children.length ) {
        if( arr.children.length===1 && arr.children[0].length===0 ) {
            php += arr.children[0].name;
        } else {
            php += '\n';
            indent += '\t';
            for( var i=0; i<arr.children.length; i++ )
                php += formatphp_namedarray(arr.children[i],indent,i===arr.children.length-1);
            indent = indent.replace(/\t$/,'');
            php += indent;
        }
    }
    php += (lastchild?')':'),') + '\n';
    return php;
}

function formatphp(t) {
    var php = 'array(\n';
    for( var i=0; i<t.children.length; i++ )
        php += formatphp_namedarray( t.children[i], '\t', i===t.children.length-1 );
    php += ')'
    return php;
}       

在这里看到这一切:http: //jsfiddle.net/6bguY/1/

于 2013-07-07T00:15:49.490 回答
1

好的,我认为这可以解决问题:

    $a = "field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2,limit:50,offset:0))";
    $output = array();
    $outputStacktrace = array(&$output);
    $depth = 0;
    $buffer = $key = '';
    $m = memory_get_usage();
    for ($i = 0; $i < strlen($a); $i++)
        if ($a[$i] == ':') {
            $key = $buffer;
            $buffer = '';
        } elseif ($a[$i] == ',') {
            if (strlen($buffer))
                $outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
            $buffer = $key = '';
        } elseif ($a[$i] == '(') {
            $outputStacktrace[$depth][$buffer] = array();
            $outputStacktrace[$depth + 1] = &$outputStacktrace[$depth][$buffer];
            $depth++;
            $buffer = '';
        } elseif ($a[$i] == ')') {
            if (strlen($buffer))
                $outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
            $buffer = $key = '';
            unset($outputStacktrace[$depth]);
            $depth--;
        } else
            $buffer .= $a[$i];
    var_dump($output);

都在一个循环中。很满意。评论非常感谢!

编辑:包括固定变量名称,固定最后一个非嵌套字符截断,添加“特殊”查询元素:限制和偏移。

<?php
    $output = array();
    $outputStack = array(&$output);
    $depth = 0;
    $buffer = $key = '';
    $s = count($a);
    for ($i = 0; $i < $s; $i++)
        if ($a[$i] == ':') {
            $key = $buffer;
            $buffer = '';
        } elseif ($a[$i] == ',' || $a[$i] == ')' || $i == $s - 1) {
            if ($depth < 4) {
                if ($i == $s - 1)
                    $buffer .= $a[$i];

                if (strlen($buffer))
                    if (strlen($key))
                        if (($key == 'limit' || $key == 'offset') && ((int) $buffer) . "" == $buffer)
                            $outputStack[$depth][$key] = $buffer;
                        else
                            $outputStack[$depth]['filters'][$key] = $buffer;
                    else
                        $outputStack[$depth]['fields'][] = $buffer;
            }

            $buffer = $key = '';

            if ($a[$i] == ')') {
                array_pop($outputStack);
                $depth--;
            }
        } elseif ($a[$i] == '(') {
            if ($depth + 1 < 4) {
                $outputStack[$depth]['connections'][$buffer] = array('fields' => array('id'));
                $outputStack[$depth + 1] = &$outputStack[$depth]['connections'][$buffer];
            }
            $depth++;
            $buffer = '';
        } else
            $buffer .= $a[$i];
    unset($outputStack, $depth, $buffer, $key, $a, $i);
于 2013-07-06T23:36:36.833 回答
0

这是一个简单的解析器,可以解决您的问题:

function parse($input) {    
    preg_match_all('/[,()]|[^,()]+/', $input, $tokens);
    // reference to current node list to work with during traversal
    $ref = &$output;
    // stack to remember node lists
    $stack = array();
    $output = array();
    foreach ($tokens[0] as $token) {
        switch ($token) {
        case '(':
            // push reference to current node list on the stack and
            // update reference
            $stack[] = &$ref;
            $ref = &$ref[$last];
            break;
        case ')':
            // restore previous node list from stack
            $ref = &$stack[count($stack)-1];
            array_pop($stack);
            if (is_null($ref)) echo "error";
            break;
        case ',':
            break;
        default:
            // insert token into current node list
            $ref[$token] = array();
            $last = $token;
            break;
        }
    }
    if (!empty($stack)) echo "error";
    return $ouput
}

$input = 'field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2))';
var_dump(parse($input));

请注意,这只是一个简单的解析器,它没有实现所描述语言的完整有限状态自动机,例如,,后面的 a)必须不存在。

但是如果你愿意,你可以添加一些状态机制。只需让每个阶段都有case自己的阶段,并检查允许哪些先前状态进入新状态。

于 2013-07-07T20:09:06.710 回答