php - Performance of tokenizing CSS in PHP

Question

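This is a noob question from someone who has never written a parser/lexer before.

I'm writing a tokenizer/parser for CSS in PHP (please don't reply with "OMG, why in PHP?"). The W3C has written the grammar down neatly here (CSS2.1) and here (CSS3, draft).

It is a list of 21 possible tokens, all but two of which cannot be represented as static strings.

My current approach is to loop over an array containing the 21 patterns again and again, run an if (preg_match()) on each, and shrink the source string match by match. In principle this works well. However, for a 1000-line CSS string it takes between 2 and 8 seconds, which is too long for my project.

Now I'm wondering how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but are there any obvious D'Oh! mistakes that I have fallen into?

I made some optimizations, such as checking for '@', '#' or '"' as the first character of the remaining string and then applying only the relevant regexp, but this did not bring any significant performance gain.

My code (snippet) so far: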

$TOKENS = array(
  'IDENT' => '...regexp...',
  'ATKEYWORD' => '@...regexp...',
  'String' => '"...regexp..."|\'...regexp...\'',
  //...
);

$string = '...CSS source string...';
$stream = array();

// we reduce $string token by token
while ($string != '') {
    $string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
        // start is insignificant but doing a trim reduces exec time by 25%
    $matches = array();
    // loop through all possible tokens
    foreach ($TOKENS as $t => $p) {
        // The '&' is used as delimiter, because it isn't used anywhere in
        // the token regexps
        if (preg_match('&^'.$p.'&Su', $string, $matches)) {
            $stream[] = array($t, $matches[0]);
            $string = substr($string, strlen($matches[0]));
            // Yay! We found one that matches!
            continue 2;
        }
    }
    // if we come here, we have a syntax error and handle it somehow
}

// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content

score 3 · Accepted Answer

Use a lexer generator.

Answered 2010-04-09T20:08:22.717

score 0 · Accepted Answer

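The first thing I would do is get rid of preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are using preg_match() to look for a specific token at the front of the string and then simply cutting off that many characters from the front. You can do this easily with a plain substr() instead, like this: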

foreach ($TOKENS as $t => $p)
{
    // note: here $p is a literal token string, not a regexp
    $len = strlen($p);  // this could be pre-stored in $TOKENS
    $front = substr($string, 0, $len);
    if ($front === $p) {
        // store the matched token text, not the whole remaining string
        $stream[] = array($t, $front);
        $string = substr($string, $len);
        // Yay! We found one that matches!
        continue 2;
    }
}

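You could optimize this further by pre-calculating the lengths of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sort $TOKENS into groups by length, you can also reduce the number of substr() calls: take the substr() of the current string just once per token length, and run through all the tokens of that length before moving on to the next group of tokens.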
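The grouping idea could be sketched roughly as follows. This is only an illustration and it only applies to tokens that really are fixed strings; the token table, `group_by_length()` and `match_static()` are made-up names for this example, not part of the question's code.

```php
<?php
// Hypothetical table of static tokens (fixed strings only).
$STATIC_TOKENS = array(
    'CDO'       => '<!--',
    'CDC'       => '-->',
    'INCLUDES'  => '~=',
    'DASHMATCH' => '|=',
);

// Group tokens by length once: array(length => array(text => type)).
function group_by_length(array $tokens) {
    $groups = array();
    foreach ($tokens as $type => $text) {
        $groups[strlen($text)][$text] = $type;
    }
    krsort($groups); // try the longest tokens first
    return $groups;
}

// Return array(type, text) for the static token found at $offset, or null.
function match_static($string, $offset, array $groups) {
    foreach ($groups as $len => $byText) {
        // one substr() per distinct token length, then a hash lookup
        $front = substr($string, $offset, $len);
        if (isset($byText[$front])) {
            return array($byText[$front], $front);
        }
    }
    return null;
}

$groups = group_by_length($STATIC_TOKENS);
$token  = match_static('a~=b', 1, $groups); // looks at the text after "a"
```

Since most of the 21 CSS tokens are not static strings, something like this would at best handle the few fixed ones and leave the rest to another technique.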

score 0 · Accepted Answer

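A (probably) faster (but less memory-friendly) approach would be to tokenize the whole stream at once, using one big regexp with an alternative for each token, like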

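 // With the x flag an unescaped "#" starts a comment, so it is written as \#.
 // PREG_SET_ORDER yields one entry per matched token.
 preg_match_all('/
       (...string...)
       |
       (@ident)
       |
       (\#ident)
       ...etc
   /x', $string, $tokens, PREG_SET_ORDER);

 foreach($tokens as $token)...parse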
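A minimal runnable sketch of the one-big-regexp idea, assuming much simplified patterns (these are illustrative and not the W3C token grammar). Instead of preg_match_all() it calls preg_match() with an offset and the A (anchored) modifier, and uses the PCRE verb (*MARK:name), which PHP reports under the 'MARK' key, to tell which alternative matched. `build_regex()` and `tokenize()` are hypothetical helper names.

```php
<?php
// Simplified, illustrative patterns; NOT the full W3C token grammar.
$TOKENS = array(
    'S'         => '[ \t\r\n\f]+',
    'STRING'    => '"[^"\n]*"',
    'ATKEYWORD' => '@[a-zA-Z_-][a-zA-Z0-9_-]*',
    'HASH'      => '#[a-zA-Z0-9_-]+',
    'IDENT'     => '[a-zA-Z_-][a-zA-Z0-9_-]*',
    'DELIM'     => '.',  // fallback: any single character
);

// Combine all patterns into one alternation; each branch sets a MARK
// carrying its token type when it matches.
function build_regex(array $tokens) {
    $parts = array();
    foreach ($tokens as $type => $pattern) {
        $parts[] = '(?:' . $pattern . ')(*MARK:' . $type . ')';
    }
    // A = anchor at the given offset, s = "." also matches newlines
    return '~' . implode('|', $parts) . '~As';
}

function tokenize($css, array $tokens) {
    $regex  = build_regex($tokens);
    $stream = array();
    $offset = 0;
    $length = strlen($css);
    while ($offset < $length) {
        // Match at $offset without copying or shrinking $css.
        if (!preg_match($regex, $css, $m, 0, $offset)) {
            throw new RuntimeException("Unexpected input at offset $offset");
        }
        $stream[] = array($m['MARK'], $m[0]);
        $offset += strlen($m[0]);
    }
    return $stream;
}

$stream = tokenize('@media #a "x"', $TOKENS);
```

The order of the alternatives matters, because the first branch that matches wins. Whether this is actually faster than 21 separate preg_match() calls would need to be measured on real input.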

score 0 · Accepted Answer

Don't use regular expressions; scan character by character.

$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
while ($i < $length) {
  $buf = '';
  $char = $string[$i];
  // compare characters with characters ($char is a string, not an ord() value)
  if ($char >= 'A' && $char <= 'Z' || $char >= 'a' && $char <= 'z' || $char === '_' || $char === '-') {
    // identifier: consume letters, '_' and '-' until something else shows up
    do {
      $buf .= $char;
      $i++;
      $char = $i < $length ? $string[$i] : ''; // '' ends the loop at end of input
    } while ($char >= 'A' && $char <= 'Z' || $char >= 'a' && $char <= 'z' || $char === '_' || $char === '-');
    $tokens[] = array('IDENT', $buf);
  } else if (......) {
    // ......
  }
}

However, this makes the code unmaintainable, so a parser generator is the better choice.

score 0 · Accepted Answer

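This is an old post, but here are my 2 cents anyway. One thing that seriously slows down the original code in the question is the following line: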

$string = substr($string, strlen($matches[0]));

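Instead of working on the entire string, take only a part of it (say 50 characters), which is enough for all the possible regexes. Then apply the same line of code to that part. When this string shrinks below a preset length, load more data into it.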
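A rough sketch of that windowing idea, assuming no single token is longer than the window (a longer one would be cut in two). `tokenize_windowed()` and its parameters are hypothetical names. The patterns are applied without the u modifier here, because a window boundary may fall in the middle of a multi-byte character.

```php
<?php
// Hypothetical helper: runs the question's matching loop on a small buffer
// so that each substr() copies at most a couple of windows' worth of bytes.
// Assumption: no token is longer than $window bytes.
function tokenize_windowed($css, array $tokens, $window = 50) {
    $stream = array();
    $buf    = '';            // the small working string
    $pos    = 0;             // how much of $css has been loaded into $buf
    $total  = strlen($css);
    while (true) {
        $buf = ltrim($buf, " \t\r\n\f");
        // Refill when the buffer is short and input remains, then re-trim.
        if (strlen($buf) < $window && $pos < $total) {
            $buf .= substr($css, $pos, $window);
            $pos += $window;
            continue;
        }
        if ($buf === '') {
            break;           // buffer and input are both exhausted
        }
        foreach ($tokens as $type => $pattern) {
            if (preg_match('&^' . $pattern . '&S', $buf, $m)) {
                $stream[] = array($type, $m[0]);
                $buf = substr($buf, strlen($m[0])); // copies only the buffer
                continue 2;
            }
        }
        throw new RuntimeException('Syntax error near: ' . substr($buf, 0, 20));
    }
    return $stream;
}

// Illustrative patterns only, not the W3C grammar.
$patterns = array(
    'IDENT' => '[a-zA-Z_-][a-zA-Z0-9_-]*',
    'DELIM' => '.',
);
$small = tokenize_windowed('color : red ;', $patterns, 8);
```

The 50-character default simply follows the figure given above; long strings, URLs or comments would need a larger window or a different strategy.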
