php - Pain points with PHP regular expressions

Question

I wrote a small piece of code to convert tables into divs for our mobile site.

Here is an excerpt of the code:

function replaceTables($table, $html) {

            $tempTable = preg_replace('/<table[^>]*>(.*?)<\/table>/is', '<div style="width: 90%; margin: auto;">$1</div><div style="clear: both;"></div>', $table);
            $html = str_replace($table, $tempTable, $html);

            preg_match_all('/(?!<table[^>]*>).*?<tr[^>]*>.*?<\/tr>.*?(?<!<\/table>)/is', $tempTable, $rows, PREG_OFFSET_CAPTURE);

            for ($i = 0; $i < count($rows[0]); $i++) {
                $tempRow = $rows[0][$i][0];

                preg_match_all('/(?!<table[^>]*>).*?<td[^>]*>.*?<\/td>.*?(?<!<\/table>)/is', $tempRow, $cols, PREG_OFFSET_CAPTURE);

                $numCols = count($cols[0]);
                $colWidth = 100/$numCols;

                for ($x = 0; $x < $numCols; $x++) {
                    $tempCol = $cols[0][$x][0];
                    $cols[0][$x][0] = preg_replace('/<td[^>]*>(.*?)<\/td>/is', '<div style="width: ' . $colWidth . '%; float: left;">$1</div>', $cols[0][$x][0]);
                    $tempRow = str_replace($tempCol, $cols[0][$x][0], $tempRow);
                }

                $tempRow = preg_replace('/<tr[^>]*>(.*?)<\/tr>/is', '<div style="clear: both;">$1</div>', $tempRow);
                $tempTable = str_replace($rows[0][$i][0], $tempRow, $tempTable);
            }

            $html = str_replace($table, $tempTable, $html);

            return $html;
        }

        if ($mobile && $page->type_id != 16) {
            // replace tables with divs for better mobile support

            preg_match_all('/<table[^>]*>.*?<\/table>/is', $this->html, $tables, PREG_OFFSET_CAPTURE);

            for ($y = 0; $y < count($tables[0]); $y++) {
                preg_match_all('/<table[^>]*>.*?<\/table>/is', $tables[0][$y][0], $nestedTables, PREG_OFFSET_CAPTURE);

                if (count($nestedTables[0]) > 0) {
                    //echo count($nestedTables[0]) . "<br />";
                    //print_r($nestedTables[0][0][0]);
                    // use a separate counter so the outer loop's $y is not overwritten
                    for ($n = 0; $n < count($nestedTables[0]); $n++) {
                        $this->html = replaceTables($nestedTables[0][$n][0], $this->html);
                    }
                }
                $this->html = replaceTables($tables[0][$y][0], $this->html);
            }
            //$this->html = preg_replace('/<table[^>]*>(.*?)<\/table>/is', '<div style="width: 90%; margin: auto;">$1<div style="clear: both;"></div></div>', $this->html);
        }
        return $this->html;

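I'm running into a problem with nested tables: the regex matches the first closing table tag it encounters, not the one I need it to find.

If anyone can point me to a better regex, or to a different solution for replacing tables with divs, that would be great. The solution has to work by manipulating strings, so that we don't have to overhaul our template system.

Thanks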

score 1 · Accepted Answer

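As many have said, parsing HTML with regular expressions is unlikely to be the ideal approach. Still, I did some research to try to help, assuming you have to use this approach for some reason.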

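It sounds like you may be running into an issue with how PHP handles greediness in regex patterns. I see you use a lot of ? quantifiers, which makes those parts of the pattern match non-greedily (based on what I read at http://php.net/manual/en/reference.pcre.pattern.modifiers.php, at least). You may be able to work around this by using the U modifier on some or all of your regex patterns. It inverts greediness, so quantifiers followed by ? become greedy again.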

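That said, you have a fairly complex set of regex checks there, so it is quite possible this will cause some unexpected behavior. I'd suggest testing it to see.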

For reference, you apply the U modifier by placing it after the closing / of the regex, just as you have already done in places with i and s.
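
A minimal sketch of the difference, assuming a made-up sample string with one nested table (the $html value and the variable names are illustrative, not taken from the question):

```php
<?php
// Hypothetical sample input: an outer table with one nested table.
$html = '<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>';

// Lazy quantifier (.*?): the match ends at the FIRST </table>,
// which is the behavior described in the question.
preg_match('/<table[^>]*>(.*?)<\/table>/is', $html, $lazy);

// With the U modifier the default is inverted, so .*? is greedy here
// and the match runs to the LAST </table> in the subject.
preg_match('/<table[^>]*>(.*?)<\/table>/isU', $html, $inverted);

// Same result without U: a plain greedy .* also reaches the last </table>.
preg_match('/<table[^>]*>(.*)<\/table>/is', $html, $greedy);

echo $lazy[0], "\n";
echo $inverted[0], "\n";
echo $greedy[0], "\n";
```

Keep in mind that matching greedily to the last </table> would also swallow any sibling tables that follow in the same string, so this probably shifts the problem rather than pairing the tags correctly.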

score 0 · Accepted Answer

What has worked best for me is processing the HTML content with the following steps:

Convert the content to UTF-8 using utf8_encode($s) (if it is not already UTF-8).
Convert to XHTML using $tidy->repairFile($file, array('output-xhtml'=>true), 'utf8');
Build the DOM using $sx = simplexml_load_file($file, 'SimpleXMLElement', LIBXML_NOENT);
Query the DOM using $sx->xpath($xpath);
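
As a rough illustration of the "build a DOM, then query it with XPath" idea applied to a string, here is a sketch that uses PHP's bundled DOMDocument and DOMXPath classes rather than the Tidy and SimpleXML calls listed above. The sample markup and variable names are assumptions made for the example:

```php
<?php
// Hypothetical input: an outer table containing one nested table.
$html = '<div><table><tr><td>a</td><td>'
      . '<table><tr><td>b</td></tr></table>'
      . '</td></tr></table></div>';

// Collect libxml parse warnings internally instead of emitting them,
// since real-world markup is rarely clean.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Every table in the document, nested or not.
$allTables = $xpath->query('//table');

// Only tables that contain no further table, i.e. the innermost ones.
$innermost = $xpath->query('//table[not(.//table)]');

echo $allTables->length, " tables, ", $innermost->length, " innermost\n";
```

Because the parser tracks nesting itself, the question of which closing tag belongs to which table does not come up. Note that loadHTML does not assume UTF-8 input unless the markup declares it, which is one reason an encoding step like the first one above matters.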

I hope this helps!
