
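Disclaimer: please bear with the length of this question. This is a recurring real-world problem that I've seen asked about hundreds of times, without a clear, working solution ever being presented.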

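I have hundreds of HTML files that I want to mass-indent using PHP. At first I thought of using Tidy but, as you probably know, by default it is not compatible with HTML5 tags and attributes. After some research and even more testing, I came up with the following implementation that "fakes" HTML5 support: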

function Tidy5($string, $options = null, $encoding = 'utf8')
{
    $tags = array();
    $default = array
    (
        'anchor-as-name' => false,
        'break-before-br' => true,
        'char-encoding' => $encoding,
        'decorate-inferred-ul' => false,
        'doctype' => 'omit',
        'drop-empty-paras' => false,
        'drop-font-tags' => true,
        'drop-proprietary-attributes' => false,
        'force-output' => true,
        'hide-comments' => false,
        'indent' => true,
        'indent-attributes' => false,
        'indent-spaces' => 2,
        'input-encoding' => $encoding,
        'join-styles' => false,
        'logical-emphasis' => false,
        'merge-divs' => false,
        'merge-spans' => false,
        'new-blocklevel-tags' => ' article aside audio details dialog figcaption figure footer header hgroup menutidy nav section source summary track video',
        'new-empty-tags' => 'command embed keygen source track wbr',
        'new-inline-tags' => 'btidy canvas command data datalist embed itidy keygen mark meter output progress time wbr',
        'newline' => 0,
        'numeric-entities' => false,
        'output-bom' => false,
        'output-encoding' => $encoding,
        'output-html' => true,
        'preserve-entities' => true,
        'quiet' => true,
        'quote-ampersand' => true,
        'quote-marks' => false,
        'repeated-attributes' => 1,
        'show-body-only' => true,
        'show-warnings' => false,
        'sort-attributes' => 1,
        'tab-size' => 4,
        'tidy-mark' => false,
        'vertical-space' => true,
        'wrap' => 0,
    );

    $doctype = $menu = null;

    if ((strncasecmp($string, '<!DOCTYPE', 9) === 0) || (strncasecmp($string, '<html', 5) === 0))
    {
        $doctype = '<!DOCTYPE html>'; $options['show-body-only'] = false;
    }

    $options = (is_array($options) === true) ? array_merge($default, $options) : $default;

    foreach (array('b', 'i', 'menu') as $tag)
    {
        if (strpos($string, '<' . $tag . ' ') !== false)
        {
            $tags[$tag] = array
            (
                '<' . $tag . ' ' => '<' . $tag . 'tidy ',
                '</' . $tag . '>' => '</' . $tag . 'tidy>',
            );

            $string = str_replace(array_keys($tags[$tag]), $tags[$tag], $string);
        }
    }

    $string = tidy_repair_string($string, $options, $encoding);

    if (empty($string) !== true)
    {
        foreach ($tags as $tag)
        {
            $string = str_replace($tag, array_keys($tag), $string);
        }

        if (isset($doctype) === true)
        {
            $string = $doctype . "\n" . $string;
        }

        return $string;
    }

    return false;
}

It works, but it has 2 flaws. The first: HTML comments, script tags and style tags are not indented correctly:

<link href="/_/style/form.css" rel="stylesheet" type="text/css"><!--[if lt IE 9]>
    <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!--<script type="text/javascript" src="//raw.github.com/kevinburke/tecate/master/tecate.js"></script>-->

</script><script charset="UTF-8" src="//cdnjs.cloudflare.com/ajax/libs/bootstrap-datepicker/1.0.0/js/locales/bootstrap-datepicker.pt.js" type="text/javascript">
</script><!--<script src="/3rd/parsley/i18n/messages.pt_br.js"></script>-->
    <!--<script src="//cdnjs.cloudflare.com/ajax/libs/parsley.js/1.1.10/parsley.min.js"></script>-->
    <script src="/3rd/select2/locales/select2_locale_pt-PT.js" type="text/javascript">
</script><script src="/3rd/tcrosen/bootstrap-typeahead.js" type="text/javascript">

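The other flaw is more critical: Tidy converts all menu tags to ul and insists on dropping any empty inline tag, which forces me to hack my way around it. To make the terminology clear, here are some examples:

  • <br> is an empty tag
  • <i>text</i> is an inline tag
  • <i class="icon-home"></i> is an empty inline tag (example from Font Awesome)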

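If you inspect the code, you'll notice that I've accounted for the b, i and menu tags with an imperfect str_replace hack - I could have used a more robust regular expression, or even str_ireplace, to accomplish the same thing, but for my purposes str_replace is faster and good enough. However, that still leaves out any other empty inline tags that I haven't accounted for, which sucks.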
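For comparison, a minimal sketch of the "more robust regular expression" variant mentioned above. The helper names renameTags() and restoreTags() are hypothetical and not part of Tidy5(); the sketch matches tag names case-insensitively and also catches tags without attributes, which the str_replace version skips. It is still a plain text substitution, so it will also rewrite look-alike strings inside scripts or comments.

```php
<?php
// Hypothetical helper: appends $suffix to the name of every listed tag,
// opening and closing, e.g. <i class="x"></i> becomes <itidy class="x"></itidy>.
function renameTags($html, array $tags, $suffix = 'tidy')
{
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $tags));

    // Group 1 is "<" or "</", group 2 is the tag name. The lookahead demands
    // whitespace, ">" or "/" after the name, so "b" never matches <body>.
    return preg_replace('~(</?)(' . $names . ')(?=[\s>/])~i', '${1}${2}' . $suffix, $html);
}

// Hypothetical inverse: strips the suffix again once Tidy has run.
function restoreTags($html, array $tags, $suffix = 'tidy')
{
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $tags));

    return preg_replace(
        '~(</?)(' . $names . ')' . preg_quote($suffix, '~') . '(?=[\s>/])~i',
        '${1}${2}',
        $html
    );
}

$html = '<i class="icon-home"></i><B>bold</B><body>';
$renamed = renameTags($html, array('b', 'i', 'menu'));
echo $renamed, "\n";
// prints: <itidy class="icon-home"></itidy><Btidy>bold</Btidy><body>
echo restoreTags($renamed, array('b', 'i', 'menu')), "\n";
// prints: <i class="icon-home"></i><B>bold</B><body>
```

The suffix matches the btidy, itidy and menutidy names already declared in the Tidy options above, so the two helpers could stand in for the str_replace loops without touching the configuration.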

So I turned to DOMDocument, but I soon discovered that in order for formatOutput to work I have to:

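  1. strip all the whitespace between tags (using a regex, of course: '~>[[:space:]]++<~m' replaced with '><')
  2. convert all newline combinations to \n, so that it doesn't encode \r as &#13;, for instance
  3. load the input string as HTML, output it as XML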
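As a quick illustration of steps 1 and 2, here is a small sketch that applies the same two patterns to a made-up input string. The sample markup and the helper name normalizeWhitespace() are assumptions for illustration only.

```php
<?php
// Hypothetical helper wrapping the two normalization steps listed above.
function normalizeWhitespace($html)
{
    return preg_replace(
        array(
            '~\R~u',               // any newline sequence (\r\n, \r, \n, ...)
            '~>[[:space:]]++<~m',  // whitespace that sits between two tags
        ),
        array("\n", '><'),
        $html
    );
}

$sample = "<ul>\r\n  <li>one</li>\r\n  <li>two</li>\r\n</ul>";
echo normalizeWhitespace($sample), "\n";
// prints: <ul><li>one</li><li>two</li></ul>
```

Whitespace inside text nodes is left alone; only the runs directly between a closing ">" and the next "<" are removed.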

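To my surprise, DOMDocument also has problems with empty inline tags: basically, whenever it sees <i class="icon-home"></i><someOtherTag>text</someOtherTag> or similar, it turns it into <i class="icon-home"><someOtherTag>text</someOtherTag></i>, which completely messes up how the browser renders the page. To overcome that, I found that using LIBXML_NOEMPTYTAG with DOMDocument::saveXML() turns any tag without content (including real empty tags such as <br />) into a tag with an explicit closing tag, so for instance: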

  • <i class="icon-home"></i> stays the same (as it should)
  • <br> becomes <br></br>, messing up the browser rendering (yet again)
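The difference can be reproduced with a short sketch. It assumes the DOM extension is available and, to keep the output predictable, loads a made-up fragment with loadXML() rather than loadHTML().

```php
<?php
// Made-up fragment: one empty inline tag and one void tag.
$dom = new DOMDocument();
$dom->loadXML('<p><i class="icon-home"></i><br/>text</p>');

// Default serialization collapses every element that has no content.
$collapsed = $dom->saveXML($dom->documentElement);
echo $collapsed, "\n";
// prints: <p><i class="icon-home"/><br/>text</p>

// LIBXML_NOEMPTYTAG expands all of them, including the void <br>.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);
echo $expanded, "\n";
// prints: <p><i class="icon-home"></i><br></br>text</p>
```

An HTML parser ignores the trailing slash on a non-void element such as <i>, so the collapsed form leaves the tag open and the following siblings end up nested inside it.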

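To fix that, I have to use a regular expression that looks for strings matching ~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~ and replaces them with a simple />. Another major problem with saveXML() is that it adds <![CDATA[..]]> blocks around the contents of my script and style tags, which renders those contents invalid, so I have to go back again and preg_replace those tokens. This "works":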

function DOM5($html)
{
    $dom = new \DOMDocument();

    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}

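It seems that the two most recommended and validated methods of indenting HTML don't produce correct or reliable results for HTML5 in the wild, and I have to resign myself to the dark lord Cthulhu.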

I did try other libraries, such as:

  • html5lib - couldn't get DOMDocument::$formatOutput to work
  • tidy-html5 - same problems as the regular tidy, except that it supports HTML5 tags / attributes

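At this point, I'm considering writing something that works with regular expressions only, if no better solution exists. But I thought that maybe DOMDocument could be coerced into handling HTML5 and script / style tags by using a custom XSLT. I've never played around with XSLT before, so I don't know whether this is realistic; perhaps one of you XML experts can tell me, and maybe provide a starting point.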


2 Answers


You haven't mentioned whether you intend to convert the pages for production use or for development, e.g. when debugging HTML output.

If it is the latter, and since you have already mentioned writing a regex-based solution, I have written Dindent for that purpose.

You didn't include a sample of the input and the expected output. You can test my implementation using the sandbox.

answered 2014-02-22T12:41:52.967

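To beautify my HTML5 code I have written a small PHP class. It is not perfect, but it basically does the job for my purposes, and relatively quickly. Maybe it is useful.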

<?php
namespace LBR\LbrService;

/**
 * This script has no licensing-model - do what you want to do with it.
 * 
 * This script is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *  
 * @author 2014 sunixzs <sunixzs@gmail.com>
 *
 * What does this script do?
 * Take unlovely HTML source code, temporarily remove any sections that should not
 * be processed (e.g. textarea, pre and script), then remove all spaces and line breaks
 * and define them anew by referencing some tag lists. After this, indent the newly
 * created lines, also by reference to tag lists. At the end, put the temporary stuff
 * back into the newly generated, hopefully beautiful source code.
 *
 */
class BeautifyMyHtml {

    /**
     * HTML-Tags which should not be processed.
     * Only tags with both an opening and a closing tag work: <example some="attributes">some content</example>
     * <img src="some.source" alt="" /> does not work because of the short end.
     * 
     * @var array
     */
    protected $tagsToIgnore = array (
            'script',
            'textarea',
            'pre',
            'style' 
    );

    /**
     * Code-Blocks which should not be processed are temporarily stored in this array.
     * 
     * @var array
     */
    protected $tagsToIgnoreBlocks = array ();

    /**
     * The tag to ignore at currently used runtime.
     * I had to define this in class and not local in method to get the
     * possibility to access this on anonymous function in preg_replace_callback.
     * 
     * @var string
     */
    protected $currentTagToIgnore;

    /**
     * Remove white-space before and after each line of blocks, which should not be processed?
     *
     * @var boolean
     */
    protected $trimTagsToIgnore = false;

    /**
     * Character used for indentation
     * 
     * @var string
     */
    protected $spaceCharacter = "\t";

    /**
     * Remove html-comments?
     *
     * @var boolean
     */
    protected $removeComments = false;

    /**
     * preg_replace()-Pattern which define opening tags to wrap with newlines.
     * <tag> becomes \n<tag>\n
     * 
     * @var array
     */
    protected $openTagsPattern = array (
            "/(<html\b[^>]*>)/i",
            "/(<head\b[^>]*>)/i",
            "/(<body\b[^>]*>)/i",
            "/(<link\b[^>]*>)/i",
            "/(<meta\b[^>]*>)/i",
            "/(<div\b[^>]*>)/i",
            "/(<section\b[^>]*>)/i",
            "/(<nav\b[^>]*>)/i",
            "/(<table\b[^>]*>)/i",
            "/(<thead\b[^>]*>)/i",
            "/(<tbody\b[^>]*>)/i",
            "/(<tr\b[^>]*>)/i",
            "/(<th\b[^>]*>)/i",
            "/(<td\b[^>]*>)/i",
            "/(<ul\b[^>]*>)/i",
            "/(<li\b[^>]*>)/i",
            "/(<figure\b[^>]*>)/i",
            "/(<select\b[^>]*>)/i" 
    );

    /**
     * preg_replace()-Pattern which define tags prepended with a newline.
     * <tag> becomes \n<tag>
     * 
     * @var array
     */
    protected $patternWithLineBefore = array (
            "/(<p\b[^>]*>)/i",
            "/(<h[0-9]\b[^>]*>)/i",
            "/(<option\b[^>]*>)/i" 
    );

    /**
     * preg_replace()-Pattern which define closing tags to wrap with newlines.
     * </tag> becomes \n</tag>\n
     * 
     * @var array
     */
    protected $closeTagsPattern = array (
            "/(<\/html>)/i",
            "/(<\/head>)/i",
            "/(<\/body>)/i",
            "/(<\/link>)/i",
            "/(<\/meta>)/i",
            "/(<\/div>)/i",
            "/(<\/section>)/i",
            "/(<\/nav>)/i",
            "/(<\/table>)/i",
            "/(<\/thead>)/i",
            "/(<\/tbody>)/i",
            "/(<\/tr>)/i",
            "/(<\/th>)/i",
            "/(<\/td>)/i",
            "/(<\/ul>)/i",
            "/(<\/li>)/i",
            "/(<\/figure>)/i",
            "/(<\/select>)/i" 
    );

    /**
     * preg_match()-Pattern with tag-names to increase indention.
     * 
     * @var string
     */
    protected $indentOpenTagsPattern = "/<(html|head|body|div|section|nav|table|thead|tbody|tr|th|td|ul|figure|li)\b[ ]*[^>]*[>]/i";

    /**
     * preg_match()-Pattern with tag-names to decrease indention.
     * 
     * @var string
     */
    protected $indentCloseTagsPattern = "/<\/(html|head|body|div|section|nav|table|thead|tbody|tr|th|td|ul|figure|li)>/i";

    /**
     * Constructor
     */
    public function __construct() {
    }

    /**
     * Adds a Tag which should be returned as the way in source.
     * 
     * @param string $tagToIgnore
     * @throws \RuntimeException
     * @return void
     */
    public function addTagToIgnore($tagToIgnore) {
        if (! preg_match( '/^[a-zA-Z]+$/', $tagToIgnore )) {
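            // Leading backslash: inside this namespace an unqualified name would resolve to LBR\LbrService\RuntimeException.
            throw new \RuntimeException( "Only characters from a to z are allowed as tag.", 1393489077 );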
        }

        if (! in_array( $tagToIgnore, $this->tagsToIgnore )) {
            $this->tagsToIgnore[] = $tagToIgnore;
        }
    }

    /**
     * Setter for trimTagsToIgnore.
     *
     * @param boolean $bool
     * @return void
     */
    public function setTrimTagsToIgnore($bool) {
        $this->trimTagsToIgnore = $bool;
    }

    /**
     * Setter for removeComments.
     *  
     * @param boolean $bool
     * @return void
     */
    public function setRemoveComments($bool) {
        $this->removeComments = $bool;
    }

    /**
     * Callback function used by preg_replace_callback() to store the blocks which should be ignored and set a marker to replace them later again with the blocks.
     * 
     * @param array $e
     * @return string
     */
    private function tagsToIgnoreCallback($e) {
        // build key for reference
        $key = '<' . $this->currentTagToIgnore . '>' . sha1( $this->currentTagToIgnore . $e[0] ) . '</' . $this->currentTagToIgnore . '>';

        // trim each line
        if ($this->trimTagsToIgnore) {
            $lines = explode( "\n", $e[0] );
            array_walk( $lines, function (&$n) {
                $n = trim( $n );
            } );
            $e[0] = implode( PHP_EOL, $lines );
        }

        // add block to storage
        $this->tagsToIgnoreBlocks[$key] = $e[0];

        return $key;
    }

    /**
     * The main method.
     * 
     * @param string $buffer The HTML-Code to process
     * @return string The nice looking sourcecode
     */
    public function beautify($buffer) {
        // remove blocks, which should not be processed and add them later again using keys for reference 
        foreach ( $this->tagsToIgnore as $tag ) {
            $this->currentTagToIgnore = $tag;
            $buffer = preg_replace_callback( '/<' . $this->currentTagToIgnore . '\b[^>]*>([\s\S]*?)<\/' . $this->currentTagToIgnore . '>/mi', array (
                    $this,
                    'tagsToIgnoreCallback' 
            ), $buffer );
        }

        // temporarily remove comments to keep original linebreaks
        $this->currentTagToIgnore = 'htmlcomment';
        $buffer = preg_replace_callback( "/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/ms", array (
                $this,
                'tagsToIgnoreCallback' 
        ), $buffer );

        // cleanup source
        // ... all in one line
        // ... remove double spaces
        // ... remove tabulators
        $buffer = preg_replace( array (
                "/\s\s+|\n/",
                "/ +/",
                "/\t+/" 
        ), array (
                "",
                " ",
                "" 
        ), $buffer );

        // remove comments, if 
        if ($this->removeComments) {
            $buffer = preg_replace( "/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/ms", "", $buffer );
        }

        // add newlines for several tags
        $buffer = preg_replace( $this->patternWithLineBefore, "\n$1", $buffer ); // tags with line before tag
        $buffer = preg_replace( $this->openTagsPattern, "\n$1\n", $buffer ); // opening tags
        $buffer = preg_replace( $this->closeTagsPattern, "\n$1\n", $buffer ); // closing tags


        // get the html each line and do indention
        $lines = explode( "\n", $buffer );
        $indentionLevel = 0;
        $cleanContent = array (); // storage for indented lines
        foreach ( $lines as $line ) {
            // continue loop on empty lines
            if (! $line) {
                continue;
            }

            // test for closing tags
            if (preg_match( $this->indentCloseTagsPattern, $line )) {
                $indentionLevel --;
            }

            // push content
            $cleanContent[] = str_repeat( $this->spaceCharacter, $indentionLevel ) . $line;

            // test for opening tags
            if (preg_match( $this->indentOpenTagsPattern, $line )) {
                $indentionLevel ++;
            }
        }

        // write indented lines back to buffer
        $buffer = implode( PHP_EOL, $cleanContent );

        // add blocks, which should not be processed
        $buffer = str_replace( array_keys( $this->tagsToIgnoreBlocks ), $this->tagsToIgnoreBlocks, $buffer );

        return $buffer;
    }
}

$BeautifyMyHtml = new \LBR\LbrService\BeautifyMyHtml();
$BeautifyMyHtml->setTrimTagsToIgnore( true );
//$BeautifyMyHtml->setRemoveComments(true);
echo $BeautifyMyHtml->beautify( file_get_contents( 'http://example.org' ) );
?>
answered 2014-02-27T15:52:19.813