0

我正在通过 Amazon Web Services 获取内容(例如产品描述)。由于来自亚马逊的内容通常标记得很差,最终会弄乱我的网页布局。所以,我想出了一个使用 HTML Tidy 来“清理”内容的功能。

奇怪的是,当我将它与我的应用程序分开测试时,一切似乎都运行良好。但在我的应用程序(在 CodeIgniter 上运行)中,该函数似乎返回奇数字符。

下面的代码是我的测试脚本。它输出我认为我需要的东西。

在我的应用程序中,我从数据库中获取描述,对其进行清理,然后将其显示在我的网页上。例如,在清理之后,document’s(您可以在下面的示例中看到这个词)变为document’s(再次,仅在实际应用中;不在测试代码中。两个功能是相同的)。

任何想法为什么?这是我的测试功能:

    $amazon_content = <<<AMAZON
JavaScript is the brains of your Web page—it enables you to modify a document’s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br><br>Today’s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br><br>Scriptin' with JavaScript and Ajax will teach you how to:<br><ul><li>Start developing with JavaScript fast!</li></ul><ul><li>Write lightweight but powerful object-oriented code </li></ul><ul><li>Modify the Document Object Model </li></ul><ul><li>“Progressively enhance” your pages with JavaScript to provide the highest levels of accessibility to all users</li></ul><ul><li>Learn sophisticated techniques for making your pages respond to user actions</li></ul><ul><li>Use the downloadable Scriptin’ library of helper functions to speed development and ensure cross-browser compatibility</li></ul><ul><li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li></ul><ul><li>Create powerful interface interactions, such as sliding panels and tree menus</li></ul><ul><li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li></ul><ul><li>Build an online application that looks and responds like a regular desktop application</li></ul><ul><li>Easily adapt the Scriptin’ code examples for use in your own projects—download them at www.scriptinwithajax.com</li></ul><br>
AMAZON;

    echo '<textarea cols="150" rows="12">' . $amazon_content . '</textarea>';
    echo '<textarea cols="150" rows="12">' . get_sanitized_amazon_content($amazon_content) . '</textarea>';
    echo  get_sanitized_amazon_content($amazon_content);

    function get_sanitized_amazon_content($amazon_content)
    {
        $tidy_config             = array(
            'bare' => TRUE,
            'clean' => TRUE,
            'drop-empty-paras' => TRUE,
            'drop-font-tags' => TRUE,
            'drop-proprietary-attributes' => TRUE,
            'enclose-text' => TRUE,
            'fix-backslash' => TRUE,
            'fix-bad-comments' => TRUE,
            'fix-uri' => TRUE,
            'hide-comments' => TRUE,
            'hide-endtags' => TRUE,
            'logical-emphasis' => TRUE,
            'lower-literals' => TRUE,
            'merge-divs' => TRUE,
            'output-xhtml' => TRUE,
            'quote-ampersand' => TRUE,
            'quote-marks' => TRUE,
            'show-body-only' => TRUE,
            'word-2000' => TRUE
        );
        $tidy                    = new tidy();
        $sanitized_amazon_markup = $tidy->repairString($amazon_content, $tidy_config);

        // Replace carriage returns, line feeds, tabs with single space
        $sanitized_amazon_markup = preg_replace('/\r|\n|\t/', ' ', $sanitized_amazon_markup);

        // Removes unnecessary tags
        // TODO: get complete list; put in an array
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'div');
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'span');

        // Replace double spaces with single space
        $sanitized_amazon_markup = preg_replace('/ {2,}/i', ' ', $sanitized_amazon_markup);

        // Remove leading and trailing space
        $sanitized_amazon_markup = trim($sanitized_amazon_markup);

        return $sanitized_amazon_markup;
    }

    function strip_tag($tagged_content, $tag_name)
    {
        return preg_replace('%<[ \t\r\n]*/?[ \t\r\n]*' . $tag_name . '.*?>%i', '', $tagged_content);
    }

更新:

这是我在我的应用程序中得到的:

<p>JavaScript is the brains of your Web page&acirc;&euro;&quot;it enables you to modify a document&acirc;&euro;&trade;s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin&#39; with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today&acirc;&euro;&trade;s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin&#39; with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>&acirc;&euro;&oelig;Progressively enhance&acirc;&euro; your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin&acirc;&euro;&trade; library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin&acirc;&euro;&trade; code examples for use in your own projects&acirc;&euro;&quot;download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>

这是我在应用程序之外得到的:

<p>JavaScript is the brains of your Web page-it enables you to modify a document's structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today's application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin' with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>"Progressively enhance" your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin' library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin' code examples for use in your own projects-download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>
4

1 回答 1

3

-page”和“it”之间的不是一个简单的减号(ascii 0x2d),而是一个长破折号(特别是U+2014 em dash)。用 UTF-8 编码,它是一个三字节序列:0xe2 0x80 0x94。

如果您在Windows-1252 encoding中解释该序列,则会为您提供:

0xe2 => â => &acirc;
0x80 => € => &euro;
0x94 => (some variant of) double quote => &quot;

所以你有一个编码问题。您将 UTF-8 作为输入,但将其解释为 Windows-1252。您的整理是将非 ASCII7 部分转换为 HTML 实体,就像它应该的那样。

至于为什么这发生在您的应用程序内部而不是外部,有几种可能性。一是您在外部和内部没有相同的语言环境/编码配置。另一个是当你在你的应用程序之外进行测试时,你得到的数据并不是来自网络的——即你得到的编码是不同的(可能改变了)。

于 2011-03-06T10:58:16.433 回答