0

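I have a pile of HTML generated by a daemon using C, XML, and XSL. A PHP script then picks up the HTML markup and displays it on screen.

I have a large amount of XHTML 1 compliant markup. I need to modify every link in the markup to remove &utm_source=report&utm_medium=email&utm_campaign=report.

So far I've considered two options.

  1. Do a regex search in the PHP backend to strip out the Analytics code
  2. Write some jQuery to loop through the links and strip the Analytics code from each href.

Obstacles:

  1. The HTML can be huge, i.e. over 4 MB (I ran some tests; they average about 100 KB)
  2. It has to be fast. We get around 3K. Thoughts?

Right now I'm trying str_replace('&utm_source=report&utm_medium=email&utm_campaign=report','',$html);, but it isn't working.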

4

8 Answers

2

You could use sed or some other low-level tool to remove those parts:

find /path/to/dir -type f -name '*.html' -exec sed -i 's/&utm_source=report&utm_medium=email&utm_campaign=report//g' {} \;

Be careful, though: this removes the string wherever it occurs, not just in URLs.
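If touching only URLs matters, one option is to limit the replacement to href values. The following is a minimal PHP sketch, not a tested solution for the real markup: the function name strip_tracking_in_hrefs is hypothetical, and it assumes every href is double-quoted and that the tracking suffix may appear either with a raw "&" or escaped as "&amp;".

```php
<?php
// Hypothetical helper: removes the fixed tracking suffix, but only inside
// double-quoted href="..." attribute values.
function strip_tracking_in_hrefs(string $html): string
{
    // Assumption: the suffix may be written with "&amp;" (valid XHTML) or "&".
    $needles = [
        '&amp;utm_source=report&amp;utm_medium=email&amp;utm_campaign=report',
        '&utm_source=report&utm_medium=email&utm_campaign=report',
    ];

    $result = preg_replace_callback(
        '/\bhref="([^"]*)"/i',
        function (array $m) use ($needles): string {
            // $m[1] is the attribute value without the surrounding quotes.
            return 'href="' . str_replace($needles, '', $m[1]) . '"';
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}
```

Text outside href attributes is left alone, so the same string appearing in body copy survives.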

answered 2009-06-04T15:33:50.773
1

If the string is always the same, the fastest PHP function I've found for that is strtr.

PHP strtr

string strtr ( string $str , string $from , string $to )
string strtr ( string $str , array $replace_pairs )

$html = strtr($html, array('&utm_source=report&utm_medium=email&utm_campaign=report' => ''));

Obviously you'll need to benchmark the speed, but that should be up there.
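A rough way to run that benchmark is sketched below. The helper time_it and the synthetic document are assumptions made for illustration; real numbers depend on the actual markup, the PHP version, and the machine, so treat the output only as a relative comparison.

```php
<?php
// Hypothetical helper: calls $fn $runs times and returns the elapsed seconds.
function time_it(callable $fn, int $runs = 50): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$tracking = '&utm_source=report&utm_medium=email&utm_campaign=report';

// Synthetic document of roughly 100 KB, standing in for the real markup.
$link = '<a href="http://example.com/?id=1' . $tracking . '">x</a>';
$html = str_repeat($link . str_repeat('<p>filler text</p>', 10), 400);

$viaStrReplace = function () use ($html, $tracking) {
    return str_replace($tracking, '', $html);
};
$viaStrtr = function () use ($html, $tracking) {
    return strtr($html, [$tracking => '']);
};

printf("str_replace: %.4fs\n", time_it($viaStrReplace));
printf("strtr:       %.4fs\n", time_it($viaStrtr));
```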

answered 2009-06-04T15:30:25.320
0

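With a chunk of HTML that large, I'd hand it off to an external process, probably a Perl script.

I'm not certain, since I've never tried parsing anywhere near this much text, but I'd bet PHP won't do it very quickly.

What's your expected load? How often do you need to do this kind of processing? It sounds like something you'd run as a batch operation, and in my admittedly limited experience with this kind of task, it doesn't need to be blazingly fast, just fast enough to finish in a reasonable time (i.e. you aren't waiting overnight for it or anything).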

answered 2009-06-04T15:26:40.087
0

Regex is one way. Alternatively, you could use XPath to find all the links in the document and then work on each of them in a loop. Since this is an XHTML document, and assuming it is well formed, this approach seems reasonable.
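A minimal sketch of that XPath approach in PHP follows. The function name strip_utm_with_xpath is hypothetical, and the code assumes the markup parses as XML (well formed, no named entities such as &nbsp; that would need the DTD).

```php
<?php
// Hypothetical helper: removes the fixed tracking suffix from every <a href>
// in a well-formed XHTML string.
function strip_utm_with_xpath(string $xhtml): string
{
    $tracking = '&utm_source=report&utm_medium=email&utm_campaign=report';

    $dom = new DOMDocument();
    $dom->loadXML($xhtml);

    $xpath = new DOMXPath($dom);
    // local-name() matches <a> whether or not the XHTML namespace is declared.
    foreach ($xpath->query('//*[local-name()="a"][@href]') as $link) {
        // getAttribute() returns the decoded value, so "&amp;" reads as "&".
        $href = str_replace($tracking, '', $link->getAttribute('href'));
        // setAttribute() stores plain text; "&" is re-escaped when saving.
        $link->setAttribute('href', $href);
    }

    // Serializes only the root element, so the XML declaration and any
    // DOCTYPE are not part of the returned string.
    return $dom->saveXML($dom->documentElement);
}
```

Because the DOM hands back decoded attribute values, only the raw "&" form of the string has to be matched here.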

answered 2009-06-04T15:27:51.207
0

PHP's preg_replace() (which replaces every match by default) will do this quite fast if you run it in CGI mode in the backend. Why not use a cron job to run a PHP script periodically and process all your HTML files? Then your front-end PHP script only has to send the already-processed content to the browser, without doing any computation.

answered 2009-06-04T15:33:34.880
0

I eventually resorted to str_replace, replacing the string throughout the entire contents of the document :(.
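One possible reason a plain str_replace finds nothing in XHTML is that valid markup writes each "&" inside an attribute as "&amp;", so the raw string never occurs. That is an assumption about the markup, not something confirmed in this thread. A minimal sketch that covers both spellings, with a hypothetical function name strip_tracking:

```php
<?php
// Hypothetical helper: removes the fixed tracking suffix from the whole
// document, whether it is written with "&" or with "&amp;".
function strip_tracking(string $html): string
{
    $raw = '&utm_source=report&utm_medium=email&utm_campaign=report';
    // htmlspecialchars() turns each "&" into "&amp;", giving the escaped form.
    $escaped = htmlspecialchars($raw);

    return str_replace([$escaped, $raw], '', $html);
}

$html = '<a href="http://example.com/r?id=7&amp;utm_source=report'
      . '&amp;utm_medium=email&amp;utm_campaign=report">report</a>';

echo strip_tracking($html), "\n";
// prints: <a href="http://example.com/r?id=7">report</a>
```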

answered 2009-06-04T20:25:14.460
0

I encountered this problem a couple of years ago and came up with the following regex to remove any instances of those utm variables from URLs:

/(\?|\&)?utm_[a-z]+=[^\&]+/

An example usage:

preg_replace('/(\?|\&)?utm_[a-z]+=[^\&]+/', '', 'http://mashable.com/2010/12/14/android-quick-start-guide/?utm_source=twitterfeed&utm_medium=twitter&utm_campaign=Feed%3A+Mashable+%28Mashable%29');

I blogged about the experience here
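One caveat with the regex above: when a utm parameter comes first and other parameters follow, the leading "?" is removed along with it, so http://example.com/?utm_source=a&id=1 becomes http://example.com/&id=1. A sketch that splits the query string instead is shown below. The function name strip_utm_params is hypothetical, and it expects a decoded URL (raw "&", not "&amp;").

```php
<?php
// Hypothetical helper: drops every utm_* parameter from a single URL while
// keeping the other parameters and any #fragment intact.
function strip_utm_params(string $url): string
{
    // Set the fragment aside so it is not mistaken for part of the query.
    $fragment = '';
    $hash = strpos($url, '#');
    if ($hash !== false) {
        $fragment = substr($url, $hash);
        $url = substr($url, 0, $hash);
    }

    $parts = explode('?', $url, 2);
    if (count($parts) < 2) {
        return $url . $fragment; // no query string at all
    }

    // Keep each name=value pair whose name does not start with "utm_".
    $kept = [];
    foreach (explode('&', $parts[1]) as $pair) {
        if ($pair !== '' && strncmp($pair, 'utm_', 4) !== 0) {
            $kept[] = $pair;
        }
    }

    $query = implode('&', $kept);
    return $parts[0] . ($query === '' ? '' : '?' . $query) . $fragment;
}
```

The pairs are copied through as-is rather than decoded and rebuilt, so the encoding of the remaining parameters is not altered.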

answered 2012-11-22T03:41:47.687
-1

Not really a regex, but it may help you (untested):

$source = '...'; // your XHTML markup goes here

$dom = new DOMDocument();
$dom->loadXML($source);

$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    if (!$link->hasAttribute('href')) {
        continue; // named anchors have nothing to clean
    }

    // split off the query string; array_pad() covers hrefs without a "?"
    // (note: a #fragment after the query is not handled separately here)
    list($base, $queryString) = array_pad(explode('?', $link->getAttribute('href'), 2), 2, '');

    // read GET parameters into an array (second argument is filled by reference)
    parse_str($queryString, $params);

    // get rid of unwanted GET params
    unset($params['utm_source'], $params['utm_medium'], $params['utm_campaign']);

    // recompose the query string with a plain "&"; the DOM escapes it to
    // "&amp;" on output, so passing "&amp;" here would double-escape it
    $queryString = http_build_query($params, '', '&');

    // assign the newly cleaned href attribute
    $link->setAttribute('href', $queryString === '' ? $base : $base . '?' . $queryString);
}

$html = $dom->saveXML();

// strip the XML declaration; it puts IE in quirks mode
$html = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $html);
$html = trim($html);

echo $html;
answered 2009-06-04T15:27:49.413