php - Efficiently parsing Apache logs in PHP

Question

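OK, here is the scenario: I need to parse my logs to find out how many times image thumbnails were downloaded without the "large image" page actually being viewed. This is basically a hotlink protection system based on the ratio of "thumb" views to "full" image views.

Since the server is constantly bombarded with requests for thumbnails, the most efficient solution seems to be to use buffered Apache logs, flushed to disk once every 1 MB, and then parse the logs periodically.

My question is: how do I parse an Apache log in PHP and save the data, given that the following holds:

The log is in use and being updated in real time, and my PHP script needs to be able to read it while that is happening
The PHP script must "remember" which parts of the log it has already read, so that it does not read the same part twice and skew the data
Memory consumption should be minimal, since the log can easily reach 10 GB of data within a few hours

The PHP logger script will be called every 60 seconds and will process as many log lines as it can in that time.

I have tried hacking some code together, but I am having trouble keeping memory use low and finding a way to track the pointer while the file size keeps "moving".

Here is part of the log: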

212.180.168.244 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441268.jpg HTTP/1.1" 200 3072 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
122.53.168.123 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441276.jpg HTTP/1.1" 200 3007 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
143.22.203.211 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441282.jpg HTTP/1.1" 200 4670 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"

Here is the code, attached for your review:

<?php
//limit for running it every minute
error_reporting(E_ALL);
ini_set('display_errors',1);
set_time_limit(0);
include(dirname(__FILE__).'/../kframework/kcore.class.php');
$aj = new kajaxpage;
$aj->use_db=1;
$aj->init();
$db=kdbhandler::getInstance();
$d=kdebug::getInstance();
$d->debug=TRUE;
$d->verbose=TRUE;

$log_file = "/var/log/nginx/access.log"; //full path to log file when run by cron
$pid_file = dirname(__FILE__)."/../kframework/cron/cron_log.pid";
//$images_id = array("8308086", "7485151", "6666231", "8343336");

if (file_exists($pid_file)) {
    $pid = file_get_contents($pid_file);
    $temp = explode(" ", $pid);
    $pid_timestamp = $temp[0];
    $now_timestamp = strtotime("now");
    //if (($now_timestamp - $pid_timestamp) < 90) return;
    $pointer = $temp[1];
    if ($pointer > filesize($log_file)) $pointer = 0;
}
else $pointer = 0;

$pattern = "/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[^\[]*\[([^\]]*)\][^\"]*\"([^\"]*)\"\s([0-9]*)\s([0-9]*)(.*)/";
$last_time = 0;
$lines_processed=0;

if ($fp = fopen($log_file, "r")) { // read-only: the script never writes to the log
    fseek($fp, $pointer);
    while (!feof($fp)) {
        //if ($lines_processed>100) exit;
        $lines_processed++;
        $log_line = trim(fgets($fp));
        if (!empty($log_line)) {
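            // Reset per-line state so a line that is neither thumb nor raw does not reuse old values
            $type = null;
            $imgid = 0;
            if (!preg_match_all($pattern, $log_line, $matches)) continue; // skip lines that do not match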
            //print_r($matches);
            $size = $matches[5][0];
            $matches[3][0] = str_replace("GET ", "", $matches[3][0]);
            $matches[3][0] = str_replace("HTTP/1.1", "", $matches[3][0]);
            $matches[3][0] = str_replace(".jpg/", ".jpg", $matches[3][0]);
            if (substr($matches[3][0],0,3) == "/t/") {
                $segments = explode("/", $matches[3][0]); // end() needs a real variable (it takes a reference)
                $get = explode("-", end($segments));
                $imgid = $get[0];
                $type='thumb';
            }
            elseif (substr($matches[3][0], 0, 5) == "/img/") {
                $get1 = explode("/", $matches[3][0]);
                $get2 = explode("-", $get1[2]);
                $imgid = $get2[0];
                $type='raw';
            }
            echo $matches[3][0];
            // put here your sql insert or update
            $imgid=(int) $imgid;
            if (isset($type) && $imgid!=1) {
                switch ($type) {
                    case 'thumb':
                        //use the second slave in the registry
                        $sql=$db->slave_query("INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1 ",2);
                        echo "INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1";
                    break;
                    case 'raw':
                        //use the second slave in the registry
                        $sql=$db->slave_query("INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1",2);
                        echo "INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1";
                    break;
                }
            }

            // $imgid - image ID
            // $size - image size

            $timestamp = strtotime("now");
            if (($timestamp - $last_time) > 30) {
                file_put_contents($pid_file, $timestamp . " " . ftell($fp));
                $last_time = $timestamp;
            }
        }
    }
    file_put_contents($pid_file, (strtotime("now") - 95) . " " . ftell($fp));
    fclose($fp);
}

?>

score 1 · Accepted Answer

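Maybe you could adapt my PHP version of tail to search for your last timestamp instead of counting lines, and then read the lines from that point on, processing them one by one?

Tail processing for large files

I would try it myself since I am a bit curious, but unfortunately I can't do so right now :(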
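
For illustration, here is a minimal sketch of the "resume where you left off" idea. Note that it is not the tail script mentioned above, and it swaps the timestamp search for a saved byte offset. The function name `processNewLines`, the state file, and the `$maxSeconds` budget are all assumptions made for this example. It reads one line at a time (so memory stays small), and it leaves a trailing partial line for the next run, since the server may still be writing it.

```php
<?php
/**
 * Hypothetical helper: pass every complete new line of $logFile to $handler,
 * starting at the byte offset stored in $stateFile. Returns the line count.
 */
function processNewLines(string $logFile, string $stateFile, callable $handler, int $maxSeconds = 50): int
{
    clearstatcache(); // filesize() results are cached between calls
    $offset = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

    $size = filesize($logFile);
    if ($size === false) {
        return 0;
    }
    if ($offset > $size) {
        $offset = 0; // the file shrank: assume it was rotated or truncated
    }

    $fp = fopen($logFile, 'rb');
    if ($fp === false) {
        return 0;
    }
    fseek($fp, $offset);

    $deadline = time() + $maxSeconds;
    $count = 0;
    while (($line = fgets($fp)) !== false) {
        if (substr($line, -1) !== "\n") {
            break; // incomplete last line: do not advance, retry on the next run
        }
        $handler(rtrim($line, "\r\n"));
        $offset = ftell($fp); // only advance past fully processed lines
        $count++;
        if (time() >= $deadline) {
            break; // leave the rest for the next cron invocation
        }
    }
    fclose($fp);

    file_put_contents($stateFile, (string) $offset, LOCK_EX);
    return $count;
}
```

Saving the offset only after a line has been handled means a crash can at worst cause one line to be processed again, rather than lines being skipped.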

score 0 · Accepted Answer

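One solution is to store the logs in a MySQL database. Perhaps you could write a C program that parses the log file and stores it in MySQL. It would be an order of magnitude faster, and it is not very difficult. Another option is to use Python, but I think using a database is necessary. You can use a full-text index to match your strings. Python can also be compiled into a binary, which makes it more efficient. As requested: the log file grows incrementally. It is not as if you are handed 10 GB all at once.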

score 0 · Accepted Answer

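I know this answer is late, but it may still help (code can always be improved).

The 10 GB file size and the memory required sound like your main problems. Apache does support multiple log files, and the real power of multiple log files comes from the ability to create log files in different formats: http://httpd.apache.org/docs/1.3/multilogs.html

Create a second log file that contains only the minimum data you need for real-time log monitoring. In this case, you could probably start by stripping things like the user-agent string out of the log.

Based on your sample log lines, this could roughly halve the amount of data PHP has to load.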
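
As a rough sketch of what this could look like: assume the second log is written with a reduced format such as `LogFormat "%h %U" thumbs` plus a matching `CustomLog` line (the nickname `thumbs` is made up here), so that each line is just the client IP and the request path. Parsing then needs no large regular expression. The function name `classifyRequest` and the URL layouts (`/t/.../<id>.jpg` for thumbnails, `/img/<id>-...` for full views) are assumptions inferred from the question's code, not confirmed facts.

```php
<?php
/**
 * Hypothetical parser for a reduced "IP PATH" log line.
 * Returns ['ip' => ..., 'type' => 'thumb'|'raw', 'id' => int], or null
 * when the line is not a thumbnail or full-image request.
 */
function classifyRequest(string $line): ?array
{
    $parts = explode(' ', trim($line), 2);
    if (count($parts) !== 2) {
        return null; // malformed line
    }
    [$ip, $path] = $parts;

    // Assumed thumbnail layout: /t/<dirs...>/<imageid>.jpg
    if (preg_match('#^/t/(?:[^/]+/)*(\d+)[^/]*$#', $path, $m)) {
        return ['ip' => $ip, 'type' => 'thumb', 'id' => (int) $m[1]];
    }
    // Assumed full-view layout: /img/<imageid>-<anything>
    if (preg_match('#^/img/(\d+)#', $path, $m)) {
        return ['ip' => $ip, 'type' => 'raw', 'id' => (int) $m[1]];
    }
    return null;
}
```

Lines that match neither layout return `null`, so the caller can skip them without carrying state over from the previous line.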

score 0 · Accepted Answer

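Personally, I would send the log entries to a running script. Apache allows this if you start the log's file name with a pipe (|). If that doesn't work, you can also create a FIFO (see mkfifo).

The running script (whatever it is) can buffer x lines and do whatever it needs to do based on that. Reading the data is not that hard and shouldn't be where your bottleneck is.

I do suspect you will run into problems with the INSERT statements on your database.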
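
A minimal sketch of such a running script in PHP, assuming Apache is configured with something like `CustomLog "|/usr/bin/php /path/to/logger.php" combined` (both paths are placeholders). The function name `readInBatches` and the batch size are illustrative choices, not part of any Apache or PHP API.

```php
<?php
/**
 * Hypothetical helper: read lines from an open stream and hand them to
 * $onBatch in arrays of at most $batchSize lines.
 */
function readInBatches($stream, int $batchSize, callable $onBatch): void
{
    $buffer = [];
    while (($line = fgets($stream)) !== false) {
        $buffer[] = rtrim($line, "\r\n");
        if (count($buffer) >= $batchSize) {
            $onBatch($buffer);
            $buffer = []; // release the processed lines
        }
    }
    if ($buffer !== []) {
        $onBatch($buffer); // flush what is left when the stream closes
    }
}

// Usage when started by Apache as a piped logger (STDIN receives the log lines):
// readInBatches(STDIN, 1000, function (array $lines): void { /* parse and store */ });
```

Only one batch is held in memory at a time, so memory use depends on the batch size rather than on the size of the log.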
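
One possible way to reduce that pressure, sketched here as an assumption rather than a tested fix: count views per image in memory and send a single multi-row statement per batch instead of one INSERT per log line. The table and column names (`hotlink`, `imageid`, `thumbviews`, `rawviews`) come from the question's code. The function name `buildHotlinkUpsert` and the shape of `$counts` are made up for this example, and it assumes `imageid` is a unique key.

```php
<?php
/**
 * Hypothetical helper: build one upsert for many images.
 * $counts maps imageid => ['thumb' => int, 'raw' => int] (either key may be absent).
 * Returns null when there is nothing to write.
 */
function buildHotlinkUpsert(array $counts): ?string
{
    if ($counts === []) {
        return null;
    }
    $rows = [];
    foreach ($counts as $imageId => $c) {
        // %d forces integers, so no user-controlled text reaches the SQL string
        $rows[] = sprintf('(%d, %d, %d)', $imageId, $c['thumb'] ?? 0, $c['raw'] ?? 0);
    }
    return 'INSERT INTO hotlink (imageid, thumbviews, rawviews) VALUES '
        . implode(', ', $rows)
        . ' ON DUPLICATE KEY UPDATE thumbviews = thumbviews + VALUES(thumbviews),'
        . ' rawviews = rawviews + VALUES(rawviews)';
}
```

With this shape, a batch of thousands of log lines that touch a few hundred images becomes one statement with a few hundred rows.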
