statistics - 学习/检测日志中 URL 的可变部分

Question

假设您有一个网络服务器日志（apache、nginx 等）。从中提取大量 URL：

/article/1/view
/article/2/view
/article/1/view
/article/1323/view
/article/1/edit
/help
/article/1/view
/contact
/contact/thank-you
/article/8/edit
...

或者

/blog/2012/06/01/how-i-will-spend-my-summer-vacation
/blog/2012/08/30/how-i-wasted-my-summer-vacation
...

你将这些 url 分解成它们的片段，这样你就有 ['article', '1323', 'view'] 或 ['blog', '2012', '08', '30', 'how-i-wasted-my -暑假']。

如何分析和比较这些 url 以检测和调用 url 路径中的“变量”。也就是说，您需要识别诸如/article/XXX/view、/article/XXX/edit和/blog/XXX/XXX/XXX/XXX之类的内容，以便您可以在日志中汇总有关这些行的信息。

我假设对于构成可变片段与外观相似但不同的模板的差异数量需要一些统计阈值。我也不确定什么样的数据结构可以让这个分析变得又快又容易。

我希望脚本的输出能够输出它认为服务器上存在的所有 url 模板，如果合适的话，可能带有一些置信度值。

score 2 · Accepted Answer

一个简单的解决方案是计算路径出现次数并了解哪些值对应于模板。假设该文件input包含您的第一个片段中的 URL。然后计算每个路径的访问：

awk -F '/' '{ for (i=2; i<=NF; ++i) { for (j=2; j<=i; ++j) printf "/%s", $j; printf "\n" }}' input \
    | sort \
    | uniq -c \
    | sort -rn

这产生：

7 /article
4 /article/1
3 /article/1/view
2 /contact
1 /help
1 /contact/thank-you
1 /article/8/edit
1 /article/8
1 /article/2/view
1 /article/2
1 /article/1323/view
1 /article/1323
1 /article/1/edit

现在每个路径都有一个权重，您可以将其输入到分数函数f(x, y)中，其中x表示计数，y表示路径的深度。例如，第一行将导致调用f(7,2)并可能返回[0,1]中的值，例如 0.8，以告诉您给定的参数化对应于 80% 的模板。当然，所有的魔法都发生在f中，您必须根据您看到的访问路径提出合理的值。要开发一个好的f，您可以在一些小数据集上使用逻辑回归，看看它是否能很好地预测作为模板的二元特征。

你也可以采取普通的方法：只掉尾巴，例如，所有值 <= 1。

score 1 · Accepted Answer

使用DAWG怎么样？除了节点不会存储字母，而是 URI 片段。像这样：

在此处输入图像描述

这是一个非常好的数据结构：它对内存的要求非常低，很容易遍历，而且，作为一个 DAG，有很多简单且经过充分研究的算法。它也恰好描述了接受样本中所有 URL 并拒绝所有其他 URL 的状态机（因此我们实际上可能会从中构建一个正则表达式，这非常简洁，但我不够聪明，不知道该怎么做从那里开始）。

无论如何，使用这样的结构，您的问题会转化为寻找“瓶颈”的问题。我猜想有合适的算法，但是有足够大的样本，其中变量变化很大，基本上是这样的：在一定深度的节点越多，它就越有可能是一个可变部分。

一种可能很幼稚的方法是这样的：为每个起始部分保留单独的 DAWG，我会找到 DAWG 的平均宽度（可能基于深度加权）。如果一个级别的宽度高于该平均值，我会认为它是一个变量，其概率取决于它与平均值的距离。在这一点上，您可能会很好地释放统计数据的力量。对宽度的分布进行建模。

This approach wouldn't fare well with independent patterns starting with the same part, like "shop/?/?" and "shop/admin/?/edit". This could be perhaps mitigated by examining the DAWG-s in a more dynamic fashion, using a sliding window of sorts, always examining only a part of the DAWG at once, but I don't know how. Oh and, the whole thing fails horribly if the very first part is a variable, but that's thankfully rare.

You may also look out for certain little things like all nodes of the same level having numerical values (more likely to be a variable), and I'd certainly check for common date patterns in the sample before building the DAWGs, factoring them out would make handling the blog-like patterns easier.

(Oh and, adding the "algorithm" tag would probably attract more attention to the question.)

statistics - 学习/检测日志中 URL 的可变部分

2 回答 2

Related

Reference