
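Is there any function that can compute the distance between records in a data set with mixed attributes? For example, how would I compute the distance D = d1 - d2, where d1 = (100, TCP, 1480) and d2 = (200, ICMP, 1650)?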


2 Answers


In engineering and science we make use of dimensionless numbers to describe situations, and we use relevant characteristic scales to create those dimensionless numbers. For example, if you were examining turbulent fluid flow you might well be bewildered by the apparently numerous variables. But turbulent fluid flow is dominated by the interplay of momentum acting against viscosity. It can be shown that there are actually only a few key characteristic measures of a system, and the interplay can be expressed as a ratio. The ratio is dimensionless (it is called the Reynolds number). A large value means turbulent flow, a low value means laminar (smooth) flow. This number is therefore a kind of distance function, indicating how distant we are from imperturbable smooth flow.

In relativity, distances in space and time can be expressed as a single distance by converting the time difference to a length by multiplying by the speed of light, then treating that length just like the 3 space dimensions, because the speed of light is a characteristic velocity scale for the situation.

So, you ought to use some domain knowledge to do likewise.
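As a minimal sketch of that idea, each numeric attribute can be divided by a characteristic scale so that the differences become dimensionless before they are combined. The function name `scaled_distance` and the scale values below are hypothetical placeholders chosen only for illustration; picking meaningful scales is exactly where the domain knowledge comes in.

```python
import math

# Hypothetical characteristic scales for the two numeric attributes.
# These values are assumptions for illustration, not recommendations.
SCALES = (100.0, 1500.0)

def scaled_distance(a, b, scales=SCALES):
    """Euclidean distance after dividing each numeric difference by its scale."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))

# Numeric parts of d1 and d2 from the question (the protocol is left out).
d1_numeric = (100, 1480)
d2_numeric = (200, 1650)
print(scaled_distance(d1_numeric, d2_numeric))
```

The protocol field is left out here because it has no magnitude that could be rescaled.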

However, you should also stop to ask yourself whether distance is even a meaningful concept. Distance is a measure on a proportional scale: we can speak meaningfully of one distance being twice another distance. If the attributes you are considering are not measured on proportional scales, to talk about distance is nonsense. I note that your data includes "TCP" and "ICMP", which are unordered, discrete values. Distance might simply be a meaningless concept for your data set.

answered 2014-04-24T07:44:57.083

If you happen to be using the horrible KDDCup 1999 data set, read this answer: https://stackoverflow.com/a/22522174/1060350 - the data set is useless, so stop using it.

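You can try distances such as Gower's distance. But most likely they will not be of any use for netflow data. You should try to incorporate domain knowledge: answer the question of when two network flows are similar, and then put that into an equation, instead of trying to find a magic equation.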
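For reference, Gower's distance can be sketched as follows, assuming equal weights for all attributes: a numeric attribute contributes its absolute difference divided by that attribute's range, a categorical attribute contributes 0 on a match and 1 otherwise, and the contributions are averaged. The function name `gower_distance` and the range values are assumptions made for this example; real ranges would have to be computed from the full data set and must be non-zero.

```python
def gower_distance(a, b, ranges):
    """Gower's distance between two mixed-type records (equal weights).

    ranges[i] is the value range (max - min) of attribute i over the data
    set for numeric attributes, or None for categorical attributes.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric: absolute difference normalised by the attribute range.
            total += abs(x - y) / r
    return total / len(ranges)

d1 = (100, "TCP", 1480)
d2 = (200, "ICMP", 1650)
# Assumed attribute ranges; in practice they come from the full data set.
ranges = (1000.0, None, 2000.0)
print(gower_distance(d1, d2, ranges))
```

The result always lies between 0 and 1 as long as every numeric value falls inside the stated range.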

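One reason why Gower or any other stock distance function does not work is that network data has very skewed distributions and usually no negative values. It simply is not really a Euclidean space.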

answered 2014-04-20T11:04:58.240