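I am new to WEKA.
In my dataset, I have an attribute of type numeric. Within the data, specific values are used to represent "missing value" and "not applicable".
For example:
0 - missing value
99999 - not applicable
For "missing value", I can use "?" to represent it, but what about "not applicable"?
My questions are: 1) How can we tell WEKA not to include the "not applicable" values when computing the mean or standard deviation? 2) How do the "not applicable" values affect the classification result?
Thank you.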
This might actually be a question better suited for stats.stackexchange.com, though I acknowledge that it is WEKA-specific. There might be models in WEKA that handle missing values well. I don't know WEKA, but there might be decision tree implementations that handle this gracefully for you.
However, you might want to work through a couple of more basic considerations first, as missing feature values are a difficult problem. Any automatic functionality in WEKA would have to make these same decisions anyway, so it is probably better to make them beforehand using your domain knowledge.
'Not Applicable' is one of the ways for the feature to be missing. So there may or may not be a distinction between 'missing' and 'not applicable', depending upon your dataset. In calling a value "missing", you are merely saying you do not have the value. Why is it missing?
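Whatever the answer turns out to be, the numeric codes themselves should not reach any statistic. Here is a minimal sketch in plain Python (not WEKA) of why: it assumes the codes from the question (0 for missing, 99999 for not applicable), uses a made-up column, and the helper names `recode_sentinels` and `mean_and_stdev` are purely illustrative.

```python
import statistics

# Assumed sentinel codes, taken from the question.
MISSING_CODE = 0
NOT_APPLICABLE_CODE = 99999


def recode_sentinels(values):
    """Replace both sentinel codes with None so they cannot enter a statistic."""
    return [None if v in (MISSING_CODE, NOT_APPLICABLE_CODE) else v for v in values]


def mean_and_stdev(values):
    """Mean and sample standard deviation over the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return statistics.mean(observed), statistics.stdev(observed)


# Hypothetical column: four real measurements plus three sentinel codes.
raw = [12, 0, 15, 99999, 18, 99999, 15]

clean = recode_sentinels(raw)
naive_mean = statistics.mean(raw)            # badly inflated by the 99999 codes
clean_mean, clean_stdev = mean_and_stdev(clean)

print(naive_mean, clean_mean, clean_stdev)
```

Once recoded this way, both kinds of value could be written as "?" in the data file, at the cost of losing the distinction between them. Whether that loss matters depends on why the values are absent.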
There are many potential causes of missingness in a feature, some more detrimental than others. In this situation there are mainly three options:
The most conservative and safest choice is clearly to drop the feature. When doing this, it is useful to create an extra indicator feature that records whether or not the original feature was missing. This information might be useful in fitting a good model.
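As a rough illustration of the drop-plus-indicator idea, here is a small plain-Python sketch (again not WEKA). It assumes missing values have already been recoded to `None`; the function name `drop_with_indicator`, the `_was_missing` suffix, and the example rows are all hypothetical.

```python
def drop_with_indicator(rows, feature):
    """Drop `feature` from every row and add a 0/1 `<feature>_was_missing` column."""
    result = []
    for row in rows:
        # Copy every column except the one being dropped.
        new_row = {name: value for name, value in row.items() if name != feature}
        # Record whether the dropped value was absent in this row.
        new_row[feature + "_was_missing"] = 1 if row.get(feature) is None else 0
        result.append(new_row)
    return result


# Hypothetical rows; None marks a value that was missing or not applicable.
rows = [
    {"age": 34, "income": 52000},
    {"age": 51, "income": None},
    {"age": 29, "income": 41000},
]

result = drop_with_indicator(rows, "income")
print(result)
```

If "missing" and "not applicable" mean different things in your data, the same pattern could produce two separate indicator columns instead of one.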
In choosing which one of these three approaches to take, there are a couple of things to consider.