machine-learning - WEKA: How to distinguish between "missing" and "not applicable" numeric data?

Question

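I am new to WEKA.

In my dataset, I have an attribute of numeric type. Certain values of this attribute are used to represent "missing value" and "not applicable".

For example:

0 - missing value
99999 - not applicable

For "missing value" I can use "?" to represent it, but what about "not applicable"?

My questions are: 1) How can we tell WEKA not to include "not applicable" values when calculating the mean or standard deviation? 2) How do "not applicable" values affect the classification result?

Thank you.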

score 0 · Accepted Answer

This might actually be a question better suited to stats.stackexchange.com, though I acknowledge that it is WEKA-specific. There might be models in WEKA that handle missing values well. I don't know WEKA, but there might be decision tree implementations that handle this gracefully for you.

However, you might want to make a couple of more basic considerations first, as missing feature values are a difficult problem. Any automatic functionality in WEKA would have to make these considerations anyway, so it is probably better to make them beforehand using your domain knowledge.

'Not Applicable' is one of the ways for the feature to be missing. So there may or may not be a distinction between 'missing' and 'not applicable', depending upon your dataset. In calling a value "missing", you are merely saying you do not have the value. Why is it missing?

There are many potential causes of missingness in a feature, some more detrimental than others. In this situation there are mainly three options:

Delete all records which have a missing value
Remove any feature that has a missing value
Replace any missing value with some "guess" at what the value should be. This is called imputation.
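
The three options above can be sketched on a toy dataset. This is only an illustration in plain Python, not WEKA code: the rows, the feature names, and the helper functions are all hypothetical, and mean imputation is just one possible "guess" among many.

```python
from statistics import mean

# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 30, "income": 40000},
    {"age": 45, "income": None},
    {"age": 52, "income": 61000},
]

def drop_records(rows, feature):
    """Option 1: delete every record in which `feature` is missing."""
    return [r for r in rows if r[feature] is not None]

def drop_feature(rows, feature):
    """Option 2: remove `feature` from every record."""
    return [{k: v for k, v in r.items() if k != feature} for r in rows]

def impute_mean(rows, feature):
    """Option 3: replace missing values with the mean of the observed ones."""
    observed = [r[feature] for r in rows if r[feature] is not None]
    fill = mean(observed)
    return [
        {**r, feature: fill if r[feature] is None else r[feature]}
        for r in rows
    ]

print(drop_records(rows, "income"))
print(drop_feature(rows, "income"))
print(impute_mean(rows, "income"))
```

Each helper returns a new list and leaves `rows` unchanged, so the three strategies can be compared side by side on the same data.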

The most conservative and safest choice clearly is to simply drop the feature. In doing this, it would be useful to create an extra indicator feature that simply records whether or not the original feature was missing. This information might be useful in fitting a good model.
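
A minimal sketch of dropping a feature while keeping an indicator for it, again in plain Python rather than WEKA. The function name, the `_missing` suffix, and the sample rows are assumptions made for illustration.

```python
def drop_with_indicator(rows, feature):
    """Drop `feature`, but keep a 0/1 flag recording whether it was missing."""
    flag = feature + "_missing"  # hypothetical name for the indicator feature
    result = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != feature}
        new_row[flag] = 1 if r[feature] is None else 0
        result.append(new_row)
    return result

# Hypothetical toy data: None marks a missing value.
rows = [{"age": 30, "income": 40000}, {"age": 45, "income": None}]
flagged = drop_with_indicator(rows, "income")
print(flagged)
```

The model never sees the original values, but it can still learn from the pattern of missingness if that pattern carries information.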

In choosing which one of these three approaches to take, there are a couple of things to consider.

Do you know for sure that 99999 is generated by an explicit NA decision, and not by the same mechanism as 0? By what mechanism are the zeros generated, since you merely describe them as "missing"?
How common are the feature values that indicate a missing value? The more missing feature values there are, the riskier case deletion or feature imputation becomes.
If you believe there is value in imputation, can your domain knowledge help you in choosing suitable values? For instance, if a value is entered only when it deviates from some expected level (let's say high blood pressure), and left blank when it lies at the expected level, imputing the expected level in the missing cases would be reasonable.
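
To make these considerations concrete, the sketch below counts how often each code from the question (0 and 99999) occurs in a column and computes the mean and standard deviation over the remaining values only. It is plain Python, not WEKA code, and the column values are made up. Writing both codes as "?" in the exported data is an assumption about how one might proceed: it makes the two cases indistinguishable in that attribute, so the distinction would have to be kept in a separate indicator attribute if it matters. It also assumes that 0 never occurs as a genuine measurement.

```python
from statistics import mean, stdev

MISSING_CODE = 0   # code from the question: value is missing
NA_CODE = 99999    # code from the question: not applicable

# Hypothetical column of a numeric attribute.
values = [12, 0, 99999, 15, 18, 99999, 0, 15]

def split_codes(values):
    """Separate real observations from the two special codes."""
    observed = [v for v in values if v not in (MISSING_CODE, NA_CODE)]
    counts = {
        "missing": values.count(MISSING_CODE),
        "not_applicable": values.count(NA_CODE),
    }
    return observed, counts

def to_arff_token(v):
    """Write both codes as '?', the missing-value marker used in the question."""
    return "?" if v in (MISSING_CODE, NA_CODE) else str(v)

observed, counts = split_codes(values)
share_unusable = (counts["missing"] + counts["not_applicable"]) / len(values)
col_mean = mean(observed)   # statistics over real observations only
col_std = stdev(observed)   # sample standard deviation

print(counts, share_unusable, col_mean, col_std)
print(",".join(to_arff_token(v) for v in values))
```

If the share of unusable values is large, as it is in this made-up column, that is a signal to be cautious with both case deletion and imputation.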
