r - 如何识别是否有任何分类变量应该编码为data.frame中的因素？

Question

我有以下data.frame。如何识别是否有任何分类变量应编码为 data.frame 中的因素？

  YEAR  PBE  CBE  PPO  CPO  PFO DINC  CFO RDINC RFP
1  1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9  68.5 877
2  1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1  69.6 899
3  1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9  70.2 883
4  1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9  71.9 884
5  1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1  75.2 895
6  1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7  68.3 874
7  1931 72.1 47.9 57.0 67.4 51.4 41.5 90.0  64.0 791
8  1932 79.0 46.0 49.5 69.7 42.8 31.4 87.8  53.9 733
9  1933 73.1 50.8 47.3 68.7 41.6 29.4 88.0  53.2 752
10 1934 70.2 55.2 56.6 62.2 46.4 33.2 89.1  58.0 811
11 1935 82.2 52.2 73.9 47.7 49.7 37.0 87.3  63.2 847
12 1936 68.4 57.3 64.4 54.4 50.1 41.8 90.5  70.5 845
13 1937 73.0 54.4 62.2 55.0 52.1 44.5 90.4  72.5 849
14 1938 70.2 53.6 59.9 57.4 48.4 40.8 90.6  67.8 803
15 1939 67.8 53.9 51.0 63.9 47.1 43.5 93.8  73.2 793
16 1940 63.4 54.2 41.5 72.4 47.8 46.5 95.5  77.6 798
17 1941 56.0 60.0 43.9 67.4 52.2 56.3 97.5  89.5 830

这是一个正确的答案吗？

是的！factor(beef$PBE) 有 14 个级别，factor(beef$PPO) 有 16 个级别，factor(beef$CFO) 有 15 个级别，其余的不能编码为因子，因为它们有完整的 17 个级别。

score 2 · Accepted Answer

您的确切问题是：“我如何识别是否有任何分类变量应编码为 data.frame 中的因素？”。在这里使用“应该”是关键部分。为什么我们将事物编码为因子呢？

当数据被限制在多个离散级别时使用因子，例如交通灯颜色的“红色”、“橙色”或“绿色”（是的，一些国家有“红色+橙色”或“闪烁橙色”以及，将这些添加到您的因子水平）。当数据是分类的但具有定义的顺序时，使用有序因子，例如“小”、“中”、“大”或“特大”。

如果您的数据是数字，它很可能应该保留为数字，除非它已经是基础类别的数字编码（例如 1=男性，2=女性）。除非您只有几个值并且使用分类方法进行统计分析比连续数值方法更有意义，否则几乎没有理由将任何其他数值转换为因子。

score 1 · Accepted Answer

Spacedman 提出了一些非常好的观点，即从数字数据中随意创建因子通常是不可取的。不过，对于可视化或某些建模方法，它可能很有用。我使用下面的实用函数来替换按（有序）因子中具有很少不同条目的列data.frame，我在下面发布了一个示例：

make_factors <- function(data, max_levels=15) {
    # convert all columns in <data> that are not already factors
    # and that have fewer than <max_levels> distinct values into factors. 
    # If the column is numeric, it becomes an ordered factor.

    stopifnot(is.data.frame(data))
    for(n in names(data)){
        if(!is.factor(data[[n]]) && 
                length(unique(data[[n]])) <= max_levels) {
            data[[n]] <- if(!is.numeric(data[[n]])){
                 as.factor(data[[n]])
            } else {
                 ordered(data[[n]])
            }    
        }
    }
    data
}


# create dataset with one numeric column <foo> with few  distinct entries 
# and one character column <baz> with few  distinct entries :
data <- iris
data <- within(data, {
     foo <- round(iris[, 1])
     baz <- as.character(foo)
})   


table(data$foo)

## 4  5  6  7  8 
## 5 47 68 24  6 

str(data)

## 'data.frame':    150 obs. of  7 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ baz         : chr  "5" "5" "5" "5" ...
##  $ foo         : num  5 5 5 5 5 5 5 5 4 5 ...

str(make_factors(data))

## 'data.frame':    150 obs. of  7 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ baz         : Factor w/ 5 levels "4","5","6","7",..: 2 2 2 2 2 2 2 2 1 2 ...
##  $ foo         : Ord.factor w/ 5 levels "4"<"5"<"6"<"7"<..: 2 2 2 2 2 2 2 2 1 2 ...

r - 如何识别是否有任何分类变量应该编码为data.frame中的因素？

2 回答 2

Related

Reference