r - R: more than 52 levels in a predicting factor, truncated for printout

Question

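Hi, I'm a beginner in the R programming language. I wrote code for a regression tree using the rpart package. In my data, some of my independent variables have more than 100 levels. After running the rpart function, I get the following warning message: "more than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very strange way. For example, my tree splits on location, which has about 70 distinct levels, but when the label is displayed in the tree it shows "ZZZZZZZZZZZZZZZZ......", and I don't have any location called "ZZZZZZZZ".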

Please help me.

Thanks in advance.

score 3 · Accepted Answer

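Many functions in R have a limit on the number of levels a factor-type variable can have (for example, randomForest has historically limited a factor to 32 levels).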
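Before fitting anything, it can help to see which columns would run into such a limit. The following is a minimal sketch in base R; the data frame `dat`, its column names, and the value of `max_levels` are made up for illustration, so check the documentation of the function you actually use for its real limit.

```r
# Hypothetical data frame, used only for illustration.
dat <- data.frame(
  location = factor(sprintf("site_%03d", 1:70)),  # 70 distinct levels
  group    = factor(rep(c("a", "b"), 35)),        # 2 distinct levels
  y        = seq_len(70)                          # numeric, not a factor
)

# Assumed limit; replace with the limit of the function you intend to call.
max_levels <- 32

# nlevels() returns 0 for columns that are not factors.
n_lev <- sapply(dat, nlevels)

# Columns whose number of levels exceeds the assumed limit.
too_many <- n_lev[n_lev > max_levels]
too_many
```

Here only `location` is reported, because it is the only factor with more levels than the assumed limit.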

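One way I've seen this dealt with, especially in data mining competitions, is to:

1) Determine the maximum number of levels allowed by the given function (call this X).

2) Use table() to count the occurrences of each level of the factor, and rank the levels from most to least frequent.

3) Leave the top X - 1 levels of the factor as they are.

4) Change all remaining levels (those ranked below the top X - 1) to a single level that identifies them as low-occurrence levels.

Here's an example that's a bit long but hopefully helps: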

# Generate 1000 random integers between 0 and 100.
vars1 <- data.frame(values1 = round(runif(1000) * 100, 0))
# Change the values to a factor variable.
vars1$values1 <- factor(vars1$values1)
# Show the first 6 rows of the data frame.
head(vars1)
# Show the number of unique factor levels.
length(unique(vars1$values1))
# Create a table showing how often each level occurs.
# Tabulating the column directly gives columns named Var1 and Freq.
table1 <- data.frame(table(vars1$values1))
# Order the table by descending frequency.
table1 <- table1[order(-table1$Freq), ]
head(table1)
# Assuming we want to use CART, we choose the top 51
# levels to leave unchanged.
# Get the labels of the 51 most frequently occurring levels.
noChange <- as.character(table1$Var1[1:51])
# We use '-1000' as the catch-all level to avoid overlap with other levels
# (i.e. if '52' was actually one of the levels).
# ifelse() checks whether each value is in the list of the top 51 levels.
# If it is, the value is kept as is; if not, it is changed to '-1000'.
# as.character() is needed because ifelse() would otherwise return the
# factor's internal integer codes instead of its labels.
vars1$newFactor <- factor(ifelse(vars1$values1 %in% noChange,
                                 as.character(vars1$values1), "-1000"))
# Show the number of levels of the new factor column.
length(unique(vars1$newFactor))
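
The same steps can also be wrapped in a small reusable function. This is only a sketch under stated assumptions: the name `lump_rare_levels` and its arguments are hypothetical (not part of rpart or any other package), and the catch-all label `"-1000"` is assumed not to collide with a real level in your data.

```r
# Hypothetical helper: keep the (max_levels - 1) most frequent levels of a
# factor and merge every other level into one catch-all level.
lump_rare_levels <- function(x, max_levels, other = "-1000") {
  x <- as.character(x)
  # Count occurrences of each level, most frequent first.
  counts <- sort(table(x), decreasing = TRUE)
  # Nothing to do if the factor already fits within the limit.
  if (length(counts) <= max_levels) {
    return(factor(x))
  }
  keep <- names(counts)[seq_len(max_levels - 1)]
  # Missing values are left as NA rather than moved to the catch-all level.
  factor(ifelse(is.na(x) | x %in% keep, x, other))
}

set.seed(1)
f <- factor(round(runif(1000) * 100))
g <- lump_rare_levels(f, max_levels = 52)
nlevels(g)  # 51 kept levels plus the catch-all level
```

Passing the limit as an argument keeps the function usable for other modelling functions whose limits differ.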

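Finally, you may want to consider using truncated variables with rpart, because the tree display gets very busy when there are a large number of variables or when their names are long.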
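
One way to shorten long labels before fitting is base R's abbreviate(). The sketch below applies it to factor levels; the location names are invented for illustration, and `minlength = 8` is an arbitrary choice.

```r
# Hypothetical location names, used only for illustration.
location <- factor(c("North Riverside District", "South Harbour Terminal",
                     "North Riverside District", "East Market Square"))

# abbreviate() shortens each label to at least `minlength` characters while
# keeping labels that were distinct from one another distinct.
levels(location) <- unname(abbreviate(levels(location), minlength = 8))
levels(location)
```

The number of levels is unchanged; only the labels that end up in the printed tree become shorter.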
