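I am analyzing a Pokémon dataset. I want to build a random forest that predicts whether a Pokémon is legendary.
Right now I have a training dataset with 118 observations and 44 columns: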
variables:
$ type1_bug : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_dark : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_dragon : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_electric: int 0 1 0 0 0 1 0 0 0 0 ...
$ type1_fairy : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_fighting: int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_fire : int 0 0 1 0 0 0 1 0 0 1 ...
$ type1_flying : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_ghost : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_grass : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_ground : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_ice : int 1 0 0 0 0 0 0 0 0 0 ...
$ type1_normal : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_poison : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_psychic : int 0 0 0 1 1 0 0 0 1 0 ...
$ type1_rock : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_steel : int 0 0 0 0 0 0 0 0 0 0 ...
$ type1_water : int 0 0 0 0 0 0 0 1 0 0 ...
$ type2_ : int 0 0 0 1 1 1 1 1 0 0 ...
$ type2_bug : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_dark : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_dragon : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_electric: int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_fairy : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_fighting: int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_fire : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_flying : int 1 1 1 0 0 0 0 0 1 1 ...
$ type2_ghost : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_grass : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_ground : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_ice : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_normal : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_poison : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_psychic : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_rock : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_steel : int 0 0 0 0 0 0 0 0 0 0 ...
$ type2_water : int 0 0 0 0 0 0 0 0 0 0 ...
$ hp : int 90 90 90 106 100 90 115 100 106 106 ...
$ attack : int 85 90 100 150 100 85 115 75 90 130 ...
$ defense : int 100 85 90 70 100 75 85 115 130 90 ...
$ sp_attack : int 95 125 125 194 100 115 90 90 90 110 ...
$ sp_defense : int 125 90 85 120 100 100 75 115 154 154 ...
$ speed : int 85 100 90 140 100 115 100 85 110 90 ...
$ is_legendary : int 1 1 1 1 1 1 1 1 1 1 ...
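As you can see, the columns are dummy variables and base stats, plus the target class is_legendary.
The problem is that the data is imbalanced: there are far fewer observations for legendary Pokémon than for non-legendary ones. So I want to balance the dataset by creating synthetic data. I was told that the SMOTE function can do this,
but I get an error. Here is the full code: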
# Packages used below: dplyr (select, sample_n), dummies (dummy.data.frame), DMwR (SMOTE)
library(dplyr)
library(dummies)
library(DMwR)
# Creating a dataset for legendary pokemon and non legendary pokemon
pokemonllegendari <- df_net[df_net$is_legendary == 1,]
pokemoncomu <- df_net[df_net$is_legendary == 0,]
# Selecting attributes
pokemonllegendari <- pokemonllegendari %>% select(type1,type2,hp,attack,defense,sp_attack,sp_defense,speed,is_legendary)
pokemoncomu <- pokemoncomu %>% select(type1,type2,hp,attack,defense,sp_attack,sp_defense,speed,is_legendary)
# Balancing dataset
pokemoncomusample <- sample_n(pokemoncomu,100)
# Concatenating dataset
rawdata <- rbind(pokemonllegendari,pokemoncomusample)
# Dummy variables
rawdata <- dummy.data.frame(rawdata,sep="_")
# Creating training and test datasets
dt <- sort(sample(nrow(rawdata),nrow(rawdata)*.7))
train <- rawdata[dt,]
test <- rawdata[-dt,]
# Increasing number of legendary pokemons using SMOTE
smoted_data <- SMOTE(is_legendary~., train, perc.over=100)
The error is:
Error in T[i, ] : subscript out of bounds
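
For reference, the class imbalance and the storage type of the target column can be inspected with base R before calling SMOTE. This is only a minimal sketch on a made-up data frame: toy_train and all of its values are hypothetical stand-ins, not my real data. The last step assumes, without having confirmed it as the cause of the error, that a classification target is expected to be a factor rather than an int.

```r
# Hypothetical toy data standing in for `train`; all values are invented.
toy_train <- data.frame(
  hp           = c(90, 106, 100, 45, 60, 80, 39, 58),
  speed        = c(85, 140, 100, 45, 60, 80, 65, 80),
  is_legendary = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L)
)

# Count observations per class to quantify the imbalance.
class_counts <- table(toy_train$is_legendary)
print(class_counts)

# Share of each class in the data.
class_share <- prop.table(class_counts)
print(round(class_share, 2))

# The target is stored as an integer, as in the str() output above.
print(class(toy_train$is_legendary))

# Assumption (not verified against SMOTE itself): classification
# routines usually expect the target to be a factor.
toy_train$is_legendary <- factor(toy_train$is_legendary, levels = c(0, 1))
print(levels(toy_train$is_legendary))
```

On the real data, the same calls would be run on train and train$is_legendary instead of the toy objects.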