r - What is the best way to handle labelled variables imported with haven?

Question

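I have about 15 SPSS election study files saved as .sav files. My group and I will be recoding about 10 variables for each study in order to run some logistic regressions.

I used haven to import all the files, so all variables appear to be of class haven_labelled.

I have always been a bit confused about how to handle this class of variable, but I have seen a lot of improvement as the haven and labelled packages have been updated, so I am inclined to keep using them rather than, e.g., rio or foreign.

But I want to get a sense of what the best practices are before we start this effort, so that we don't have regrets.

Each study file has about 200 variables, a mix of factors and numeric variables. To start, I am wondering how I should recode the sex variable so that I end up with a variable male where 1 is male and 0 is not.

One thing I'd like to ask about is the car::Recode() method of recoding variables, as opposed to the dplyr::recode() way. I personally find the dplyr::recode() syntax clunky and the help documentation poor. I am also not sure about the best way to set missing values.

To be specific, I have three questions.

Question 1: Is there a compelling reason to use dplyr::recode() rather than car::Recode()? My own answer is that car::Recode() looks sufficient and easy to use.

Question 2: Should I make a point of converting variables to factors or numeric, or is it OK to leave them as haven_labelled with updated value labels? I am worried about this quote from the haven documentation regarding the labelled class: "This class provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing".

However, perhaps the haven_labelled class has been improved and is now quite different from the labelled class, so that there is no longer a need to coerce to standard R classes.

Question 3: Is there any advantage to setting missing values with the labelled methods (e.g. na_range(), na_values()) rather than with car::Recode()?

My inclination is that the labelled methods have clear disadvantages and that I should stick with car::Recode().

Thank you.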

#FAKE DATA
library(labelled)
var1<-labelled(rep(c(1,5), 100), c(male = 1, female = 5))
var2<-labelled(sample(c(1,3,5,7,8,9), size=200, replace=T), c('strongly agree'=1, 'agree'=3, 'disagree'=5, 'strongly disagree'=7, 'DK'=8, 'refused'=9))
#give variable labels
var_label(var1)<-'Respondent\'s sex'
var_label(var2)<-'free trade is a good thing'
df<-data.frame(var1=var1, var2=var2)
str(df)
#This works really well; and I really like this. 
look_for(df, 'sex')
look_for(df, 'free trade')
#the Car way
df$male<-car::Recode(df$var1, "5=0")
#Check results
df$male 
#value labels are still there, so would have to be removed or updated
as_factor(df$male)
#Remove value labels
val_labels(df$male)<-NULL
#Check 
class(df$male) #left with a numeric variable
#The other car way, keeping and modifying value labels
df$male2<-car::Recode(df$var1, "5=0")
df$male2
val_label(df$male2, 0)<-c('female')
val_label(df$male2, 5)<-NULL
val_labels(df$male2)
#Check class
class(df$male2)
#Can run numeric functions on it
mean(df$male2)
#easily convert to factor
as_factor(df$male2)

#How to handle missing values
#The CAR way
#use car to set missing values to NA (car is not attached, so call it with car::)
df$free_trade<-car::Recode(df$var2, "8=NA; 9=NA")
#Check class
class(df$free_trade)
#can still run numeric functions on haven_labelled
mean(df$free_trade, na.rm=TRUE)
#table
table(df$free_trade)
#did the na recode work?
table(is.na(df$free_trade))
#check value labels
val_labels(df$free_trade)

#set missing values the labelled way
table(df$var2)
na_values(df$var2)<-c(8,9)
#check
df$var2
#but table() does not pick up 8 and 9 as missing
table(df$var2)
#this seems to not work very well
table(to_factor(df$var2))
to_factor(df$var2)

score 0 · Accepted Answer

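A bit late to the game, but still some answers:

Should I make a point of converting variables to factors or numeric, or is it OK to leave them as haven_labelled with updated value labels?

First, you need to understand that haven_labelled vectors like these are numeric under the hood (i.e. they will be treated as continuous variables), which you can easily check: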

library(tidyverse)
df %>%
  as_tibble() %>%
  head()

This gives:

# A tibble: 6 x 2
        var1                  var2
   <dbl+lbl>             <dbl+lbl>
1 1 [male]   5 [disagree]         
2 5 [female] 5 [disagree]         
3 1 [male]   3 [agree]            
4 5 [female] 5 [disagree]         
5 1 [male]   7 [strongly disagree]
6 5 [female] 9 [refused]

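Whether you should convert to standard types probably depends on your analysis.

For simple frequency tables it is probably fine to keep the variables as they are, e.g.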

df %>%
  as_tibble() %>%
  count(var1)

# A tibble: 2 x 2
        var1     n
   <dbl+lbl> <int>
1 1 [male]     100
2 5 [female]   100

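However, for any type-sensitive analysis (starting with calculating a mean, but also regressions etc.) you should definitely convert your variables to the class appropriate for that analysis. Not doing so and treating everything as continuous will produce wrong results. Think of a truly categorical variable such as 1=Bus, 2=Car, 3=Bike that you throw into a linear regression.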
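To make the conversion concrete, here is a minimal sketch using haven's as_factor() and zap_labels(). The vector mode and its values are made up for illustration and are not from the question's data.

```r
library(haven)

# Made-up labelled vector mirroring the Bus/Car/Bike example (illustrative only)
mode <- labelled(c(1, 2, 3, 2, 1, 3),
                 labels = c(Bus = 1, Car = 2, Bike = 3))

# Categorical variable: convert to a factor; the value labels become the levels
mode_f <- as_factor(mode)
levels(mode_f)

# Numeric variable: drop the value labels and keep the underlying numbers
mode_n <- zap_labels(mode)
class(mode_n)

# As a factor, the variable is dummy-coded in a model matrix
# instead of entering as a single continuous predictor
mm <- model.matrix(~ mode_f)
colnames(mm)
```

Which conversion is right depends on the variable: as_factor() for categories, zap_labels() for real quantities.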

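Is there a compelling reason to use dplyr::recode rather than car::Recode?

There is no right or wrong here. Personally, I prefer to stay in the tidyverse, where you have recode. You also get a lot of functions for dealing with missing values there, such as if_else, case_when, replace_na, na_if or coalesce. The syntax of car::recode is not much different from dplyr's, so I would say this is mostly personal preference.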
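As a rough sketch of the tidyverse route, the two recodes from the question could look like this. The tibble dat and its columns sex and view are hypothetical stand-ins with plain numeric codes, used so the example does not depend on how labelled vectors interact with these functions.

```r
library(dplyr)

# Hypothetical stand-ins for the question's var1 and var2, as plain numeric codes
dat <- tibble(
  sex  = c(1, 5, 1, 5),
  view = c(1, 3, 8, 9)
)

dat <- dat %>%
  mutate(
    # 1 = male, any other code becomes 0
    male = if_else(sex == 1, 1, 0),
    # codes 8 (DK) and 9 (refused) become NA, everything else is kept
    free_trade = case_when(
      view %in% c(8, 9) ~ NA_real_,
      TRUE ~ view
    )
  )

dat
```

The car equivalents are the one-liners already shown in the question, so the choice is mainly about which syntax your group finds easier to read.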

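The same goes for your question of whether or not to use functions from labelled. The labelled package does add some very powerful functions for working with labelled vectors that go beyond what haven or the tidyverse provide, so IMO it is a good package.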
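One example that bears on the missing-values part of the question: values declared with na_values() stay in the data as user-defined missings, and, as far as I know, labelled offers user_na_to_na() and the user_na_to_na argument of to_factor() to turn them into real NA. A minimal sketch, with a made-up vector x:

```r
library(labelled)

# Made-up vector with the same coding as var2 in the question
x <- labelled(
  c(1, 3, 5, 7, 8, 9),
  c("strongly agree" = 1, agree = 3, disagree = 5,
    "strongly disagree" = 7, DK = 8, refused = 9)
)

# Declare 8 and 9 as user-defined missing values (SPSS style)
na_values(x) <- c(8, 9)

# Turn the user-defined missings into real NA
x_na <- user_na_to_na(x)
is.na(x_na)

# Or do the same while converting to a factor
f <- to_factor(x, user_na_to_na = TRUE)
table(f, useNA = "ifany")
```

After this step, table() and modelling functions see the declared codes as ordinary NA.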
