r - Convenient way to access variable labels after importing Stata data with haven

Question

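In R, some packages (e.g. haven) insert a label attribute into variables, which explains the substantive name of the variable. For example, gdppc may have the label GDP per capita.

This is extremely useful, especially when importing data from Stata. However, I still struggle to know how to use it in my workflow.

How can I quickly browse the variables and their labels? Right now I have to run attributes(df$var), but that is hardly convenient for getting a glimpse (a la names(df)).
How can I use these labels in plots? Again, I can use attr(df$var, "label") to access the label string. However, this seems cumbersome.

Is there any official way to use these labels in a workflow? I can certainly write a custom function that wraps attr, but it may break in the future when packages implement the label attribute differently. Ideally, therefore, I would like an official way supported by haven (or other major packages).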

score 16 · Accepted Answer

A solution using the purrr package from the tidyverse:

library(purrr)
df %>% map_chr(~attributes(.)$label)

Answered 2017-04-13T14:46:44.840
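
One caveat with the one-liner above: map_chr() expects exactly one string per column, so it stops with an error as soon as a column has no label attribute. The following is a minimal base-R sketch of the same lookup that returns NA for unlabelled columns instead. The data frame, its column names, and the name var_labels are made up for illustration; they are not part of the original answer.

```r
# Toy data standing in for a haven import: one labelled column, one not.
df <- data.frame(gdppc = c(1200, 3400), year = c(2001, 2002))
attr(df$gdppc, "label") <- "GDP per capita"

# Named character vector of labels; unlabelled columns yield NA.
var_labels <- vapply(
  df,
  function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  },
  character(1)
)

print(var_labels)
```

The result is named by column, so var_labels["gdppc"] gives a quick lookup in the spirit of names(df).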

score 8 · Accepted Answer

Use sapply inside a simple function to return a variable list, like the one in Stata's Variables window:

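library(dplyr)
makeVlist <- function(dta) {
  # Collect each column's "label" attribute; unlabelled columns get NA
  labels <- sapply(dta, function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  })
  tibble(name = names(labels),
         label = unname(labels))
}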
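
The same attribute lookup can also supply axis titles, which is the second thing the question asks about. The sketch below is only one possible approach, in base R; the helper label_or_name and the toy columns are hypothetical names introduced here, not functions from haven or any other package.

```r
# Toy data: only gdppc carries a variable label.
df <- data.frame(gdppc = c(1200, 3400, 5600), lifeexp = c(61, 68, 74))
attr(df$gdppc, "label") <- "GDP per capita"

# Hypothetical helper: return the variable's label if it has one,
# otherwise fall back to the column name.
label_or_name <- function(data, var) {
  lab <- attr(data[[var]], "label", exact = TRUE)
  if (is.null(lab)) var else as.character(lab)
}

x_lab <- label_or_name(df, "gdppc")
y_lab <- label_or_name(df, "lifeexp")
```

The two strings can then be passed to a plotting call, for example plot(df$gdppc, df$lifeexp, xlab = x_lab, ylab = y_lab). Falling back to the column name keeps the plot usable when a variable has no label.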

score 4 · Accepted Answer

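This is one of the innovations addressed in rio (full disclosure: I wrote this package). Basically, it provides several ways of importing variable labels, including haven's way of doing things and foreign's. Here is a simple example:

Start by making a reproducible example: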

> library("rio")
> export(iris, "iris.dta")

Import using foreign::read.dta() (via rio::import()):

> str(import("iris.dta", haven = FALSE))
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr ""
 - attr(*, "time.stamp")= chr "15 Jan 2016 20:05"
 - attr(*, "formats")= chr  "" "" "" "" ...
 - attr(*, "types")= int  255 255 255 255 253
 - attr(*, "val.labels")= chr  "" "" "" "" ...
 - attr(*, "var.labels")= chr  "" "" "" "" ...
 - attr(*, "version")= int -7
 - attr(*, "label.table")=List of 1
  ..$ Species: Named int  1 2 3
  .. ..- attr(*, "names")= chr  "setosa" "versicolor" "virginica"

Read in using haven::read_dta() with its native variable attributes, which are stored at the variable level rather than the data.frame level:

> str(import("iris.dta", haven = TRUE, column.labels = TRUE))
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     :Class 'labelled'  atomic [1:150] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "labels")= Named int [1:3] 1 2 3
  .. .. ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"

Read in using haven::read_dta() with an alternative that we (the rio developers) have found more convenient:

> str(import("iris.dta", haven = TRUE))
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "var.labels")=List of 5
  ..$ Sepal.Length: NULL
  ..$ Sepal.Width : NULL
  ..$ Petal.Length: NULL
  ..$ Petal.Width : NULL
  ..$ Species     : NULL
 - attr(*, "label.table")=List of 5
  ..$ Sepal.Length: NULL
  ..$ Sepal.Width : NULL
  ..$ Petal.Length: NULL
  ..$ Petal.Width : NULL
  ..$ Species     : Named int  1 2 3
  .. ..- attr(*, "names")= chr  "setosa" "versicolor" "virginica"

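By moving the attributes to the level of the data.frame, they are easier to access with attr(data, "var.labels") and so on, rather than by digging through each variable's attributes.

Note: the attribute values are NULL here because I simply wrote a native R dataset to a local file to make the example reproducible.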
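
To make the idea of data.frame-level attributes concrete, here is a small base-R sketch that gathers per-variable labels into a single attribute. It is not rio's implementation; it only mimics the shape of the "var.labels" attribute shown in the output above, and the toy data frame is made up for illustration.

```r
# Toy data with one variable-level label, as haven would store it.
df <- data.frame(gdppc = c(1200, 3400), year = c(2001, 2002))
attr(df$gdppc, "label") <- "GDP per capita"

# Gather the per-variable labels into one named list
# (NULL where a variable has no label) and attach it to the data.frame.
attr(df, "var.labels") <- lapply(df, function(x) attr(x, "label", exact = TRUE))

# All labels are now available from a single attribute lookup.
all_labels <- attr(df, "var.labels")
str(all_labels)
```

One trade-off to keep in mind: base R's subsetting methods for data frames do not necessarily carry custom data.frame-level attributes along, so a list like this may need to be rebuilt after the data are reshaped.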

score 3 · Accepted Answer

A simple solution with the labelled package (tidyverse)

library(labelled)   # var_label()
library(tidyverse)  # %>%, as_tibble(), gather()
descriptions <- var_label(data_raw) %>% 
  as_tibble() %>% 
  gather(key = variable, value = description)

score 1 · Accepted Answer

Coerce to a factor using the haven package

haven::as_factor(df$var, levels = "labels")

Answered 2021-06-16T12:34:26.470
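
Note that this call works on value labels (the code-to-name mapping), not on the variable label. As a rough illustration of what such a conversion amounts to, the base-R sketch below turns a toy value-labelled vector into a factor. The helper labels_to_factor is hypothetical and is not haven's implementation; it assumes every code in the vector has a matching entry in the "labels" attribute.

```r
# Toy value-labelled vector: integer codes plus a named "labels" attribute,
# the same shape str() shows for Species in the haven-style import above.
x <- c(1L, 1L, 2L, 3L)
attr(x, "labels") <- c(setosa = 1L, versicolor = 2L, virginica = 3L)

# Hypothetical helper, base R only: map each code to the name of its label.
labels_to_factor <- function(v) {
  labs <- attr(v, "labels", exact = TRUE)
  factor(as.vector(v), levels = unname(labs), labels = names(labs))
}

f <- labels_to_factor(x)
print(f)
```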

score 1 · Accepted Answer

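The purpose of the labelled package is to provide convenient functions for manipulating labelled data imported with haven.

In addition, the functions lookfor and describe from the questionr package can also be used to display variable and value labels.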
