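Sometimes I need to convert SPSS files to DTA files. I normally use Stat/Transfer, but I thought I might be able to save money by using R instead.

However, when I convert a file with the haven package, the resulting file is much larger than the one Stat/Transfer produces.

For example, here is a .sav file I found on the Internet. It is 85 KB.

Converting it with Stat/Transfer produces a smaller 47 KB .dta file.

However, when I run the code below, I get a 118 KB .dta file. That is 2.5 times the size of the Stat/Transfer output.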

from.sav <- haven::read_sav("PsychBike.sav")
haven::write_dta(from.sav, "PsychBikeFromHaven.dta")

Is there anything I can do to make the output of haven::write_dta() smaller?

1 Answer

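This is because write_dta() does not do what Stata's compress command does. That is, write_dta() often chooses storage types that are larger than the data require. Below is an extreme but real example from my work. (File names and variable names have been edited.)

Note the file size. It drops from about 1 MB (1,032,900 bytes) to about 6 KB (6,200 bytes), a 99.4% reduction. The real dataset has millions of observations, so I have a hard time converting it to dta with write_dta(). This may need to be fixed at the ReadStat level.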

. desc, size

Contains data from v1.dta
  obs:           100
 vars:            22                          04 Sep 2019 10:19
 size:     1,032,900
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
var1            double  %10.0g
var2            str1    %-9s
var3            double  %td
var4            double  %td
var5            str4    %-9s
var6            str1    %-9s
var7            str2045 %-9s
var8            str2045 %-9s
var9            str2045 %-9s
var10           str2045 %-9s
var11           str2045 %-9s
var12           str5    %-9s
var13           double  %10.0g
var14           double  %td
var15           double  %10.0g
var16           str3    %-9s
var17           double  %10.0g
var18           double  %10.0g
var19           double  %10.0g
var20           double  %10.0g
var21           double  %10.0g
var22           str2    %-9s
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
r; t=0.00 10:27:24

. compress
  variable var1 was double now long
  variable var3 was double now int
  variable var4 was double now int
  variable var14 was double now int
  variable var17 was double now byte
  variable var18 was double now long
  variable var19 was double now byte
  variable var20 was double now byte
  variable var7 was str2045 now str1
  variable var8 was str2045 now str1
  variable var9 was str2045 now str1
  variable var10 was str2045 now str1
  variable var11 was str2045 now str1
  (1,026,700 bytes saved)
r; t=0.00 10:27:34

. desc, size

Contains data from v2.dta
  obs:           100
 vars:            22                          04 Sep 2019 10:19
 size:         6,200
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
var1            long    %10.0g
var2            str1    %-9s
var3            int     %td
var4            int     %td
var5            str4    %-9s
var6            str1    %-9s
var7            str1    %-9s
var8            str1    %-9s
var9            str1    %-9s
var10           str1    %-9s
var11           str1    %-9s
var12           str5    %-9s
var13           double  %10.0g
var14           int     %td
var15           double  %10.0g
var16           str3    %-9s
var17           byte    %10.0g
var18           long    %10.0g
var19           byte    %10.0g
var20           byte    %10.0g
var21           double  %10.0g
var22           str2    %-9s
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
r; t=0.00 10:27:37
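
If running compress in Stata afterwards is not an option, part of the effect can perhaps be approximated on the R side before writing. The following is a minimal sketch, not something haven provides: shrink_doubles() is a hypothetical helper name, and it rests on the assumption that write_dta() stores R integer vectors in a narrower Stata type than double, which I have not verified for every haven version. It only touches plain double vectors and leaves labelled, Date, and factor columns alone, and it does nothing about over-wide string columns such as the str2045 variables above.

```r
# Hypothetical helper (not part of haven): store whole-number double
# columns as integer so that a writer may pick a narrower storage type.
shrink_doubles <- function(df) {
  int_max <- .Machine$integer.max
  for (name in names(df)) {
    x <- df[[name]]
    # Skip anything that is not a plain double vector
    # (labelled, Date, and factor columns carry a class and are left alone).
    if (!is.double(x) || is.object(x)) next
    vals <- x[!is.na(x)]
    # Convert only if every non-missing value is a whole number in integer range.
    fits <- all(is.finite(vals)) &&
      all(vals == round(vals)) &&
      all(abs(vals) <= int_max)
    if (fits) {
      storage.mode(x) <- "integer"  # keeps attributes such as variable labels
      df[[name]] <- x
    }
  }
  df
}

# Small demonstration on made-up data.
demo <- data.frame(
  id = c(1, 2, 3),
  score = c(1.5, 2, NA),
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
slim <- shrink_doubles(demo)
# id becomes "integer"; score and name stay "double" and "character".
print(sapply(slim, typeof))
```

The result could then be passed to haven::write_dta() in place of the original data frame, as in the question's code. Whether this actually shrinks the .dta file should be checked with desc, size in Stata, since the choice of storage type is ultimately made by haven and ReadStat.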
Answered 2019-09-04T14:35:20.777