I have a dataset in SAS that I need to transpose. It has the form id date type value, and I need to convert it to id date valueoftype1 valueoftype2 ...

Is there an efficient way to do this? My data is huge.

For example:

data one; 
input ID date type $ value; 

cards; 
1 2001 A 2
1 2002 A 4
1 2001 B 3
2 2001 B 1
2 2002 A 5
2 2002 C 2
2 2003 C 5
3 2001 B 6
4 2002 B 8
4 2003 B 4
4 2001 A 2
;

I would like to convert it to the following form (the last three columns are valA, valB, valC):

1 2001 2 3 .
1 2002 4 . .
2 2001 . 1 .
2 2002 5 . 2
2 2003 . . 5
3 2001 . 6 .
4 2001 2 . .
4 2002 . 8 .
4 2003 . 4 .

3 Answers

PROC TRANSPOSE will do this very, very efficiently; I'd venture to say as well as or better than the most efficient method of any other DBMS out there. Your data is also already beautifully organized for that method. You just need a sort by ID DATE, unless you already have an index on that combination (which, if you have billions of records, is a necessity IMO). No other solution will come close unless you have enough memory to hold it all in memory, which would be rather insane for a dataset that size (even 1 billion records would be a minimum of 7GB, and if you have millions of IDs then it's clearly not a 1-byte ID; I'd guess 25-30 GB or more).

proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;
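
If the output columns should carry the valA, valB, valC names from the question, a small variation on the step above may help. This is only a sketch: it assumes the question's sample dataset ONE, uses the PREFIX= option to prepend "val" to each TYPE value, and drops the automatic _NAME_ column.

```sas
/* Sort in place by the BY variables (skip if already sorted or indexed) */
proc sort data=one;
  by id date;
run;

/* One output row per ID/DATE; each TYPE value becomes a column named val<TYPE> */
proc transpose data=one out=want(drop=_name_) prefix=val;
  by id date;
  id type;
  var value;
run;
```

ID/DATE combinations that lack a given type get a missing value (.) in that column, which matches the layout requested in the question.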

A naive test on my system, with the following:

data one; 
do id = 1 to 1e6;
  do date = '01JAN2010'd to '01JAN2012'd;
    type = byte(ceil(ranuni(7)*26)+64);
    value = ceil(ranuni(7)*20);
    output;
  end;
end;
run;
proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;

That dataset is ~20GB compressed (OPTIONS COMPRESS=YES). It took about 4 minutes 15 seconds to write initially, took 11 minutes to sort, and took 45 minutes to PROC TRANSPOSE, writing a ~100GB compressed file. I'd guess that's the best you can do; of those 45 minutes, over 20 were likely writing out (5x bigger dataset will take over 5x the time to write out, plus compression overhead); I was also doing other things at the time, so the CPU time was probably inflated some as it didn't get my entire processor (this is my desktop, a 4 core i5). I don't think this is particularly unreasonable processing time at all.

You might consider looking at your needs, and perhaps a transpose isn't really what you want - do you really want to grow your table that much? Odds are you can achieve your actual goal (your analysis/etc.) without transposing the entire dataset.
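
As one illustration of working on the long form directly, per-type summary statistics need no transpose at all. The choice of statistics here is an assumption made for the example, not something the question asks for.

```sas
/* Summarise VALUE within each TYPE straight from the long dataset */
proc means data=one n mean min max;
  class type;
  var value;
run;
```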

Answered 2013-05-29T13:38:26.463

if first.date then a=.;b=.;c=.;d=.; 

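must be replaced with the following (as written, only a=. is governed by the IF; the other three assignments run on every row):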

if first.date then do;
    a=.;b=.;c=.;d=.;
end;

or

if first.date then call missing(a,b,c,d);

Also, instead of

if last.date then do; output; a=.;b=.;c=.;d=.; end;

this should now be enough:

if last.date then output;

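I would guess a data step is always more efficient than PROC TRANSPOSE on big data. The limitation is that you have to find out the distinct values of the transposed variable yourself and create new variables for them. I think that is the overhead of PROC TRANSPOSE: it first finds out the values. (Sorry, I edited your own answer, so it may no longer be clear what the problem was.)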
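
The full data step being corrected is not shown in this thread. A minimal sketch of what such a step might look like for the question's three types follows; it assumes ONE is already sorted by ID DATE and uses the question's valA/valB/valC names instead of the a/b/c/d in the snippets above.

```sas
data want;
  set one;
  by id date;
  /* Keep the accumulators across rows within the same ID/DATE group */
  retain valA valB valC;
  /* Reset all accumulators at the start of each group */
  if first.date then call missing(valA, valB, valC);
  if type = 'A' then valA = value;
  else if type = 'B' then valB = value;
  else if type = 'C' then valC = value;
  /* Write one row per group, after its last input row has been read */
  if last.date then output;
  drop type value;
run;
```

The list of types is hard-coded here, which is the limitation the answer describes: the distinct values have to be known before the step is written.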

Answered 2013-05-30T08:38:25.833

Another data step approach (DOW-loop):

proc sort data = one;
  by ID date;
run;

data two;
  do _n_ = 1 by 1 until(last.date);
    set one;
    by ID DATE;
    if type = "A" then valA = value;
    else if type = "B" then valB = value;
    else if type = "C" then valC = value;
  end;
  drop value;
run;

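On my system, using a dataset 1/10 the size of the one Joe used, the sort takes 2 minutes and proc transpose takes 9 minutes 40 seconds. The DOW-loop does the same thing in 7 minutes 4 seconds. In this particular scenario that is not very impressive, but it has one big advantage over proc transpose: you can use it to transpose multiple variables in one pass. Here is the code I used: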

data one; 
do id = 1 to 1e5;
  do date = '01JAN2010'd to '01JAN2012'd;
    type = byte(ceil(ranuni(7)*26)+64);
    value = ceil(ranuni(7)*20);
    output;
  end;
end;
run;

data two;
  do _n_ = 1 by 1 until(last.DATE);
    set one;
    array vals[26] val65-val90;
    by ID DATE;
    do i = 1 to 26;
      if type = byte(64 + i) then vals[i] = value;
    end;
  end;
  drop value i;
run;

Dynamically renaming all 26 transposed type variables is a bit tricky, but it can be done with call execute:

data _null_;
  call execute('proc datasets lib = work nolist;');
  call execute('modify two;');
  call execute('rename');
  do i = 1 to 26;
    call execute(compress('val' || i + 64) || ' = ' || compress('val' || byte(64+i)));
  end;
  call execute(';');
  call execute('run;');
  call execute('quit;');
run;
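
The point about transposing several variables in one pass can be sketched as follows. The second measure, value2, and the dataset names one2 and two2 are hypothetical, invented only for this illustration; the sketch uses the question's three types and assumes ONE is sorted by ID DATE.

```sas
/* Hypothetical input: ONE plus a made-up second measure, value2 */
data one2;
  set one;
  value2 = value * 10;
run;

data two2;
  /* DOW-loop: read every row of one ID/DATE group in a single iteration */
  do _n_ = 1 by 1 until(last.date);
    set one2;
    by id date;
    if type = "A" then do; valA = value; val2A = value2; end;
    else if type = "B" then do; valB = value; val2B = value2; end;
    else if type = "C" then do; valC = value; val2C = value2; end;
  end;
  /* The implicit OUTPUT at the end of the step writes one row per group */
  drop type value value2;
run;
```

Because valA through val2C are not retained, they are reset to missing at the top of each data step iteration, so no explicit reset is needed between groups.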
Answered 2013-06-04T17:17:56.393