I have a dataset that looks like this (note that the products within each purchase are separated by a space):

Client_ID      Purchase
121212         "Orange_Juice Lettuce"
121212         "Banana Bread "
230102         "Banana Apple"
230102         "Chicken"
121212         "Chicken Bread"
301450         "Grapes Lettuce"
...            ...

Now I would like to know which products each person bought, using one dummy variable per item:

Client_ID    Apple    Banana    Bread    Chicken    Grapes    Lettuce    Orange_Juice
121212       0        1         1        1          0         1          1  
230102       1        1         0        1          0         0          0
301450       0        0         0        0          1         1          0
...          ...      ...       ...      ...        ...       ...        ...

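I asked a similar question a few weeks ago, but there I did not have several items in the same row, as is the case here, so I am really lost. I tried splitting the items into multiple columns, but that is not ideal because each purchase can contain a different number of items (up to a few dozen, as far as I know).

Any ideas on how to proceed? Thanks in advance!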

3 Answers

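Here is a flexible solution using PROC FREQ and PROC TRANSPOSE. The SPARSE option gives you the zeros. I am assuming you only want 1 or 0, hence the NODUPKEY sort; if you actually want a 2 for BREAD for the first ID, remove NODUPKEY (or drop the sort entirely).

First create a vertical dataset with one record per ID/product (splitting each purchase into products); then run PROC FREQ on that dataset so that you have a dataset with a 1/0 count for every client/product combination; then transpose it, using PRODUCT as the ID and COUNT as the VAR.

If there are products that you want guaranteed to show up as zero even when no one bought them, add a row to the initial table (or anywhere before PROC FREQ) with a dummy client ID and all possible products, then drop the dummy client ID after the transpose.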

data test;
input @1 Client_ID  6.   @16 Purchase $50.;
datalines;
121212         Orange_Juice Lettuce
121212         Banana Bread 
230102         Banana Apple
230102         Chicken
121212         Chicken Bread
301450         Grapes Lettuce
;;;;
run;

data vert;
set test;
format product $20.;
do _x = 1 by 1 until (missing(product));
  product=scan(purchase,_x);
  if not missing(product) then output;
end;
run;
proc sort data=vert nodupkey;
by client_id product;
run;

proc freq data=vert;
tables client_id*product/sparse out=prods;
run;

proc transpose data=prods out=horiz;
by client_id;
id product;
var count;
run;
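
The dummy-client idea described above might look roughly like the following. This is only a sketch under stated assumptions: client ID 0 is assumed never to occur in the real data, `Milk` is a hypothetical product that nobody bought, and the data set names `dummy`, `test_all`, `vert_all`, `prods_all` and `horiz_all` are made up for illustration. The product list is spread over two dummy rows so that each fits the $50 length of `Purchase`.

```sas
/* Assumption: client ID 0 is not a real client; Milk is a hypothetical
   product that should appear as a column even though nobody bought it. */
data dummy;
  length Purchase $50;
  Client_ID = 0;
  Purchase = 'Apple Banana Bread Chicken';       output;
  Purchase = 'Grapes Lettuce Orange_Juice Milk'; output;
run;

/* Append the dummy rows to the real purchases */
data test_all;
  set test dummy;
run;

/* One record per client/product */
data vert_all;
  set test_all;
  length product $20;
  do _x = 1 to countw(purchase, ' ');
    product = scan(purchase, _x, ' ');
    output;
  end;
  drop _x;
run;

proc sort data=vert_all nodupkey;
  by client_id product;
run;

/* SPARSE writes the zero-count client/product combinations to OUT= */
proc freq data=vert_all noprint;
  tables client_id*product / sparse out=prods_all;
run;

/* Leave the dummy client out when building the wide table */
proc transpose data=prods_all(where=(client_id ne 0))
               out=horiz_all(drop=_name_ _label_);
  by client_id;
  id product;
  var count;
run;
```

Because SPARSE has already created a `Milk` row with a count of 0 for every real client, the `Milk` column should survive even though the dummy client is filtered out at the transpose step.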

answered 2012-09-05T13:35:21.853

Here is a data step programming solution:

proc sort data=have;
   by client_id;
run;
data want(keep=client_id apple banana bread chicken grapes lettuce orange_juice);
   set have;
      by client_id;
   retain apple banana bread chicken grapes lettuce orange_juice;
   if first.client_id then do;
      apple = 0;
      banana = 0;
      bread = 0 ;
      chicken = 0;
      grapes = 0;
      lettuce = 0;
      orange_juice = 0;
      end;
   length item $20;
   _x = 1;
   item = scan(purchase,_x);
   do while(item ne ' ');
      select(item);
         when('Apple') apple = 1;
         when('Banana') banana = 1;
         when('Bread') bread = 1;
         when('Chicken') chicken = 1;
         when('Grapes') grapes = 1;
         when('Lettuce') lettuce = 1;
         when('Orange_Juice') orange_juice = 1;
         otherwise;
         end;
      _x = _x + 1;
      item = scan(purchase,_x);
      end;
   if last.client_id then output;
run;

Edit: I had missed the part of the question about multiple items in each PURCHASE variable. Thanks, Joe!

answered 2012-09-05T13:02:20.507

Letting the SAS data step write some of the dummy-variable code for you is also a workable solution.

data test;
input Client_ID 6. Purchase $50.;
datalines;
121212         Orange_Juice Lettuce
121212         Banana Bread 
230102         Banana Apple
230102         Chicken
121212         Chicken Bread
301450         Grapes Lettuce
 ;;;;
 run;

filename tmp temp;
 data _null_;
 set test end = done;
 file tmp;
 length product $25 prodlist $1000;
 retain prodlist;
 do i = 1 to countw( purchase, " " );
      product = scan( purchase, i, " " );
      prodlist = ifc( indexw( prodlist, product )=0, catx( ' ', prodlist, product ), prodlist );
 end;
 if done then do; 
    prodlinit=prxchange("s/ /=0; /",-1,compbl(prodlist)); 
    put 'array prods(*) ' prodlist ';'  / prodlinit;
 end;
 run;

 data new;
  set test;
   %inc tmp/source2;
   do i = 1 to dim( prods );
     if indexw(purchase,vname(prods(i))) > 0 then prods(i) = 1;
   end; 
  run;

proc print;
run;
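
Note that `new` still holds one row per purchase, whereas the question asks for one row per client. One possible way to collapse it, sketched here under the assumption that `new` contains only `Client_ID`, `Purchase`, the loop counter `i` and the generated dummy variables (the output name `want` is made up for illustration), is to take the maximum of each dummy within a client:

```sas
/* With no VAR statement, PROC MEANS analyzes every numeric variable
   that is not named in another statement, i.e. the product dummies
   once the loop counter i has been dropped. */
proc means data=new(drop=i) noprint nway;
  class client_id;
  output out=want(drop=_type_ _freq_) max=;
run;
```

Taking the maximum means a product is flagged 1 for a client if it appears in at least one of that client's purchases.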

answered 2012-09-05T22:36:59.137