
I am trying to build an optimal binning macro. My first hurdle is creating equidistant buckets. I am using the sashelp.baseball dataset as an example.

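I take the range of logsalary and divide it by 100 to get the distance between buckets. I then want to assign a bucket number to each logsalary value, based on the value being smaller than the bucket's upper limit.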

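The code I have tried is below. I would like to be able to join or merge the bucket limit values onto the data and use a greater-than or less-than clause to attach the bucket number.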

/*Sort the baseball dataset by smallest to largest, removing any missing data*/
PROC SORT
    DATA = sashelp.baseball
        (KEEP = logsalary
         WHERE = (NOT MISSING(logsalary)))
    OUT  = baseball;
    BY logsalary;
RUN;

/*Identify the size of each bucket by splitting the range into 100 equidistant buckets*/
DATA _NULL_;
    RETAIN bin_size;
    SET baseball        END = EOF;

    IF _N_ = 1 THEN DO;
        bin_size = logsalary;
        CALL SYMPUT("min_bin",logsalary);
    END;
    IF EOF      THEN DO;
        bin_size = ((logsalary - bin_size) / 100);
        CALL SYMPUT("bin_size",bin_size);
    END;
RUN;

/*Create a vector to identify each bucket range*/
DATA bin_levels;
    DO bin = 1 TO 100;
        IF bin = 1 THEN DO;
            bin_level = &min_bin.;
            OUTPUT;
        END;
        ELSE DO;
            bin_level = &min_bin. + &bin_size. * bin;
            OUTPUT;
        END;
    END;
RUN;

/*Append a bucket number based on the logsalary being smaller than the next bucket value*/
PROC SQL;
    CREATE TABLE binned_data    AS
    SELECT
          a.*
        , b.bin
        , b.bin_level
    FROM
          baseball              a
    LEFT JOIN
          bin_levels            b   ON b.bin_level > a.logsalary
    ;
QUIT;

I would expect the first ten rows to look like this:

logSalary     bin
4.2121275979  1
4.2195077052  1
4.248495242   1
4.248495242   1
4.248495242   1
4.248495242   1
4.248495242   1
4.3174881135  2
4.3174881135  2
4.3174881135  2
...

Thanks in advance.

Edit: for now, I am going with this solution:

DATA bucketed_data;
    RETAIN bin bin_limit;
    SET baseball;

    IF _n_ = 1 THEN DO;
        bin_limit = logsalary;
        bin = 1;
    END;
    IF logsalary > bin_limit THEN DO;
        bin_limit + &bin_size.;
        bin + 1;
    END; 
RUN;

1 Answer


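There is no need for macro variables. Just put the values into a dataset and combine that dataset with the one you want to bin. Let's use 10 bins instead of 100 so that the results are easier to check.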

First find the minimum and the range:

proc means n min max data=sashelp.baseball;
 var logsalary;
 output out=stats(keep=min range) min=min range=range;
run;
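
If you prefer PROC SQL, the same one-row dataset could be built with its summary functions. This is only a sketch of an equivalent step, not something the rest of the answer depends on; it assumes you want the same dataset name (stats) and variable names (min, range) as above.

```sas
/* Sketch: build the one-row STATS dataset with PROC SQL instead of PROC MEANS. */
/* MIN() and RANGE() ignore missing values of logsalary.                        */
proc sql;
  create table stats as
  select min(logsalary)   as min,
         range(logsalary) as range
  from sashelp.baseball;
quit;
```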

Then use these to bin the data:

DATA bucketed_data;
  SET sashelp.baseball (keep=logsalary);
  if _n_=1 then set stats;
  if not missing(logsalary) then do bin=1 to 10 while(logsalary > min+bin*(range/10)); 
     * nothing to do here ;
  end;
run;
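
The loop stops at the first bin whose upper limit min+bin*(range/10) is not exceeded, so the bin number can also be computed arithmetically. The step below is a minimal sketch of that idea, assuming the stats dataset from the previous step and a non-zero range; the names bucketed_direct and width are made up for the illustration. Values that fall exactly on a bin boundary may land differently from the loop version because of floating-point rounding.

```sas
/* Sketch: compute the bin directly instead of looping.                   */
/* Assumes STATS (variables min and range) exists and that range is > 0.  */
data bucketed_direct;
  set sashelp.baseball (keep=logsalary);
  if _n_ = 1 then set stats;            /* min and range are retained */
  if not missing(logsalary) then do;
    width = range / 10;                 /* distance between bin limits */
    /* ceil() gives the smallest bin whose upper limit is >= logsalary */
    bin = ceil((logsalary - min) / width);
    /* the minimum itself yields 0, so clamp the result into 1..10;    */
    /* min( and max( followed by parentheses are the SAS functions,    */
    /* not the variable named min                                      */
    bin = min(max(bin, 1), 10);
  end;
  drop width;
run;
```

Rows with a missing logsalary keep a missing bin, as in the loop version.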

Let's use PROC MEANS to see how it worked.

proc means n min max ;
 class bin / missing;
 var logsalary;
run;

Results:

[Image: PROC MEANS output by bin, not reproduced here]

Answered 2019-07-09T19:59:25.823