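I have over 600,000 observations that I want to sample by ZIP code (how often each ZIP code appears in the data is proportional to its population density). The key variables in the data are ZIP CODE, ID, and GROUP.

I need to fix my existing SAS code so that when SAS selects a ZIP CODE, it selects all the records in its GROUP. For example, if ID=2 is selected, I also need ID=1 and ID=3, so that I have all the ZIP codes in GROUP=1.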

ID  GROUP   ZIP
1   1   46227
2   1   46227
3   1   46227
4   2   47620
5   3   47433
6   3   47433
7   3   47433
8   4   46135
9   4   46135
10  5   46202
11  5   46202
12  5   46202
13  5   46202
14  6   46793
15  6   46793
16  7   46202
17  7   46202
18  7   46202
19  8   46409
20  8   46409
21  9   46030
22  9   46030
23  9   46030
24  10  46383
25  10  46383
26  10  46383

I have the following SAS code, which samples 1000 obs from the data, but it just selects ZIP codes at random without considering the GROUP variable.

proc freq data=sample;
    tables zip / out=outfreq noprint;
run;

data newfreq error; set outfreq;
    sampnum=(percent*1000)/100;
    _NSIZE_=round(sampnum, 1);
    sampnum=round(sampnum, .01);
    if _NSIZE_=0 then output error;
    if _NSIZE_=0 then delete;
    output newfreq;
run;

data newfreq2; set newfreq error;
    by zip;
    keep zip _NSIZE_;
run;

proc sort data=newfreq2;
    by zip;
run;

proc sort data=sample;
    by zip;
run;

/* proportional stratified sampling */
proc surveyselect data=sample seed=2020 out=sampout sampsize=newfreq2;
    strata zip;
    id id zip;
run;

I hope I have explained my problem clearly. If not, I will try to clarify and/or elaborate on whatever is unclear.

Thanks in advance.

3 Answers

Here is an attempt that seems to work.

data test;
    input id group zip;
    cards;
1 1 46227
2 1 46227
3 1 46227
4 2 47620
5 3 47433
6 3 47433
7 3 47433
8 4 46135
9 4 46135
10 5 46202
11 5 46202
12 5 46202
13 5 46202
14 6 46793
15 6 46793
16 7 46202
17 7 46202
18 7 46202
19 8 46409
20 8 46409
21 9 46030
22 9 46030
23 9 46030
24 10 46383
25 10 46383
26 10 46383
;
run;

/* Attach a uniform random number to every record */
data test;
    set test;
    rand = ranuni(1200);
run;

proc sort data=test;
    by rand;
run;

/* 10 here is how many cases you want to sample initially */
data test;
    set test;
    sample = (_n_ <= 10);
run;

/* Within each group, put any sampled record first */
proc sort data=test;
    by group descending sample;
run;

/* Carry the first record's flag across the whole group:
   keep = 1 marks every record of a group that was hit */
data test;
    set test;
    by group;
    retain keep;
    if first.group then keep = sample;
    drop rand sample;
run;

proc sort data=test;
    by id;
run;

As a bonus, here is an R one-liner that gives the same result:

# 3 here is the number of cases being sampled
test[test$group %in% (test[sample(1:nrow(test),3),]$group),]
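
The same idea (draw records first, then pull in every record of any group that was hit) can also be sketched with PROC SURVEYSELECT and a merge. This is only a rough sketch under assumptions: the data set names `toy`, `picked`, `picked_groups` and `expanded` are made up for illustration, the toy data is just the first nine rows of the question's table, and `sampsize=3` is an arbitrary choice.

```sas
/* Toy frame: first nine rows of the question's data (illustrative only) */
data toy;
    input id group zip @@;
    datalines;
1 1 46227  2 1 46227  3 1 46227  4 2 47620
5 3 47433  6 3 47433  7 3 47433  8 4 46135  9 4 46135
;

/* Step 1: ordinary record-level sample (3 of 9 records, arbitrary size) */
proc surveyselect data=toy out=picked method=srs sampsize=3 seed=2020;
run;

/* Step 2: distinct groups touched by that sample */
proc sort data=picked(keep=group) out=picked_groups nodupkey;
    by group;
run;

/* Step 3: keep every frame record whose group was touched */
proc sort data=toy;
    by group;
run;

data expanded;
    merge toy(in=in_frame) picked_groups(in=in_pick);
    by group;
    if in_frame and in_pick;
run;
```

Two things to keep in mind with this kind of expansion: the final data set is usually larger than the initial sample size, and a group with more records is more likely to be hit than a small one.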
answered 2012-06-19T22:41:22.660

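I'm not sure what you mean. Do you want to sample ZIP codes (and return all obs for each ZIP code), or do you want a sample stratified by ZIP code (meaning N obs per ZIP code)? You may want to look at Example 89.4 in the SAS/STAT User's Guide.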
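
To make the two readings concrete, here is a rough sketch of each on a toy copy of the data. The data set names (`toy`, `zips`, `zips_picked`, `all_obs_in_zip`, `n_per_zip`) and the sample sizes are assumptions chosen for illustration, not anything from the original post.

```sas
/* Toy frame: first nine rows of the question's data (illustrative only) */
data toy;
    input id group zip @@;
    datalines;
1 1 46227  2 1 46227  3 1 46227  4 2 47620
5 3 47433  6 3 47433  7 3 47433  8 4 46135  9 4 46135
;

/* Reading 1: sample ZIP codes, then return every obs in the chosen ZIPs */
proc sort data=toy(keep=zip) out=zips nodupkey;
    by zip;
run;

proc surveyselect data=zips out=zips_picked method=srs sampsize=2 seed=2020;
run;

proc sort data=toy;
    by zip;
run;

data all_obs_in_zip;
    merge toy(in=in_frame) zips_picked(in=in_pick);
    by zip;
    if in_frame and in_pick;
run;

/* Reading 2: stratify by ZIP and draw N obs inside each ZIP (N = 1 here) */
proc surveyselect data=toy out=n_per_zip method=srs sampsize=1 seed=2020;
    strata zip;
run;
```

Under the first reading the number of records returned depends on which ZIP codes come up; under the second it is fixed at N times the number of ZIP codes.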

answered 2012-06-19T22:47:22.227

This "proportional allocation" example is on page 1; item 6 of the article cited below may also help:

proc surveyselect data=frame out=sampsizes_prop sampsize=400;
    strata cityside / alloc=prop;
run;

Article: http://analytics.ncsu.edu/sesug/2013/SD-01.pdf
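
Applied to the question, proportional allocation on the ZIP strata might look like the sketch below, which would stand in for the hand-built `_NSIZE_` data set. It assumes a SAS/STAT release whose STRATA statement supports `alloc=`; the `toy` and `prop_sample` names and `sampsize=6` are made up for illustration. Note that it still samples individual records, so on its own it does not keep a GROUP together.

```sas
/* Toy frame: first nine rows of the question's data (illustrative only) */
data toy;
    input id group zip @@;
    datalines;
1 1 46227  2 1 46227  3 1 46227  4 2 47620
5 3 47433  6 3 47433  7 3 47433  8 4 46135  9 4 46135
;

/* STRATA requires the input to be sorted by the strata variable */
proc sort data=toy;
    by zip;
run;

/* sampsize= is the TOTAL sample size; alloc=prop splits it across
   ZIP strata in proportion to the number of records in each ZIP */
proc surveyselect data=toy out=prop_sample method=srs sampsize=6 seed=2020;
    strata zip / alloc=prop;
run;
```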

answered 2016-08-01T22:52:24.383