sql - PROC SQL vs. DATA step for looking up values from a reference table that contains exceptions

Question

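I am trying to find the tax value for a particular good in a particular city of a particular state. The tax values are in a reference table like this: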

state    city     Good     tax
---------------------------------
all      all      all      0.07
all      all      chicken  0.04
all      jackson  all      0.01
arizona  all      meat     0.02
arizona  phoenix  meat     0.04
arizona  tucson   meat     0.03
hawaii   all      all      0.08
nevada   reno     cigar    0.11
nevada   vegas    cigar    0.13

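Now, say I am looking up the tax for (nevada reno cigar): an exact match exists in the reference table, so the answer is 0.11. But if I look for (nevada reno chicken), no exact match exists; (all all chicken) can be used as the reference instead, and the output will be 0.04.

Can you suggest PROC SQL or match-merge DATA step logic to handle this situation?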

score 1 · Accepted Answer

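This is a bit long. In these situations I use a hash object and iterate through the lookup tree with if/then/else, trying each fallback key until a value is found.

I assume Honolulu chicken should fall under "hawaii all chicken" rather than "all all chicken".

I've included a macro for creating the hash object. The code below uses your data, sets up some rows to look up, and creates an output table with the looked-up taxes.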

data taxes;
informat state $8.   
         city $12.     
         Good $12.    
         tax best.;
input state $ city $ good $ tax;
datalines;
all      all      all      0.07
all      all      chicken  0.04
all      jackson  all      0.01
arizona  all      meat     0.02
arizona  phoenix  meat     0.04
arizona  tucson   meat     0.03
hawaii   all      all      0.08
hawaii   all      chicken  0.11
nevada   reno     cigar    0.11
nevada   vegas    cigar    0.13
;;;
run;

data to_look_up;
informat lu_state $8.   
         lu_city $12.     
         lu_Good $12.  ;
input lu_state $ lu_city $ lu_good $;
datalines;
nevada reno cigar
nevada reno chicken
hawaii honalulu chicken
texas  dallas steak
;;;
run;

%macro create_hash(name,key,data_vars,dataset);
declare hash &name(dataset:&dataset);
%local i n d;
%let n=%sysfunc(countw(&key));
rc = &name..definekey(
    %do i=1 %to %eval(&n-1);
    "%scan(&key,&i)",
    %end;
    "%scan(&key,&i)"
);
%let n=%sysfunc(countw(&data_vars));
%do i=1 %to &n;
    %let d=%scan(&data_vars,&i);
    rc = &name..definedata("&d");
%end;
rc = &name..definedone();
%mend;

data lookup;
set to_look_up;
    format tax best.
         state $8.   
         city $12.     
         Good $12. ;

    if _N_ = 1 then do;
        %create_hash(scg,state city good, tax,"taxes");
    end;

    state = lu_state;
    city =  lu_city;
    good = lu_good;
    tax = .;

    rc = scg.find();
    if missing(tax) then do;
        /*No exact match - check if state/good combo exists*/   
        city = "all";
        rc = scg.find();
        if missing(tax) then do;
            /*No state/good combo -- check state only taxes*/
            good = "all";
            rc = scg.find();
            if missing(tax) then do;
                /*Check good only*/
                good = lu_good;
                state = "all";
                rc = scg.find();
                if missing(tax) then do;
                    /*Default taxes*/
                    good = "all";
                    rc = scg.find();
                end;
            end;
        end;
    end;
run;
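
The nested if/then/else chain above can also be written as a loop over an explicit list of fallback patterns. The following is only a sketch under some assumptions: it reuses the `taxes` and `to_look_up` datasets created above, it tries the same five patterns in the same order as the nested version, and the names `lookup_loop`, `use_s`, `use_c`, `use_g` and `i` are made up for illustration.

```sas
data lookup_loop;
    set to_look_up;
    length state $8 city $12 good $12 tax 8;

    /* Fallback patterns, most specific first (assumed order, copied   */
    /* from the nested version): 1 = use the lookup value, 0 = 'all'.  */
    array use_s[5] _temporary_ (1 1 1 0 0);
    array use_c[5] _temporary_ (1 0 0 0 0);
    array use_g[5] _temporary_ (1 1 0 1 0);

    if _N_ = 1 then do;
        declare hash scg(dataset: "taxes");
        rc = scg.definekey("state", "city", "good");
        rc = scg.definedata("tax");
        rc = scg.definedone();
    end;

    call missing(state, city, good, tax);

    /* Try each pattern in turn and stop at the first key found. */
    do i = 1 to 5 until (rc = 0);
        state = ifc(use_s[i], lu_state, "all");
        city  = ifc(use_c[i], lu_city,  "all");
        good  = ifc(use_g[i], lu_good,  "all");
        rc = scg.find();
    end;

    drop i rc;
run;
```

With the test data above, this should give 0.11 for (nevada reno cigar), 0.04 for (nevada reno chicken), 0.11 for (hawaii honalulu chicken) and 0.07 for (texas dallas steak). Adding another fallback, for example a city-only rule such as "all jackson all", would mean adding one more column to each array and raising the loop bound.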

score 0 · Accepted Answer

SQL is the ideal tool for joining these tables, because it is the most flexible when it comes to joining data.
Using DomPazz's test data:

data taxes;
informat state $8.   
         city $12.     
         Good $12.    
         tax best.;
input state $ city $ good $ tax;
datalines;
all      all      all      0.07
all      all      chicken  0.04
all      jackson  all      0.01
arizona  all      meat     0.02
arizona  phoenix  meat     0.04
arizona  tucson   meat     0.03
hawaii   all      all      0.08
hawaii   all      chicken  0.11
nevada   reno     cigar    0.11
nevada   vegas    cigar    0.13
;;;
run;

data to_look_up;
informat lu_state $8.   
         lu_city $12.     
         lu_Good $12.  ;
input lu_state $ lu_city $ lu_good $;
datalines;
nevada reno cigar
nevada reno chicken
hawaii honalulu chicken
texas  dallas steak
;;;
run;

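The query below joins each row of the to_look_up table to the rows of the taxes table where state matches or state equals 'all' in the taxes table, city matches or city equals 'all' in the taxes table, and good matches or good equals 'all' in the taxes table.

This can result in more than one row of the taxes table matching a single row of the to_look_up table. We can still pick the best match by prioritising the matches: an exact state match ranks ahead of state equal to 'all', and likewise for city and good.

The GROUP BY clause is important here. It should be the unique combination of variables in the to_look_up table. With it, we can select the best match for each row of the to_look_up table and eliminate all other matches.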

proc sql;
create table taxes_applied  as

select  *

/*  Prioritise state, city and good matches.                   */
,   case    when to_look_up.lu_state    eq  taxes.state then 2
            when 'all'                  eq  taxes.state then 1
    end                                 as  match_state

,   case    when to_look_up.lu_city     eq  taxes.city  then 2
            when 'all'                  eq  taxes.city  then 1
    end                                 as  match_city

,   case    when to_look_up.lu_good     eq  taxes.good  then 2
            when 'all'                  eq  taxes.good  then 1
    end                                 as  match_good

from    to_look_up

/*  join taxes table on matching state, city and good or matching 'all' rows.  */
left    join
    taxes
on  (       to_look_up.lu_state eq  taxes.state
        or  'all'               eq  taxes.state
    )
and (       to_look_up.lu_city  eq  taxes.city
        or  'all'               eq  taxes.city
    )   
and (       to_look_up.lu_good  eq  taxes.good
        or  'all'               eq  taxes.good
    )   


/*  Process for each row in to_look_up table.  */ 
group   by  to_look_up.lu_state
        ,   to_look_up.lu_city
        ,   to_look_up.lu_good

/*  Select best match.   */ 
having  match_state eq  max (match_state)
and     match_city  eq  max (match_city)         
and     match_good  eq  max (match_good)

order   by  to_look_up.lu_state
        ,   to_look_up.lu_city
        ,   to_look_up.lu_good
        ,   match_state
        ,   match_city
        ,   match_good      
;

quit;

Joins similar to this one can be used to produce subtotals in summary tables.
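
One thing to watch with three separate max() conditions: as far as I can tell, a lookup such as (arizona jackson meat) would match both "all jackson all" and "arizona all meat", and since no single row holds the maximum on all three columns, the HAVING clause would drop every candidate for that lookup. A possible variant, sketched here under the assumption that state outranks good and good outranks city (the order the hash answer falls back in), folds the three flags into one score and keeps the highest-scoring row. The names `taxes_scored`, `score`, `l`, `t` and `s` are illustrative only.

```sas
proc sql;
create table taxes_scored as
select  s.lu_state
    ,   s.lu_city
    ,   s.lu_good
    ,   s.tax
    ,   s.score
from (
        /* One candidate row per matching taxes row, with a score.       */
        /* Assumed weights: state = 4, good = 2, city = 1, so every      */
        /* combination of exact / 'all' matches gets a distinct score.   */
        select  l.lu_state
            ,   l.lu_city
            ,   l.lu_good
            ,   t.tax
            ,   4 * (l.lu_state eq t.state)
              + 2 * (l.lu_good  eq t.good)
              + 1 * (l.lu_city  eq t.city)   as score
        from    to_look_up as l
        left    join
                taxes as t
        on  (l.lu_state eq t.state or 'all' eq t.state)
        and (l.lu_city  eq t.city  or 'all' eq t.city)
        and (l.lu_good  eq t.good  or 'all' eq t.good)
     ) as s

/*  Keep only the highest-scoring candidate for each lookup row.  */
group   by  s.lu_state
        ,   s.lu_city
        ,   s.lu_good
having  s.score eq max(s.score)

order   by  s.lu_state
        ,   s.lu_city
        ,   s.lu_good
;
quit;
```

Because the key combinations in the taxes table are unique, each lookup row has at most one candidate per score, so one row per lookup should survive. SAS will note in the log that the query requires remerging summary statistics with the original data, which is expected for this pattern. A different business rule for precedence only needs different weights.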

score -1 · Accepted Answer

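If you only need to do this once (I mean it is not an ongoing process), a simple approach is to split your reference data into multiple datasets. One dataset would hold the observations with 'All' in state, city and good. Others would have 'All' in only state, only city, or only good. Others again would have two ALLs, in state/city, city/good or state/good. I guess that makes 8 datasets in total (including the one with no Alls in any variable). Then, since you know which variables are 'All' in each dataset, you can merge accordingly. For example, for the dataset with 'All' in state, city and good, the 0.07 tax applies without any merge. For the dataset where state and city = 'All', you only need to merge on good.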
