sas - Merging on closest value in SAS

Question

Is there a way in SAS to do a fuzzy left merge based on a numeric field? Say I have the two tables below and want to merge on the closest value possible.

Dataset A:

id_1    label
1       a
2       b
3       c
4       d
6       e

Dataset B:

id_2
1.1
2.9
3.4
4.05
5.1

The result should be:

id_2    id_1    label
1.1     1       a
2.9     3       c
3.4     3       c
4.05    4       d
5.1     6       e

Please note that rounding isn't an option here because of the 5.1 case.

score 2 · Accepted Answer

One way is an SQL Cartesian join. This is not very fast, so it is not a good solution for large datasets.

data have_a;
input id_1    label $;
datalines;
1       a
2       b
3       c
4       d
6       e
;;;;
run;

data have_b;
input id_2 ;
datalines; 
1.1     
2.9     
3.4     
4.05   
5.1
;;;;
run;

proc sql;
create table want as
    select B.id_2, A.label , abs(A.id_1-B.id_2) as id_dist
    from have_a A, have_b B
    group by B.id_2
    having id_dist=min(id_dist);
quit;

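Other solutions can be built depending on the size of each dataset (both very large, one large and one small, or both small). For example, PROC FORMAT gives a good result.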

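data have_a_fmt;
retain fmtname 'HAVE_AF';
/* Three reads of have_a, each offset by one row: previous, current, next */
set have_a(rename=(id_1=startpoint label=startlabel));
set have_a(firstobs=2);
set have_a(firstobs=3 rename=(id_1=endpoint label=endlabel)) end=eof;
/* The range for the current id_1 runs between the midpoints to its neighbours */
start=id_1-(id_1-startpoint)/2;
end  =id_1+(endpoint-id_1)/2;
output;

/* First iteration: also write an open-ended low range for the first row */
if _n_=1 then do;
  hlo='l';
  end=start;
  start=.;
  label=startlabel;
  output;
end;
/* Last iteration: also write an open-ended high range for the last row */
if eof then do;
  start=end;
  end=.;
  hlo='h';
  label=endlabel;
  output;
end;
run;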

proc format cntlin=have_a_Fmt;
quit;

data want;
set have_b;
label=put(id_2,HAVE_AF.);
run;

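The format solution is very fast unless have_A is very large (millions of rows or more). It works by doing a look-ahead and look-behind merge (using SET, but the concept is the same) to get three values at once: the previous, the current, and the next. It uses them to define the ranges, and adds a first and a last row with the 'l' (low) and 'h' (high) values of the hlo variable, which basically defines 'negative infinity' and 'positive infinity' as the outer range endpoints.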
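To check the ranges the control dataset produces, the stored format can be listed. This is a small sketch that assumes the format was created in the WORK library, as in the code above. For the sample data the interior boundaries fall at the midpoints 1.5, 2.5, 3.5 and 5, with open-ended ranges below 1.5 (label a) and above 5 (label e).

```sas
/* List the ranges and labels stored for the HAVE_AF format */
proc format library=work fmtlib;
  select HAVE_AF;
run;
```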

score 0 · Accepted Answer

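This could conceptually be faster, although it does require the data to be sorted first...

No lookups, just a linear merge, one step at a time.

You will need 3 SET statements for this to work (one for the current data, and two for consecutive records in the lookup). Then:

Do until (out of data)

1. If you run out of lookup records, set the last one to infinity (or otherwise invalidate it)
2. If the first lookup record is closer than the second, use it and move on to the next data point, keeping the same lookup records
3. Otherwise, advance both lookup records by one, keeping the same data record

End

Depending on how you want to handle ties, make the check in step 2 < or <=, or add other logic to handle them some other way.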
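A minimal sketch of this single-pass walk, with one substitution stated plainly: instead of three sequential SET statements it reads the lookup by row number with POINT=, which avoids the end-of-file handling a sequential read would need. It assumes have_a is not empty and has distinct id_1 values. The names want_linear, pos, p, q, cand_id and next_id are illustrative and not from the answer.

```sas
proc sort data=have_a; by id_1; run;
proc sort data=have_b; by id_2; run;

data want_linear(keep=id_2 id_1 label);
  retain pos 1;          /* row number of the current candidate in have_a; only moves forward */
  set have_b;            /* next data record; the step ends when have_b runs out */

  /* Advance while the following lookup row is strictly closer to id_2 */
  do while (pos < n_a);
    p = pos;
    q = pos + 1;
    set have_a(keep=id_1 rename=(id_1=cand_id)) point=p nobs=n_a;
    set have_a(keep=id_1 rename=(id_1=next_id)) point=q;
    /* strict < keeps the lower id_1 on a tie; use <= to prefer the higher one */
    if abs(next_id - id_2) < abs(cand_id - id_2) then pos = q;
    else leave;
  end;

  /* Read id_1 and label from the winning lookup row */
  p = pos;
  set have_a point=p;
  output;
run;
```

Because both inputs are sorted ascending, the best match for a later id_2 can never sit before the best match for an earlier one, so the pointer never has to move backwards. On the sample data this pairs 1.1 with 1, 2.9 and 3.4 with 3, 4.05 with 4, and 5.1 with 6.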

score 0 · Accepted Answer

Here is what I came up with, though it may need slight modification depending on clarification of the question:

proc sql noprint;
  create table final as
  select a.label,
         b.id_2
  from havea a
  join haveb b on 1 = 1
  left join haveb c on abs(c.id_2 - a.id_1) < abs(b.id_2 - a.id_1)
  where c.id_2 eq .
  ;
quit;

It is similar to Joe's in that it uses a Cartesian join (joining on 1=1), but differs in that it does not use a GROUP BY statement.

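It works by joining to the table haveb a second time to see whether a smaller difference can be found. If it cannot, we know we have the row with the smallest difference, and we keep those rows by applying the filter in the WHERE clause.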
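As written, the query keeps the closest id_2 for each id_1, while the question asks for the closest id_1 for each id_2. A sketch of the reversed direction follows, assuming the have_a and have_b tables created in the first answer. The output name final_by_b is illustrative.

```sas
proc sql noprint;
  create table final_by_b as
  select b.id_2,
         a.id_1,
         a.label
  from have_b b
  join have_a a on 1 = 1
  /* c is any have_a row strictly closer to this id_2 than a is */
  left join have_a c on abs(c.id_1 - b.id_2) < abs(a.id_1 - b.id_2)
  /* keep only the pairs for which no closer row exists */
  where c.id_1 is null
  ;
quit;
```

If two id_1 values are exactly equidistant from an id_2, both rows are kept, so ties need separate handling if a single match is required.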

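I'm not sure whether there is any performance difference. I think Joe's HAVING approach is easier to read and understand, so I would go with that.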
