3

可以使用 SPSS 命令(例如,MERGE FILES)在 2 个 SPSS 数据集之间执行左外连接吗?假设连接字段在任一数据集中都不是唯一的。

示例:让左侧 Dataset1 包含 2 个字段 - ClassNbr 和 Fact1 - 以及这 4 条记录。. .

1 A
1 D
2 A
3 B

让 Dataset2 包含 2 个字段 - ClassNbr 和 Fact2 - 以及这 3 条记录。. .

1 XX
1 XY
3 ZZ

我想在 ClassNbr 上加入 Dataset1 和 Dataset2。期望的结果是一个 6 条记录的数据集,如下所示:

1 A XX
1 A XY
1 D XX
1 D XY
2 A (NULL)
3 B ZZ

我更喜欢使用 SPSS 命令(而不是 SQL/Python/等)的解决方案。

4

2 回答 2

2

据我所知,你不能直接这样做。解决此问题的一种可能方法是将数据从长格式“重塑”为宽格式(使用casestovars),进行合并,然后重新整形为长格式(使用varstocases)。下面是一个使用示例(如果需要对代码进行任何说明,请询问)。

data list free / ClassNbr (F1) Fact1 (A1).
begin data
1 A
1 D
2 A
3 B
end data.
dataset name data1.

casestovars 
/id = ClassNbr.

data list free / ClassNbr (F1) Fact2 (A2).
begin data
1 XX
1 XY
3 ZZ
end data.
dataset name data2.

casestovars 
/id = ClassNbr.

match files file = 'data1'
/file = 'data2'
/by ClassNbr.
execute.

varstocases
/make Fact1 FROM Fact1.1 to Fact1.2
/null = KEEP.
varstocases
/make Fact2 FROM Fact2.1 to Fact2.2
/null = KEEP.

这会产生一些您不想要的情况,在这里我刚刚定义了一组命令来识别这些情况并将它们取出(我相信这可以改进以提高效率)。

*now cleaning up the extra records.
compute flag = 0.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = lag(Fact2) flag = 1.
select if flag = 0.
execute.
if Fact1 = " " and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact2 = lag(Fact2) and Fact1 = " " flag = 1.
select if flag = 0.
execute.

我确信可以使它更健壮(可能制作一些自定义的python函数)。但希望这有助于您入门。

于 2011-10-04T14:28:41.883 回答
2

如果您安装“STATS CARTPROD”扩展包,您可以执行此操作。使用此扩展,您可以创建笛卡尔积作为创建外连接的中间步骤。

从 SPSS 22 开始,您可以直接从程序菜单 Extra->Extension Bundles->Install and Download extension bundles 下载它。您也可以从此处手动下载并安装它:https ://www.ibm.com/developerworks/community/files/app?lang=en#/file/d0afcd4e-6d5d-4779-84ef-2b68bc81b861 请 注意,您必须拥有安装“Python Essentials for SPSS”以使其工作。

*** create the example data.

DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.

DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
END DATA.
DATASET NAME data2.

在使用“STATS CARTPROD”扩展时,我在变量名中使用大写字母时遇到了问题。同样重要的是,“classnbr”在两个数据集中具有不同的变量名称。

*** create cartesian product using the STATS CARTPROD extension.

DATASET ACTIVATE data1.

STATS CARTPROD INPUT2=data2
   VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = cart.
EXECUTE.

*** create an equi join.

SELECT IF classnbr1 = classnbr2.
EXECUTE.

DELETE VARIABLES classnbr2.

现在包括在 data2 中不匹配的案例。

*** create left outer join
* assuming both data sets are ordered by classnbr1 and fact1

ADD FILES 
   /FILE = cart
   /FILE = data1
   /BY classnbr1 fact1.
EXECUTE.
DATASET NAME outer_join.
DATASET ACTIVATE outer_join.

COMPUTE select=1.
IF (length(fact2)=0 AND classnbr1=LAG(classnbr1) AND fact1=LAG(fact1)) select=0.
EXECUTE.

SELECT IF select = 1.
EXECUTE.
DELETE VARIABLES select.

但是,当使用非常大的数据集时,您可能会遇到一些麻烦。在这种情况下,笛卡尔积将是巨大的。

为了稍微减轻这种影响,您可以在生成笛卡尔积之前从数据集中删除所有与其他数据集没有对应匹配的案例。

这是如何做到的:

*** create the example data.
*** (I added an additional case to the second data set, which will be deleted 
in the result, since it has no match in the first data set)

DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.

DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
4 XY
END DATA.
DATASET NAME data2.


*** select cases who (don't) have a matching correspondent in the other dataset

** Create a list of unique key values of data set data2 
** (In this Example the key Value is classnbr2).

DATASET ACTIVATE data2.
DATASET COPY data2_keylist.
DATASET ACTIVATE data2_keylist.

* Assuming the data set is already sorted by the key value.
* Mark the first occurance of every key kalue in the data set.
COMPUTE list = 1.
IF classnbr2 = LAG(classnbr2) list = 0.
SELECT IF list=1.
EXECUTE.

* Delete all variables except the (now unique) key value

MATCH FILES
   /FILE *
   /KEEP classnbr2.
EXECUTE.

** Match the list of data2 key values to data1 in order to mark
** which cases of data1 have at least one correspondent case in data 2.
DATASET ACTIVATE data1.
MATCH FILES
   /FILE *
   /TABLE data2_keylist
      /RENAME classnbr2=classnbr1
      /IN data2
   /BY classnbr1.
EXECUTE.

** Remove cases from data1 who don't have a correspondent in data2
** and store them in another dataset, because we need to add them later.
DATASET COPY date1_nomatch.
SELECT IF data2=1.
EXECUTE.

DATASET ACTIVATE date1_nomatch.
SELECT IF data2=0.
EXECUTE.

** Now doing the same for the other data set.

** Create a list of unique key values of data set data1 
** (In this Example the key Value is classnbr1).

DATASET ACTIVATE data1.
DATASET COPY data1_keylist.
DATASET ACTIVATE data1_keylist.

* Assuming the data set is already sorted by the key value.
* Mark the first occurance of every key kalue in the data set.
COMPUTE list = 1.
IF classnbr1 = LAG(classnbr1) list = 0.
SELECT IF list=1.
EXECUTE.

* Delete all variables except the (now unique) key value

MATCH FILES
   /FILE *
   /KEEP classnbr1.
EXECUTE.

** Match the list of data2 key values to data1 in order to mark
** which cases of data1 have at least one correspondent case in data 2.
DATASET ACTIVATE data2.
MATCH FILES
   /FILE *
   /TABLE data1_keylist
      /RENAME classnbr1=classnbr2
      /IN data1
   /BY classnbr2.
EXECUTE.


** Remove cases from data1 who don't have a correspondent in data2.
SELECT IF data1=1.
EXECUTE.

*** create a cartesian product of the two reduced datasets.
DATASET ACTIVATE data1.

STATS CARTPROD INPUT2=data2
   VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = outer_join.
EXECUTE.

*** create an equi join.
SELECT IF classnbr1 = classnbr2.
EXECUTE.

DELETE VARIABLES classnbr2.


*** create left outer join by adding the cases from date1_nomatch.
DATASET ACTIVATE outer_join.
ADD FILES 
   /FILE = *
   /FILE = date1_nomatch
   /BY classnbr1 fact1
   /DROP data2.
EXECUTE.

* Some cleaning up.
DATASET CLOSE data1_keylist.
DATASET CLOSE date1_nomatch.
DATASET CLOSE data2_keylist.
于 2014-07-23T18:12:57.513 回答