loops - Identifying unique string values in a list of elements

Question

I have a large unbalanced panel dataset in which each observation can take several string values, each stored in a separate variable:

obs    year   var1    var2    var3    newval
1      1990   str1    str2    str3    3
1      1991   str1    str4    str5    2
2      1990   str3    str4            2
2      1991   str4    str5            1
2      1993   str3    str5            0
2      1994   str7                    1

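For each observation at each point in time, I need to count how many of the string values are "new", meaning that they did not appear for that observation in any earlier year.

How should I approach this in Stata?

Thank you.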

score 1 · Accepted Answer

This question was also posted on Statalist. Here is my answer. I tend not to use merges unless the problem starts with two or more files.

clear
input obs     yr   str4 var1 str4  var2 str4   var3
1        90   str1    str2    str3
1        91    str1    str4    str5
2        90    str3    str4
2        91    str4    str5
2        93    str3    str5
2        94    str7
end
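* stack var1-var3 into a single long variable
reshape long var , i(obs yr) j(which)
* flag the earliest occurrence of each nonmissing string within obs
bysort obs var (yr) : gen new = _n == 1 & !missing(var)
* total the flags within each obs-year: running sum, then copy the last value
bysort obs yr : replace new = sum(new)
by obs yr : replace new = new[_N]
* return to the original wide layout
reshape wide var, i(obs yr) j(which)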

(More) Further comments focus on efficiency, which here means speed rather than space. (Storage space could also be a concern for the poster.)

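Without restructuring, done here with reshape, the problem is a triple loop: over identifiers, over the observations for each identifier, and over variables. The two outer loops could possibly be collapsed into one. But explicit loops over observations are usually slow in Stata.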
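To make the contrast concrete, here is a rough sketch of what the explicit-loop route might look like. It is not code from either answer. It assumes the wide layout and variable names (obs, yr, var1-var3) from the input block above, and that no string is repeated within a single row. The count command inside the loops is itself a pass over all observations, which is why this approach tends to be slow.

```stata
* Sketch only: explicit loops on the wide data (assumes obs, yr, var1-var3)
sort obs yr
gen new = 0
forvalues i = 1/`=_N' {                      // loop over rows
    foreach v of varlist var1-var3 {         // loop over string variables
        local s = `v'[`i']
        if `"`s'"' != "" {
            * earlier rows of the same obs that already hold this string
            quietly count if obs == obs[`i'] & yr < yr[`i'] & ///
                (var1 == `"`s'"' | var2 == `"`s'"' | var3 == `"`s'"')
            if r(N) == 0 {
                quietly replace new = new + 1 in `i'
            }
        }
    }
}
```

The forvalues/foreach pair makes every comparison visible, at the cost of one count per nonmissing cell.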

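With the restructuring solution proposed by Dimitriy and myself, the by: operations go straight to compiled code and are relatively fast; reshape is interpreted code and requires file manipulations, so it can be slow. On the other hand, with some experience reshape can be written down quickly, and it is well worth acquiring the fluency that comes with that experience. In addition to the help for reshape and the manual entry, see the FAQ on reshape that I wrote at http://www.stata.com/support/faqs/data-management/problems-with-reshape/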

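Another consideration is what else you want to do with this kind of dataset. If there are other problems of a similar character, they will often be easier to solve with the long structure produced by reshape, so keeping that structure would be a good idea.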
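As an illustration of why the long structure can be convenient, the following sketch answers two related questions that are hypothetical and not part of the original post: the year in which each string first appears, and the number of distinct strings per identifier. It assumes the data are already long after reshape long var, i(obs yr) j(which); the names first, first_yr and n_distinct are made up for this example.

```stata
* Sketch only: run on the long data, i.e. after
*   reshape long var , i(obs yr) j(which)
drop if missing(var)
* flag the earliest row for each string within obs
bysort obs var (yr) : gen byte first = _n == 1
* year in which each string first appears for that obs
by obs var : gen first_yr = yr[1]
* number of distinct strings ever seen for each obs
egen n_distinct = total(first), by(obs)
```

Each of these is one or two by: statements in long form, whereas the wide form would need a loop over var1-var3.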

score 1 · Accepted Answer

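There may be a more elegant way to do this.

The main idea is that I first reshape the data and number the occurrences of each string in chronological order. The reshape makes this easier. Then I aggregate with collapse, counting only the first instance of each string. Finally I merge the result back onto your original data.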

#delimit;

preserve;
    tempfile newval;

    reshape long var, i(obs year) j(s); // stack all the vars on top of each other
    bys obs var (year): gen n=_n if !missing(var); // number the appearance of each string in chronological order
    replace n=0 if n>1 & !missing(n); // only count the first instance

    collapse (sum) mynewval=n, by(obs year); // add up the counts
    save `newval';
restore;

merge 1:1 obs year using `newval', nogen;

compare newval mynewval;
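compare only reports how the two variables differ. If you would rather have the do-file stop when the computed counts disagree with the expected ones, a small sketch using assert follows. It assumes the variable names newval and mynewval from the code above, and it switches the delimiter back to carriage return first.

```stata
* Sketch only: stop with an error if any computed count differs
#delimit cr
assert newval == mynewval
```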
