stata - how to detect specific value combination (or condition) of variables within group

Question

I have a survey dataset that contains household ids and individual ids within each household; individual 1 is the interviewee him/herself. Another variable records each individual's relationship to the interviewee (for example, 2 for spouse, 3 for parents, and so on). The data structure looks like the following:

???

Now what I want to do is detect the occurrence of certain values in var1 and, if it occurs, whether the values of var1 and var2 satisfy a certain condition.

For example, if var1 and var2 satisfy

(var1 == 3 & var2 == 1) | (var1 == 4 & var2 == 1)

then I want to assign the value 1 to a newly generated variable, say var3, for every individual in the same group (the household in this case, to represent family structure), and 0 otherwise.

This does not seem like a big problem, and I suppose I should use some

by group: egen

or

by group: gen

command, but I'm not sure. I tried commands like

gen l_w_p = 0
by hhid: replace l_w_p = 1 if (var1 == 3 & a2004 == 1) | (var2 == 4 & a2004 == 1)
by hhid: replace l_w_p = 2 if (var1 == 3 & a2004 == 2) & (var2 == 4 & a2004 == 2)

but it seems it doesn't work. Does that need some kind of loop?

score 2 · Accepted Answer

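@Dimitriy V. Masterov gave a good specific answer, but there is scope to address the question more broadly.

As his answer exemplifies,

Questions of the form: Does any member of this group have this characteristic? can be handled by using egen's max() function over groups on a true-or-false expression that yields 0 or 1, i.e. an indicator (or, in bad terminology popular in some fields, a dummy).

A small leap of thought shows that

Questions of the form: Do all members of this group have this characteristic? can be handled by using egen's min() function over groups on a true-or-false expression that yields 0 or 1, and so on.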
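To make the any/all distinction concrete, here is a minimal sketch on made-up data. The variable unemployed and the values entered are assumptions for illustration only; any_unemployed and all_unemployed are hypothetical names.

```stata
clear
input hhid unemployed
1 0
1 1
2 0
2 0
3 1
3 1
end

* any member unemployed? the max of a 0/1 expression is 1 if at least one row is 1
bysort hhid: egen any_unemployed = max(unemployed == 1)

* all members unemployed? the min of a 0/1 expression is 1 only if every row is 1
bysort hhid: egen all_unemployed = min(unemployed == 1)

list, sepby(hhid)
```

Note that unemployed == 1 evaluates to 0 when unemployed is missing, so a missing value counts here as "does not have the characteristic". Whether that is what you want depends on your data.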

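The whole story is fleshed out in an FAQ, "How do I create a variable recording whether any members of a group (or all members of a group) possess some characteristic?" (So the meta-lesson is to exploit the resources available to you.)

One step beyond are questions about the other members of a group, also discussed in an FAQ, "How do I create variables summarizing for each individual properties of the other members of a group?"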
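One way such "other members" questions can be approached is to compute a group total and then subtract each individual's own contribution. The sketch below is an illustration under assumptions, not a quotation from the FAQ; employed, n_employed and any_other_employed are hypothetical names and the data are made up.

```stata
clear
input hhid employed
1 1
1 0
1 0
2 1
2 1
3 0
3 0
end

* number of employed members in the household, repeated on every row
bysort hhid: egen n_employed = total(employed == 1)

* subtract the row's own 0/1 contribution, then ask whether anyone is left
generate byte any_other_employed = (n_employed - (employed == 1)) > 0

list, sepby(hhid)
```

In household 1 the single employed member gets 0 (nobody else is employed) while the other two members get 1.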

For more comprehensive discussion that may be useful, see this article and this article.

Two other comments:

a. In code like this

gen l_w_p = 0
by hhid: replace l_w_p = 1 if (var1 == 3 & a2004 == 1) | (var2 == 4 & a2004 == 1)
by hhid: replace l_w_p = 2 if (var1 == 3 & a2004 == 2) & (var2 == 4 & a2004 == 2)

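the by: prefix has no effect on what is done. The code still works at the individual level, and the prefix does not extend the operation to groups. That is why it "doesn't work", which in general is a rather useless error report.

b. Mild abstraction helps in explaining a problem, but abstraction in naming variables just makes your code harder to read. I would not use variable names such as var1, var2, etc., which only add to the burden of remembering what is what. Use evocative names such as any_unemployed or any_married or whatever. This is not just personal style: when you ask other people to think about your code (as here), being able to read it easily is a great help.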
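To illustrate comment a., the sketch below contrasts a row-level replace under by: with a group-level egen, using the condition stated in the question. The data are made up, var1 and var2 are the question's placeholder names, and row_flag and hh_flag are hypothetical names.

```stata
clear
input hhid var1 var2
1 1 2
1 3 1
2 1 1
2 4 2
end

* with replace ... if, only the rows that satisfy the condition change,
* whether or not a by: prefix is present
generate byte row_flag = 0
bysort hhid: replace row_flag = 1 if inlist(var1, 3, 4) & var2 == 1

* egen's max() evaluates the 0/1 expression on each row and then
* spreads the household maximum to every row of that household
bysort hhid: egen hh_flag = max(inlist(var1, 3, 4) & var2 == 1)

list, sepby(hhid)
```

In household 1 only the second row has row_flag equal to 1, while hh_flag is 1 for both rows. No loop is needed.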

score 2 · Accepted Answer

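I am having a hard time figuring out what you are asking. A good strategy is to give an example of your data and the desired output, simplified as much as possible to the essence of the problem. This is much easier than describing the data in words.

Let's start simple. Suppose you have data that looks like this:

hhid   x  
   1   1  
   1   2  
   2   0  
   2   1

and you want to tag households where x is ever 2. One way to do that is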

bys hhid: egen tag=max(cond(x==2,1,0))

This will produce:

hhid   x   tag  
   1   1     1  
   1   2     1  
   2   0     0  
   2   1     0

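Working from the inside out: for each member, you check whether x is 2. If it is, that member gets a 1. If not, he gets a 0. The max() then calculates the maximum of that binary indicator across the whole household.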
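The same inside-out logic can be written as two separate steps, which may be easier to inspect. This is a sketch on the example data above; is2 is a hypothetical name for the intermediate indicator.

```stata
clear
input hhid x
1 1
1 2
2 0
2 1
end

* step 1: row-level indicator, 1 if this member has x == 2, otherwise 0
generate byte is2 = cond(x == 2, 1, 0)

* step 2: household-level maximum of that indicator, copied to every member
bysort hhid: egen tag = max(is2)

list, sepby(hhid)
```

Because x == 2 already evaluates to 1 or 0 in Stata, cond(x == 2, 1, 0) could be shortened to (x == 2); the longer form is kept here to match the one-line version.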

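The condition can get more complicated, and cond() functions can be nested like Russian dolls.

Here is a more complicated example. Suppose you want to tag households in this dataset where someone has x = 2 (marked with a 1) or y >= 5 (marked with a 2):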

hhid   x   y  
   1   1   1  
   1   2   2  
   2   0   3  
   2   1   4  
   3   1   5  
   3   3   5

We check the x condition first, and then the y condition if the x condition is false:

bys hhid : egen tag=max(cond(x==2,1,cond(y>=5,2,0)))

This yields:

hhid   x   y   tag  
   1   1   1     1  
   1   2   2     1  
   2   0   3     0  
   2   1   4     0  
   3   1   5     2  
   3   3   5     2
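With the nested cond() approach, a household that contains both kinds of member ends up with 2, because max() picks the larger code. If the two conditions need to be tracked independently, one alternative is a separate indicator for each. This is a sketch on the same example data; any_x2 and any_y5 are hypothetical names.

```stata
clear
input hhid x y
1 1 1
1 2 2
2 0 3
2 1 4
3 1 5
3 3 5
end

* one household-level flag per condition
bysort hhid: egen any_x2 = max(x == 2)

* y >= 5 is also true when y is missing, so missings are excluded explicitly
bysort hhid: egen any_y5 = max(y >= 5 & !missing(y))

list, sepby(hhid)
```

Here household 1 has any_x2 equal to 1, household 3 has any_y5 equal to 1, and household 2 has 0 for both.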
