I have some output from an online task that I need to wrangle into a usable form for scoring, and doing so correctly depends on several conditions. I tried using if and else statements, but I am having a hard time meeting all the conditions this way.

A description of the data and the conditions: the first column holds one of three possible values for the person's response: 'yes', 'no', or 'NR' (meaning no response given yet). The second column is a counter that is supposed to run from 1 to 5 sequentially, but it repeats a value if the person held down the key for too long. For each count in the second column, I want exactly one corresponding response in the first column, which should be the first response given ('yes' or 'no') during that count. If no response is given during the entire count, it should stay as 'NR'. The count then starts over from 1 to 5. For example, this input data:

   response  count
0       yes      1
1       yes      1
2       yes      1
3        no      1
4       yes      1
5        no      2
6        no      2
7        no      2
8        NR      3
9        NR      3
10       no      3
11       NR      3
12       NR      4
13       NR      4
14       NR      4
15      yes      5
16      yes      5
17       NR      1
18       NR      1
19       NR      2
20      yes      3
21      yes      3
22      yes      3
23       no      4
24      yes      4
25       no      5

Should reduce to this:

  response  count
0      yes      1
1       no      2
2       no      3
3       NR      4
4      yes      5
5       NR      1
6       NR      2
7      yes      3
8       no      4
9       no      5

It is a bit of a confusing problem, and so far I haven't found a combination of conditions or if/else statements applied to arrays that gives me the outcome I want. Any help or ideas would be much appreciated!

Source code for input data:

import pandas as pd

response = ['yes','yes','yes','no','yes','no','no','no','NR','NR','no','NR','NR','NR','NR','yes','yes','NR','NR','NR','yes','yes','yes','no','yes','no']
count = [1,1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,1,1,2,3,3,3,4,4,5]
data_dict = {'response': response,
             'count': count}
data = pd.DataFrame(data_dict)

1 Answer

Try this:

df = data  # the question's DataFrame is named `data`
df.groupby(['count', (df['count'] != df['count'].shift()).cumsum()])['response']\
  .apply(lambda x: 'NR' if (x.nunique()==1) & (x == 'NR').all() else x.loc[x!='NR'].iloc[0])\
  .sort_index(level=1).reset_index(level=1, drop=True)

Output:

count
1    yes
2     no
3     no
4     NR
5    yes
1     NR
2     NR
3    yes
4     no
5     no
Name: response, dtype: object

Details:

First, let's generate a series that labels each group of consecutive repeated values:

(df['count'] != df['count'].shift()).cumsum()
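
To see what that expression produces, here is a small sketch run on the question's counter values. The names `starts` and `run_id` are only illustrative and do not appear in the answer's code.

```python
import pandas as pd

count = [1,1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,1,1,2,3,3,3,4,4,5]
df = pd.DataFrame({'count': count})

# True wherever the counter differs from the previous row
# (the first row is compared against NaN, so it is also True)
starts = df['count'] != df['count'].shift()

# A running total of those flags gives every consecutive run its own label
run_id = starts.cumsum()

print(run_id.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10]
```

Note that the second pass through counts 1-5 gets labels 6-10, which is what keeps it separate from the first pass.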

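Using this series together with 'count', we can build the response groups.

If the number of unique responses in a group equals 1 and that response is 'NR', return 'NR' for that group. Otherwise, return the first response that is not 'NR' using x.loc[x!='NR'].iloc[0]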
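
The same rule can also be written as a named helper, which may be easier to read than the lambda. This is only a sketch of one possible variant, not the code from the answer: the helper name `first_answer` and the series name `run_id` are made up for illustration, and it groups on the run label alone so the result comes back as a two-column frame like the one requested in the question.

```python
import pandas as pd

response = ['yes','yes','yes','no','yes','no','no','no','NR','NR','no','NR','NR',
            'NR','NR','yes','yes','NR','NR','NR','yes','yes','yes','no','yes','no']
count = [1,1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,1,1,2,3,3,3,4,4,5]
df = pd.DataFrame({'response': response, 'count': count})

def first_answer(responses):
    """Return the first non-'NR' value in a run, or 'NR' if the run has none."""
    answered = responses[responses != 'NR']
    return answered.iloc[0] if len(answered) else 'NR'

# Label each consecutive run of identical counter values (hypothetical name)
run_id = (df['count'] != df['count'].shift()).cumsum().rename('run_id')

result = (
    df.groupby(run_id)
      .agg(response=('response', first_answer), count=('count', 'first'))
      .reset_index(drop=True)
)
print(result)
```

Because each run contains a single counter value, taking the 'first' count per run is enough, and grouping by the run label keeps the rows in their original order without a separate sort.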

Answered 2020-09-23T18:21:58.827