python - Computing the set difference among two consecutive rows in dataframe

Question

I need some tips to make a calculation.

My dataframe looks like the following:

text_id     name     date                words
1           John     2018-01-01          {ocean, blue}
1           John     2018-02-01          {ocean, green} 
2           Anne     2018-03-01          {table, chair}
3           Anne     2018-03-01          {hot, cold, warm}
3           Mark     2018-04-01          {hot, cold}
3           Ethan    2018-05-01          {warm, icy}
4           Paul     2018-01-01          {cat, dog, puppy}
4           John     2018-02-01          {cat}
5           Paul     2018-03-01          {cat, sheep, deer}

Here, text_id identifies a specific text (SAME TEXT_ID = SAME TEXT). The name column is the person who edited the text, and the date column is the date on which that user made the edit. The words column holds the words that make up the text after the user's edit.

The words column is a set. I need to add an additional column, added_words, which contains the set difference between each edit and the previous edit of THE SAME text. The goal is to see what changed between one edit and the next one IN THE SAME TEXT.

The sample output here would be:

text_id     name     date          words                 added_words
1           John     2018-01-01    {ocean, blue}         {ocean, blue}
1           John     2018-02-01    {ocean, green}        {green}
2           Anne     2018-03-01    {table, chair}        {table, chair}
3           Anne     2018-03-01    {hot, cold, warm}     {hot, cold, warm}
3           Mark     2018-04-01    {hot, cold}           {}
3           Ethan    2018-05-01    {warm, icy}           {warm, icy}
4           Paul     2018-01-01    {cat, dog, puppy}     {cat, dog, puppy}
4           John     2018-02-01    {cat}                 {}
5           Paul     2018-03-01    {cat, sheep, deer}    {cat, sheep, deer}

Note that the added_words column contains the set difference between the words column in row i and the words column in row i-1, but only if row i and row i-1 have the same text_id, because I only want the difference within the SAME text (same text_id), not across different ones. As the sample output shows, the first edit of each text has no previous row, so it keeps its full set of words.
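To make the rule concrete, here is a minimal plain-Python sketch of it, without pandas. The helper name added_words_per_row and the (text_id, words) tuple layout are only for illustration, and the rows are assumed to be already ordered by text_id and date, as in the sample above.

```python
# A few rows from the sample data, as (text_id, words) pairs.
rows = [
    (1, {"ocean", "blue"}),
    (1, {"ocean", "green"}),
    (3, {"hot", "cold", "warm"}),
    (3, {"hot", "cold"}),
    (3, {"warm", "icy"}),
]


def added_words_per_row(rows):
    """Return, for each row, the words not present in the previous row
    of the same text_id (hypothetical helper, for illustration only)."""
    result = []
    prev_id, prev_words = None, set()
    for text_id, words in rows:
        if text_id == prev_id:
            # Same text as the previous row: keep only the new words.
            result.append(words - prev_words)
        else:
            # First edit of this text: every word counts as added.
            result.append(set(words))
        prev_id, prev_words = text_id, words
    return result


added = added_words_per_row(rows)
for (text_id, words), new in zip(rows, added):
    print(text_id, sorted(words), sorted(new))
```

Words that an edit removes never show up in the result, which is why the second edit of text 3 yields an empty set.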

Any tips on this will be extremely helpful.

EDIT:

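To turn the words column from strings into sets (splitting on each comma plus any whitespace after it, so that no word keeps a leading space), do:

df['words'] = df['words'].str.strip('{} ').str.split(r',\s*').apply(set)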
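For a reproducible starting point, the sketch below builds the sample frame with words stored as strings and parses them with a small helper. The helper name parse_words is an assumption made for this example, not part of pandas; it trims whitespace around each word and maps an empty "{}" to an empty set.

```python
import pandas as pd


def parse_words(raw):
    """Turn a string such as '{ocean, blue}' into {'ocean', 'blue'}
    (hypothetical helper, for illustration only)."""
    inner = raw.strip().strip("{}")
    return {w.strip() for w in inner.split(",") if w.strip()}


df = pd.DataFrame({
    "text_id": [1, 1, 2, 3, 3, 3, 4, 4, 5],
    "name": ["John", "John", "Anne", "Anne", "Mark", "Ethan",
             "Paul", "John", "Paul"],
    "date": ["2018-01-01", "2018-02-01", "2018-03-01", "2018-03-01",
             "2018-04-01", "2018-05-01", "2018-01-01", "2018-02-01",
             "2018-03-01"],
    "words": ["{ocean, blue}", "{ocean, green} ", "{table, chair}",
              "{hot, cold, warm}", "{hot, cold}", "{warm, icy}",
              "{cat, dog, puppy}", "{cat}", "{cat, sheep, deer}"],
})

# Each cell becomes a real Python set, ready for set arithmetic.
df["words"] = df["words"].apply(parse_words)
print(df)
```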

score 1 · Accepted Answer

Since you have sets, we can get the difference simply by subtracting each row from its shifted predecessor, while using groupby:

df['added_words'] = df.groupby('text_id')\
                      .apply(lambda x: (x['words'] - x['words'].shift()).fillna(x['words']))\
                      .to_numpy()

Note: if you have pandas < 0.24.0, use .values instead of .to_numpy()

Output

   text_id   name        date               words         added_words
0        1   John  2018-01-01       {blue, ocean}       {blue, ocean}
1        1   John  2018-02-01      {ocean, green}             {green}
2        2   Anne  2018-03-01      {table, chair}      {table, chair}
3        3   Anne  2018-03-01   {hot, warm, cold}   {hot, warm, cold}
4        3   Mark  2018-04-01         {hot, cold}                  {}
5        3  Ethan  2018-05-01         {icy, warm}         {icy, warm}
6        4   Paul  2018-01-01   {cat, puppy, dog}   {cat, puppy, dog}
7        4   John  2018-02-01               {cat}                  {}
8        5   Paul  2018-03-01  {cat, sheep, deer}  {cat, sheep, deer}
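One caveat: assigning through .to_numpy() is positional, and groupby returns the groups ordered by key, so the approach above relies on df already being sorted by text_id. A sketch of an index-aligned variant, assuming only that the rows within each text_id are in chronological order, uses groupby(...).shift() and an explicit set difference. The small frame below is deliberately not sorted by text_id to show this.

```python
import pandas as pd

# Rows of two texts interleaved, i.e. not sorted by text_id.
df = pd.DataFrame({
    "text_id": [3, 1, 3, 1],
    "words": [
        {"hot", "cold", "warm"},
        {"ocean", "blue"},
        {"hot", "cold"},
        {"ocean", "green"},
    ],
})

# Previous edit of the same text; NaN for the first edit of each text.
prev = df.groupby("text_id")["words"].shift()

# Subtract the previous set where there is one, else keep the full set.
df["added_words"] = [
    cur - old if isinstance(old, set) else set(cur)
    for cur, old in zip(df["words"], prev)
]
print(df)
```

Because shift() keeps the original index, the result lines up with df whatever the row order. If the edits are not in chronological order, sorting by date first would be needed.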

score 1 · Accepted Answer

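Use diff and fillna. diff performs the set subtraction within each group, and fillna restores the full set for the first row of each text_id, where there is no previous row: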

df['added_words'] = df.groupby('text_id').words.diff().fillna(df.words)

In [162]: df
Out[162]:
   text_id   name        date               words         added_words
0        1   John  2018-01-01       {ocean, blue}       {ocean, blue}
1        1   John  2018-02-01      {green, ocean}             {green}
2        2   Anne  2018-03-01      {chair, table}      {chair, table}
3        3   Anne  2018-03-01   {warm, cold, hot}   {warm, cold, hot}
4        3   Mark  2018-04-01         {cold, hot}                  {}
5        3  Ethan  2018-05-01         {warm, icy}         {warm, icy}
6        4   Paul  2018-01-01   {cat, puppy, dog}   {cat, puppy, dog}
7        4   John  2018-02-01               {cat}                  {}
8        5   Paul  2018-03-01  {cat, deer, sheep}  {cat, deer, sheep}
