I need some tips to make a calculation.
My dataframe looks like the following:
text_id  name   date        words
1        John   2018-01-01  {ocean, blue}
1        John   2018-02-01  {ocean, green}
2        Anne   2018-03-01  {table, chair}
3        Anne   2018-03-01  {hot, cold, warm}
3        Mark   2018-04-01  {hot, cold}
3        Ethan  2018-05-01  {warm, icy}
4        Paul   2018-01-01  {cat, dog, puppy}
4        John   2018-02-01  {cat}
5        Paul   2018-03-01  {cat, sheep, deer}
In the table, text_id identifies a specific text (SAME TEXT_ID = SAME TEXT). The name column is the person who edited the text. The date column is the date on which that user made the edit. The words column contains the words that make up the text after the user's edit.
Each entry in the words column is a set. I need to add an additional column, added_words, which contains the set difference between the current edit and the previous edit of THE SAME text, i.e. the words present after the current edit that were not present after the previous one. For the first edit of a text, added_words is simply the full words set. This is to check what the difference is between one edit and the next one IN THE SAME TEXT.
The sample output here would be:
text_id  name   date        words               added_words
1        John   2018-01-01  {ocean, blue}       {ocean, blue}
1        John   2018-02-01  {ocean, green}      {green}
2        Anne   2018-03-01  {table, chair}      {table, chair}
3        Anne   2018-03-01  {hot, cold, warm}   {hot, cold, warm}
3        Mark   2018-04-01  {hot, cold}         {}
3        Ethan  2018-05-01  {warm, icy}         {warm, icy}
4        Paul   2018-01-01  {cat, dog, puppy}   {cat, dog, puppy}
4        John   2018-02-01  {cat}               {}
5        Paul   2018-03-01  {cat, sheep, deer}  {cat, sheep, deer}
Note that the added_words column contains the set difference between the words column in row i and the words column in row i-1, but only if the text_id in row i and row i-1 is the same: I only want the difference within the SAME text (same text_id), not across different ones.
Any tips on this will be extremely helpful.
EDIT:
To turn the words column from strings into sets (splitting on a comma plus any following whitespace, so that no word keeps a leading space), do:
df['words'] = df['words'].str.strip('{}').str.split(r',\s*').apply(set)
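For reference, here is a minimal sketch of the row i versus row i-1 logic described above. It is only one possible approach, not a confirmed solution. It assumes the rows can be ordered by text_id and date, and that the first edit of each text should keep its full words set (as in the sample output). The helper parse_words and the variable prev_words are names made up for this illustration.

```python
import pandas as pd

# Sample data from the question, with words still stored as strings.
df = pd.DataFrame({
    "text_id": [1, 1, 2, 3, 3, 3, 4, 4, 5],
    "name": ["John", "John", "Anne", "Anne", "Mark", "Ethan", "Paul", "John", "Paul"],
    "date": pd.to_datetime([
        "2018-01-01", "2018-02-01", "2018-03-01", "2018-03-01", "2018-04-01",
        "2018-05-01", "2018-01-01", "2018-02-01", "2018-03-01",
    ]),
    "words": [
        "{ocean, blue}", "{ocean, green}", "{table, chair}", "{hot, cold, warm}",
        "{hot, cold}", "{warm, icy}", "{cat, dog, puppy}", "{cat}", "{cat, sheep, deer}",
    ],
})


def parse_words(raw):
    """Hypothetical helper: turn '{a, b}' into {'a', 'b'}, trimming whitespace."""
    return {w.strip() for w in raw.strip("{}").split(",") if w.strip()}


df["words"] = df["words"].apply(parse_words)

# Assumption: edits of the same text are ordered by date.
df = df.sort_values(["text_id", "date"]).reset_index(drop=True)

# Previous edit's words within the same text_id; NaN for the first edit of a text.
prev_words = df.groupby("text_id")["words"].shift()

# Set difference against the previous edit; the first edit keeps all of its words.
df["added_words"] = [
    cur - old if isinstance(old, set) else set(cur)
    for cur, old in zip(df["words"], prev_words)
]

print(df[["text_id", "name", "words", "added_words"]])
```

The groupby on text_id is what keeps the comparison inside a single text, so the first row of each text never sees the last row of the previous one.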