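I have a DataFrame in which each row contains a list of strings. I've written a function that runs a Bernoulli trial on each string in a list: each word is removed with a given probability (0.5 here), i.e. it is dropped whenever its trial succeeds. See below: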
import numpy as np
import pandas as pd

def bernoulli_trial(sublist, prob=0.5):
    # create mask of trial outcomes, one per object in sublist
    mask = np.random.binomial(n=1, p=prob, size=len(sublist))
    # drop the tokens whose Bernoulli trial succeeded
    transformed_sublist = [token for delete, token in zip(mask, sublist) if not delete]
    return transformed_sublist
This works as expected when I apply it to each row of the DataFrame:
df = pd.DataFrame(data={'store': [1, 2, 3],
                        'colours': [['red', 'blue', 'yellow', 'green', 'brown', 'pink'],
                                    ['black', 'white'],
                                    ['purple', 'orange', 'cyan', 'mauve']]})
df['colours'] = df['colours'].apply(bernoulli_trial)
Out:
0 [red, green]
1 [black]
2 [orange, cyan]
Name: colours, dtype: object
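However, instead of applying the function uniformly to every sublist and every string, I now want to apply conditions to (a) whether a given sublist is passed to the function at all (yes/no) and (b) which strings within that sublist the trial is applied to (i.e. by specifying that I only want to test certain colours).
I think I have a workable solution for part (a): wrapping the Bernoulli function in another function that checks whether a given condition is met (e.g. does the sublist contain more than 2 objects?). This works (see below), but I'm not sure whether there is a more efficient (read: more Pythonic) way to do it.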
def sublist_condition_check(sublist):
    if len(sublist) > 2:
        sublist = bernoulli_trial(sublist)
    else:
        sublist = sublist
    return sublist
Note that any sublist that does not meet the condition should remain unchanged.
df['colours'].apply(sublist_condition_check)
Out:
0 [red, brown]
1 [black, white] # this sublist had only two elements so remains unchanged
2 [mauve]
Name: colours, dtype: object
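For reference, the length check above can be written more compactly. The following is only a sketch of one possible shorthand, using a conditional expression; it repeats bernoulli_trial from the question so that it runs on its own, and it behaves the same as the wrapper above:

```python
import numpy as np

def bernoulli_trial(sublist, prob=0.5):
    # one Bernoulli outcome per token; 1 means "delete this token"
    mask = np.random.binomial(n=1, p=prob, size=len(sublist))
    return [token for delete, token in zip(mask, sublist) if not delete]

def sublist_condition_check(sublist):
    # run the trial only on sublists with more than two objects;
    # shorter sublists are returned unchanged
    return bernoulli_trial(sublist) if len(sublist) > 2 else sublist
```

It can be applied in the same way, with df['colours'].apply(sublist_condition_check).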
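However, I have no idea how to apply conditional logic to each individual word. For example, suppose I only want to apply the trial to a pre-specified list of colours, ['red','mauve','black'], provided the sublist has passed the sublist condition check. How would I go about this?
Pseudocode for what I'm hoping to achieve is as follows: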
for sublist in df:
    if len(sublist) > 2:  # check if sublist contains more than two objects
        for colour in sublist:  # cycle through each colour within the sublist
            if colour in ['red','mauve','black']:
                colour = bernoulli_trial(colour)  # only run bernoulli if colour in list
            else:
                colour = colour  # if colour not in list, colour remains unchanged
    else:
        sublist = sublist  # if len(sublist) <= 2, sublist remains unchanged
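I know that a literal interpretation of this won't work, because the original bernoulli_trial function takes a list rather than a single string. But hopefully it describes what I'm trying to achieve.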
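To make the intended behaviour concrete, here is a minimal sketch of one way the pseudocode could be realised. The function name conditional_bernoulli_trial and the parameters targets and min_len are hypothetical, introduced only for illustration. The idea is to keep a single mask per sublist, as in bernoulli_trial, but to delete a token only when its trial succeeds and it is one of the targeted colours:

```python
import numpy as np

def conditional_bernoulli_trial(sublist, targets, prob=0.5, min_len=3):
    # targets and min_len are assumed names, not part of the original code
    # (a) sublist-level condition: leave short sublists unchanged
    if len(sublist) < min_len:
        return sublist
    # one Bernoulli outcome per token; 1 means "trial succeeded"
    mask = np.random.binomial(n=1, p=prob, size=len(sublist))
    # (b) token-level condition: drop a token only if its trial succeeded
    # AND it is one of the targeted colours; everything else is kept
    return [token for delete, token in zip(mask, sublist)
            if not (delete and token in targets)]
```

Because Series.apply forwards extra keyword arguments to the function, this could be called as df['colours'].apply(conditional_bernoulli_trial, targets={'red', 'mauve', 'black'}). Drawing an outcome for every token and ignoring the ones for non-targeted colours is a simplification; the targeted colours are still each removed with probability prob.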