python - Audit Script function.sample query?

Question

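I am writing an audit script that should take a 1%, 3%, or 5% sample from each category in a DataFrame, unless 1% comes to fewer than 3 rows, in which case it should return 3 samples. The problem is that the categories change depending on the Excel file. The syntax for sampling from one category as described above is: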

df2.groupby('Category')['shoe'].apply(
    lambda x: x.sample(n=3) if x.size*0.01 < 3 else x.sample(frac=0.01))

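What I want is to loop over every category in the file that was read, sample each one, and finally combine the results into a single DataFrame.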

import pandas as pd

df = pd.read_excel(r"C:\Users\***\Desktop\***.xlsx")

df2 = df.loc[(df['Track Item']=='Y')]
print(len(df2))

categories = df2['Category'].unique
subcategories = df2['Subcategory'].unique

def sample_per(df2):
    if len(df2) >= 15000:
        return df2.groupby('Category').apply(
            lambda x: x.sample(n=3) if x.size*0.01 < 3 else x.sample(frac=0.01))
    elif len(df2) < 15000 and len(df2) > 10000:
        return df2.groupby('Category').apply(
            lambda x: x.sample(n=3) if x.size*0.03 < 3 else x.sample(frac=0.03))
    else:
        return df2.groupby('Category').apply(
            lambda x: x.sample(n=3) if x.size*0.05 < 3 else x.sample(frac=0.05))

final = sample_per(df2)

df.loc[df['Retailer Item ID'].isin(final['Retailer Item ID']), 'Track Item'] = 'Audit'

df.to_csv('Test_2.csv',index=False)

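The code runs, but it only brings back 1%, 3%, or 5% of the whole file rather than that percentage of each category. Any help would be greatly appreciated. (The spacing is a bit off because the lines don't fit in the box.)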

I also tried the following, in an attempt to loop over all the categories:

return (df2.groupby('Category')[lambda x: x in categories].apply(lambda x: 
x.sample(n=3) if x.size*0.01 < 3 else x.sample(frac=0.01)))

score 0 · Accepted Answer

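Change unique to unique(). Try these edits to sample_per: keep the apply function. Add a category argument. Replace df2.groupby('Category') with df2.loc[df2.Category == category, :]. Check that sample_per works for a single category. Then you can call it in a loop over categories. Good work.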
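
The steps above could look roughly like the sketch below. It is only one way to do it, under a few assumptions of mine: the helper names pick_fraction and sample_category are made up for illustration, the small DataFrame stands in for the Excel data, and the sample size is capped at the group size because sample(n=3) raises an error on a category with fewer than 3 rows. It also uses len(group) rather than x.size, since on a DataFrame size counts rows times columns, not rows.

```python
import pandas as pd


def pick_fraction(n_rows):
    """Sampling fraction from the total row count (thresholds taken from the question)."""
    if n_rows >= 15000:
        return 0.01
    if n_rows > 10000:
        return 0.03
    return 0.05


def sample_category(df2, category, frac, min_rows=3):
    """Sample one category: frac of its rows, but at least min_rows.

    Assumption: if the category has fewer than min_rows rows, take all of them.
    """
    group = df2.loc[df2["Category"] == category, :]
    if len(group) * frac < min_rows:
        return group.sample(n=min(min_rows, len(group)))
    return group.sample(frac=frac)


# Made-up data standing in for the filtered Excel sheet.
df2 = pd.DataFrame({
    "Category": ["A"] * 200 + ["B"] * 4 + ["C"] * 2,
    "Retailer Item ID": range(206),
})

frac = pick_fraction(len(df2))          # 206 rows, so the 5% branch applies
categories = df2["Category"].unique()   # note the parentheses

# Sample each category separately, then combine into one DataFrame.
final = pd.concat([sample_category(df2, c, frac) for c in categories])
```

With final built this way, the df.loc[...] = 'Audit' line from the question should work unchanged, since final keeps the Retailer Item ID column.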
