My first question is how to strip all the non-numeric parts from the numbers, such as "100M" and "0N#", which should become 100 and 0 respectively.
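import re
import pandas as pd

df = pd.read_csv(yourfile, header=None)
df.columns = ['ID'] + list(df.columns)[1:]
# keep digits and the decimal point; non-string values pass through unchanged
df = (df.stack()
        .apply(lambda v: re.sub(r'[^0-9.]', '', v) if isinstance(v, str) else v)
        .astype(float)
        .unstack())
df.groupby('ID').agg(['std', 'mean'])
Here .stack() turns the DataFrame into a Series, .apply() calls the lambda on every value, re.sub() removes every character that is not a digit or a decimal point, .astype() converts the result to a number, and .unstack() turns the Series back into a DataFrame. Because the pattern keeps the decimal point, this works for integers and floats alike.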
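To check the cleaning step on its own, here is a minimal sketch using only the standard library. The helper name `to_number` is made up for illustration, and keeping the decimal point is an assumption about how a value such as "98.4M" should be read.

```python
import re

def to_number(value):
    """Strip everything except digits and '.' from strings; pass numbers through."""
    if isinstance(value, str):
        return float(re.sub(r'[^0-9.]', '', value))
    return float(value)

print(to_number("100M"))  # 100.0
print(to_number("0N#"))   # 0.0
print(to_number(98.4))    # 98.4
```

Note that a string containing no digits at all (for example "N#") would be reduced to an empty string, and float('') raises ValueError, so such values would need separate handling.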
Given a particular column, I want to split the rows by ID and then output the mean and standard deviation for each ID.
# for all columns
df.groupby('ID').agg(['std', 'mean'])
# for specific column
df.groupby('ID')['<colname>'].agg(['std', 'mean'])
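
For reference, this is roughly what the grouped aggregation computes for a single column, sketched in plain Python. The (ID, value) pairs are made-up sample values, not taken from the data in this answer. pandas' std defaults to the sample standard deviation (ddof=1), which is what statistics.stdev computes as well.

```python
from collections import defaultdict
from statistics import mean, stdev

# (ID, value) pairs standing in for one cleaned column; made-up sample values
rows = [(1, 100.0), (1, 90.0), (2, 55.0), (2, 65.0)]

# split the rows by ID
groups = defaultdict(list)
for row_id, value in rows:
    groups[row_id].append(value)

# stdev() uses the n - 1 denominator; it needs at least two values per group
stats = {row_id: (stdev(vals), mean(vals)) for row_id, vals in groups.items()}
print(stats)
```

A group with a single row has no sample standard deviation: stdev() raises StatisticsError in that case, whereas pandas returns NaN.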

Here is the data used in the example:
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
s="""
1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
"""
yourfile = StringIO(s)