python - Reading a CSV file that needs data cleaning before loading it into a dataframe

Question

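I am reading a CSV file into pandas. The problem is that the file needs some rows removed, and some values have to be derived from other rows. My current idea starts like this: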

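    import csv

    index_count = 0
    current_ward = None

    # down_path is the downloaded file object from earlier in my script
    with open(down_path.name) as csv_file:
        rdr = csv.DictReader(csv_file)
        for row in rdr:
            # the first column has an empty header, so its key is ''
            row_type = row['']
            if row_type == 'Ward Summary':
                # remember the ward for the individual rows that follow
                current_ward = row['Name']
            else:
                name = row['Name']
                count1 = row['count1']
                count2 = row['count2']
                count3 = row['count3']
                index_count += 1
                # write to someplace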

,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0

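The end result needs to be a dataframe that I can concatenate to an existing dataframe.

The brain-dead approach is to do the conversion, write a new CSV file, and then read that in. That seems un-Pythonic.

I need to take out the summary rows, merge the ones with similar names (Aloha 1 and Aloha I), strip the Individual Statistics label, and put the Aloha 1 label on each individual. I also need to add which month the data comes from. As you can see, the data needs some work :)

The desired output is Jan-16, Aloha 1, John, 1,2,3

Aloha 1 comes from the summary row above the individual's row.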

score 1 · Accepted Answer

My personal preference is to do everything in pandas.

Maybe something like this...

    # imports
    import pandas as pd
    from io import StringIO

    # read in your data
    data = """,Name,count1,count2,count3
    Ward Summary,Aloha 1,35,0,0
    Individual Statistics,John,35,0,0
    Ward Summary,Aloha I,794,0,0
    Individual Statistics,Walter,476,0,0
    Individual Statistics,Deborah,182,0,0"""
    df = pd.read_csv(StringIO(data))

    # give the first column a better name for convenience
    df = df.rename(columns={'Unnamed: 0': 'Desc'})

    # create a mask for the Ward Summary lines
    ws_mask = df['Desc'] == 'Ward Summary'

    # create a ward_name column that has names only for Ward Summary lines
    # (NaN everywhere else)
    df['ward_name'] = df['Name'].where(ws_mask)

    # forward fill the missing ward names from the previous summary line
    df['ward_name'] = df['ward_name'].ffill()

    # get rid of the ward summary lines
    df = df.loc[~ws_mask]

    # get rid of the Desc column (drop returns a new frame, so assign it back)
    df = df.drop('Desc', axis=1)
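
The question also asks for the near-duplicate ward names (Aloha 1 and Aloha I) to be merged, for a month column, and for the result to be appended to an existing dataframe. Here is a rough sketch of those remaining steps. The `ward_aliases` mapping, the `'Jan-16'` label and the `existing` dataframe are all hypothetical stand-ins, not something from the original post:

```python
from io import StringIO

import pandas as pd

data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""

# Same cleaning steps as above, condensed.
df = pd.read_csv(StringIO(data)).rename(columns={'Unnamed: 0': 'Desc'})
ws_mask = df['Desc'] == 'Ward Summary'
df['ward_name'] = df['Name'].where(ws_mask).ffill()
df = df.loc[~ws_mask].drop(columns='Desc')

# Hypothetical alias table: map each misspelled ward name to its canonical form.
ward_aliases = {'Aloha I': 'Aloha 1'}
df['ward_name'] = df['ward_name'].replace(ward_aliases)

# Hypothetical month label; in practice it might come from the file name.
df.insert(0, 'month', 'Jan-16')

# Put the columns in the order of the desired output.
df = df[['month', 'ward_name', 'Name', 'count1', 'count2', 'count3']]

# Hypothetical existing dataframe with the same columns.
existing = pd.DataFrame([{
    'month': 'Dec-15', 'ward_name': 'Aloha 1', 'Name': 'John',
    'count1': 10, 'count2': 0, 'count3': 0,
}])

# Append the cleaned rows to the existing data.
combined = pd.concat([existing, df], ignore_index=True)
```

A hand-written alias table is only one way to merge the names; if the misspellings follow a pattern, a string replacement on `ward_name` might be enough.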

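Yes, this passes over the data more than once, so a smarter single-pass algorithm could do better. But if performance is not your main concern, I think this approach wins on conciseness and readability.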
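
For comparison, a single-pass version could look roughly like the sketch below. It builds on the `csv.DictReader` loop from the question; the helper name `read_ward_rows` is made up for illustration, and the `int` conversion assumes the count columns always hold whole numbers:

```python
import csv
from io import StringIO

import pandas as pd

data = """,Name,count1,count2,count3
Ward Summary,Aloha 1,35,0,0
Individual Statistics,John,35,0,0
Ward Summary,Aloha I,794,0,0
Individual Statistics,Walter,476,0,0
Individual Statistics,Deborah,182,0,0"""


def read_ward_rows(csv_file):
    """Yield one dict per individual row, tagged with the ward name
    taken from the most recent summary row above it."""
    current_ward = None
    for row in csv.DictReader(csv_file):
        # The first column has an empty header, so its key is ''.
        if row[''] == 'Ward Summary':
            current_ward = row['Name']
        else:
            yield {
                'ward_name': current_ward,
                'Name': row['Name'],
                # Assumption: counts are always integers.
                'count1': int(row['count1']),
                'count2': int(row['count2']),
                'count3': int(row['count3']),
            }


# One pass over the file, then a single DataFrame construction.
rows = list(read_ward_rows(StringIO(data)))
df = pd.DataFrame(rows)
```

With a real file, the `StringIO(data)` argument would be replaced by an open file handle, for example `open(down_path.name, newline='')`.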
