I have GBs of data in this text format:
1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775
1,'Acct02','Daves Tacos'
2,'centrifugal','N'
1000,334,787,143
1001,749,132,987
The first column indicates the row's content and repeats as an index series for each account (Acct01, Acct02, ...). Rows with index values (1, 2) are associated one-to-one with each account (the parent). I want to flatten this data into a dataframe that joins the account-level data (index = 1, 2) with its related series data (1000, 1001, 1002, 1003, ...) as child rows in a flat df.
Desired df:
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1000,576,686,837
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1001,683,170,775
'Acct02','Daves Tacos','centrifugal','N',1000,334,787,143
'Acct02','Daves Tacos','centrifugal','N',1001,749,132,987
I've been able to do this with a very mechanical, very slow row-by-row process:
import pandas as pd
import time

file = 'C:\\PythonData\\AcctData.txt'
t0 = time.time()
pdata = []  # Parsed fields of the current line
acct = []   # Account-level (parent) data
row = {}    # Assembly container
# Set dataframe columns
df = pd.DataFrame(columns=['Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT03'])
# Open the file and read through it line by line
with open(file, 'r') as f:
    for line in f:
        # Strip each field
        pdata = [x.strip() for x in line.split(',')]
        # Use the index to route data: indexes 1 and 2 fill acct[],
        # anything else is a child series row
        indx = int(pdata[0])
        if indx == 1:
            acct.clear()
            acct.append(pdata[1])
            acct.append(pdata[2])
        elif indx == 2:
            acct.append(pdata[1])
            acct.append(pdata[2])
        else:
            row.clear()
            row['Account'] = acct[0]
            row['Name'] = acct[1]
            row['Type'] = acct[2]
            row['Flag'] = acct[3]
            row['Counter'] = pdata[0]
            row['CNT01'] = pdata[1]
            row['CNT02'] = pdata[2]
            row['CNT03'] = pdata[3]
            df = df.append(row, ignore_index=True)

t1 = time.time()
totalTimeDf = t1 - t0
TTDf = '%.3f' % totalTimeDf
print(TTDf + " Seconds to Complete df: " + file)
print(df)
Result:
0.018 Seconds to Complete df: C:\PythonData\AcctData.txt
Account Name Type Flag Counter CNT01 CNT02 CNT03
0 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1000 576 686 837
1 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1001 683 170 775
2 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1000 334 787 143
3 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1001 749 132 987
The row-by-row loop works, but it's painfully slow. I suspect there's a very simple, pythonic way to import and organize this into a df. It seems an OrderedDict will organize the data correctly, like so:
import csv
from collections import OrderedDict

od = OrderedDict()
file_name = 'C:\\PythonData\\AcctData.txt'
try:
    csvfile = open(file_name, 'rt')
except OSError:
    print("File not found")

csvReader = csv.reader(csvfile, delimiter=",")
for row in csvReader:
    key = row[0]
    od.setdefault(key, []).append(row)
od
Result:
OrderedDict([('1',
[['1', "'Acct01'", "'Freds Autoshop'"],
['1', "'Acct02'", "'Daves Tacos'"]]),
('2',
[['2', "'3-way-Cntrl'", "'Y'"],
['2', "'centrifugal'", "'N'"]]),
('1000',
[['1000', '576', '686', '837'], ['1000', '334', '787', '143']]),
('1001',
[['1001', '683', '170', '775'], ['1001', '749', '132', '987']])])
From the OrderedDict I can't figure out how to combine keys 1 and 2 and associate them with the specific series keys (1000, 1001), then append into a df. How do I go from the OrderedDict to a df while flattening the parent/child data? Or is there a better way to process this data?
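(For reference, here is the shape of a pandas-native approach, sketched against the sample above rather than the real file. It assumes each account's index-1 row always precedes its index-2 row, which in turn precedes that account's series rows: parse every line with `read_csv`, lift the parent fields out of the index-1/index-2 rows, and forward-fill them onto the series rows.)

```python
import io
import pandas as pd

sample = """1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775
1,'Acct02','Daves Tacos'
2,'centrifugal','N'
1000,334,787,143
1001,749,132,987"""

# Read every line into four columns; 3-field account rows get NaN in CNT03.
df = pd.read_csv(io.StringIO(sample), header=None,
                 names=['Counter', 'CNT01', 'CNT02', 'CNT03'],
                 quotechar="'")
idx = df['Counter']
# Pull the parent fields out of the index-1/index-2 rows and
# forward-fill them down onto the series rows that follow.
df['Account'] = df['CNT01'].where(idx == 1).ffill()
df['Name'] = df['CNT02'].where(idx == 1).ffill()
df['Type'] = df['CNT01'].where(idx == 2).ffill()
df['Flag'] = df['CNT02'].where(idx == 2).ffill()
# Keep only the series rows, in the desired column order.
flat = (df[idx > 2]
        .reset_index(drop=True)
        [['Account', 'Name', 'Type', 'Flag',
          'Counter', 'CNT01', 'CNT02', 'CNT03']])
print(flat)
```

For the real file, `io.StringIO(sample)` would be replaced by the path; since everything stays vectorized, it should scale far better than per-row `df.append`.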