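I have downloaded 120 csv files of data (10 years, one file per month).

I am using the code below to combine all of them into a single file in chronological order, e.g. from 1/1/09 to 1/1/19.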

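from glob import glob

# Sorting the names gives chronological order only if the file names sort that way.
files = sorted(glob('*.csv'))
with open('cat.csv', 'w') as fi_out:
    for i, fname_in in enumerate(files):
        with open(fname_in, 'r') as fi_in:
            for i_line, line in enumerate(fi_in):
                # Keep the header row only from the first file.
                if i_line > 0 or i == 0:
                    fi_out.write(line)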

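This all works fine, but now I have also downloaded the same type of data for a different product. I want to sort all of this new data chronologically as well, but place it side by side with the old dataset.

I get an error like this (tracebacks in the edits below):

Any help would be appreciated.

Edit 1: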

Traceback (most recent call last):
  File "/Users/myname/Desktop/collate/asdas.py", line 4, in <module>
    result = pd.merge(data1[['REGION', 'TOTALDEMAND', 'RRP']], data2, on='SETTLEMENTDATE')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 61, in merge
    validate=validate)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 551, in __init__
    self.join_names) = self._get_merge_keys()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 871, in _get_merge_keys
    lk, stacklevel=stacklevel))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'SETTLEMENTDATE'

Edit 2:

import pandas as pd
df1 = pd.read_csv("product1.csv") 
df2 = pd.read_csv("product2.csv") 
combine = pd.merge(df1, df2, on='DATE', how='outer')
combine.columns = ['product1_price', 'REGION1', 'DATE', 'product2_price', 'REGION2']
combine[['DATE','product1_price','product2_price']]
combine.to_csv("combine.csv",index=False)

Error:

Traceback (most recent call last):
  File "/Users/george/Desktop/collate/asdas.py", line 5, in <module>
    combine.columns = ['VICRRP', 'REGION1', 'SETTLEMENTDATE', 'QLD1RRP', 'REGION2']
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 4389, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 646, in _set_axis
    self._data.set_axis(axis, labels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 3323, in set_axis
    'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 9 elements, new values have 5 elements

3 Answers

Load the data into DataFrames:

import pandas as pd
data1 = pd.read_csv("filename1.csv") 
data2 = pd.read_csv("filename2.csv") 

Merge the two DataFrames on SETTLEMENTDATE:

result = pd.merge(data1, data2, on='SETTLEMENTDATE')

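This assumes a one-to-one relationship on SETTLEMENTDATE between the two DataFrames. If that is not the case, there will be duplicate rows.

Edit: to drop the "PERIOD TYPE" column, select only the columns you need from data1 before merging (the selection must still include the merge key, SETTLEMENTDATE):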

result = pd.merge(data1[['REGION', 'TOTALDEMAND', 'RRP', 'SETTLEMENTDATE']], data2, on='SETTLEMENTDATE')
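
On the one-to-one assumption mentioned above: a minimal sketch of how it could be checked, using the `validate` parameter of `pd.merge` and `Series.duplicated`. The sample rows and the date format are made up for illustration and are not taken from the asker's files.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical sample data: data2 repeats one SETTLEMENTDATE.
data1 = pd.DataFrame({
    'SETTLEMENTDATE': ['2009/01/01 00:30:00', '2009/01/01 01:00:00'],
    'RRP': [20.5, 21.0],
})
data2 = pd.DataFrame({
    'SETTLEMENTDATE': ['2009/01/01 00:30:00', '2009/01/01 00:30:00'],
    'RRP': [18.0, 18.5],
})

# A quick check before merging: are any keys repeated?
has_duplicates = bool(data2['SETTLEMENTDATE'].duplicated().any())

# Without validation, the repeated key silently yields one output row per match.
result = pd.merge(data1, data2, on='SETTLEMENTDATE')

# With validate='one_to_one', pandas raises MergeError instead.
try:
    pd.merge(data1, data2, on='SETTLEMENTDATE', validate='one_to_one')
    merge_rejected = False
except MergeError:
    merge_rejected = True
```

If the check fails, one option is to call `drop_duplicates(subset='SETTLEMENTDATE')` on the offending frame before merging, assuming the repeated rows really are redundant.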
answered 2019-01-18T10:08:44.533

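As another option, you can use how='outer' when some dates are missing from one of the two csv files, so that all dates from both csv files are kept.

A complete example: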

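import pandas as pd

df1 = pd.DataFrame({
    'SETDATE':['01-06-2013','01-08-2013'],
    'Region':['VIC1','VIC1'],
    'RRP':[1,8]})
df2 = pd.DataFrame({
    'SETDATE':['01-06-2013','01-08-2014'],
    'Region':['QLD1','QLD1'],
    'RRP':[2,4]})

# Keep every date that appears in either frame.
combine = pd.merge(df1, df2, on='SETDATE', how='outer')
# Rename by name rather than by position, so the result does not depend on column order.
combine = combine.rename(columns={'RRP_x': 'VICRRP', 'Region_x': 'Reg1',
                                  'RRP_y': 'QLD1RRP', 'Region_y': 'Reg2'})
print(combine[['SETDATE','VICRRP','QLD1RRP']])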

The result:

      SETDATE  VICRRP  QLD1RRP
0  01-06-2013     1.0      2.0
1  01-08-2013     8.0      NaN
2  01-08-2014     NaN      4.0
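
The length-mismatch error in the question's Edit 2 comes from assigning five names to combine.columns when the merged frame has nine columns. A sketch of one way to avoid counting columns at all is the `suffixes` parameter of `pd.merge`, which labels the overlapping columns at merge time. The suffix names `_VIC` and `_QLD` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'SETDATE': ['01-06-2013', '01-08-2013'],
    'Region': ['VIC1', 'VIC1'],
    'RRP': [1, 8]})
df2 = pd.DataFrame({
    'SETDATE': ['01-06-2013', '01-08-2014'],
    'Region': ['QLD1', 'QLD1'],
    'RRP': [2, 4]})

# Columns present in both frames (other than the key) get the given suffixes.
combine = pd.merge(df1, df2, on='SETDATE', how='outer',
                   suffixes=('_VIC', '_QLD'))

# Select the columns of interest by name.
prices = combine[['SETDATE', 'RRP_VIC', 'RRP_QLD']]
```

This works regardless of how many extra columns the real files contain, since only columns whose names collide are suffixed.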
answered 2019-01-18T10:35:43.863

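All of the code below is for Python 3.

Python has a standard library module called csv.

Its readers are lazy by default, meaning they read data from the file only as rows are requested, so it should not consume much memory!

The code looks something like this; forgive me if there are problems in it.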

import csv
vicfilename = 'filename1.csv'
qldfilename = 'filename2.csv'
mergedfilename = 'newfile.csv'

with open(mergedfilename, 'w', newline='') as mergedfile:
    fieldnames = ['SETTLEMENTDATE', 'VIC DEMAND', 'VIC RRP', 'QLD DEMAND', 'QLD RRP']
    writer = csv.DictWriter(mergedfile, fieldnames=fieldnames)
    writer.writeheader()
    with open(vicfilename, 'r', newline='') as vicfile:
        vicreader = csv.DictReader(vicfile)
        with open(qldfilename, 'r', newline='') as qldfile:
            qldreader = csv.DictReader(qldfile)

            for vicrow in vicreader:
                for qldrow in qldreader:
                    if vicrow['SETTLEMENTDATE'] == qldrow['SETTLEMENTDATE']:
                        writer.writerow({'SETTLEMENTDATE': vicrow['SETTLEMENTDATE'],
                                         'VIC DEMAND': vicrow['TOTALDEMAND'],
                                         'VIC RRP': vicrow['RRP'],
                                         'QLD DEMAND': qldrow['TOTALDEMAND'],
                                         'QLD RRP': qldrow['RRP']})
                        break
                qldfile.seek(0)
                qldreader = csv.DictReader(qldfile)

Code improvements welcome!
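
One possible improvement, as a sketch only: the nested loop above rereads the QLD file for every VIC row. Indexing the QLD rows by date in a dict first lets each file be read once, at the cost of holding one file's rows in memory. The function name `merge_on_date` and the in-memory sample data are made up for illustration; the column names are the ones used in the code above.

```python
import csv
import io

def merge_on_date(vicfile, qldfile, mergedfile):
    """Join two open CSV files on SETTLEMENTDATE using a dict lookup."""
    # Index the QLD rows by date; setdefault keeps the first row per date,
    # matching the 'break' on first match in the nested-loop version.
    qld_by_date = {}
    for row in csv.DictReader(qldfile):
        qld_by_date.setdefault(row['SETTLEMENTDATE'], row)

    fieldnames = ['SETTLEMENTDATE', 'VIC DEMAND', 'VIC RRP', 'QLD DEMAND', 'QLD RRP']
    writer = csv.DictWriter(mergedfile, fieldnames=fieldnames)
    writer.writeheader()
    for vicrow in csv.DictReader(vicfile):
        qldrow = qld_by_date.get(vicrow['SETTLEMENTDATE'])
        if qldrow is None:
            continue  # skip dates that have no QLD row, as the original does
        writer.writerow({'SETTLEMENTDATE': vicrow['SETTLEMENTDATE'],
                         'VIC DEMAND': vicrow['TOTALDEMAND'],
                         'VIC RRP': vicrow['RRP'],
                         'QLD DEMAND': qldrow['TOTALDEMAND'],
                         'QLD RRP': qldrow['RRP']})

# Usage with hypothetical in-memory data instead of real files.
vic = io.StringIO("SETTLEMENTDATE,TOTALDEMAND,RRP\n"
                  "2009/01/01 00:30:00,5000,20.5\n"
                  "2009/01/01 01:00:00,4900,19.0\n")
qld = io.StringIO("SETTLEMENTDATE,TOTALDEMAND,RRP\n"
                  "2009/01/01 00:30:00,6000,18.0\n")
out = io.StringIO()
merge_on_date(vic, qld, out)
merged_lines = out.getvalue().splitlines()
```

With real files, the three arguments would be file objects opened with newline='' as in the code above.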

answered 2019-01-18T10:43:11.197