python - Trouble using csv to separate values in a large data set

Question

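I am working with a large data set (OMNI), and I am looking for a way to parse the data and put each line of it into a list ~~of arrays~~. I am fairly new to Python, so I am learning as I go.

This is what I have so far: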

import Tkinter, tkFileDialog
import csv 

#Choose the file that you want to read from
root = Tkinter.Tk()
root.withdraw()


file_path = tkFileDialog.askopenfilename()
current_file = open(file_path , "r")

#OMNI_2001 = {}

reader = csv.reader(current_file, delimiter= ' ')

output_file = open('newdata.txt','w')
out = csv.writer(output_file)

for row in reader:
    out.writerow(row)
    print row
#print row[0::1]

A line of the data I am reading in looks like this:

2001 182  0  0 60 60   7   2  71   -695    320  0.22   -173    6.07    5.23    0.46   -2.00    0.69   -1.93    0.38    2.09   331.0  -329.5    24.5    19.8   8.66  101479.  1.90   0.64   2.25   8.0    6.67   29.65    3.55   12.73   -1.78   -0.70   288  -142   146    -3   -22    20    19   0.99

But after I write out the new data, it looks like this:

2001,182,,0,,0,60,60,,,7,,,2,,71,,,-695,,,,320,,0.22,,,-173,,,,6.07,,,,5.23,,,,0.46,,,-2.00,,,,0.69,,,-1.93,,,,0.38,,,,2.09,,,331.0,,-329.5,,,,24.5,,,,19.8,,,8.66,,101479.,,1.90,,,0.64,,,2.25,,,8.0,,,,6.67,,,29.65,,,,3.55,,,12.73,,,-1.78,,,-0.70,,,288,,-142,,,146,,,,-3,,,-22,,,,20,,,,19,,,0.99

What am I doing that causes so many extra commas? Also, how would I go about removing the entries I do not need?

score 14 · Accepted Answer

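Your csv file has multiple spaces between items. With delimiter=' ', the csv reader treats every single space as the start of a new column, so each run of spaces produces empty fields. That is why the rows have so many "extra" columns.

Passing skipinitialspace=True makes the reader ignore spaces that immediately follow a delimiter. This removes the spurious extra columns.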

import Tkinter, tkFileDialog
import csv 

#Choose the file that you want to read from
root = Tkinter.Tk()
root.withdraw()

file_path = tkFileDialog.askopenfilename()
with open(file_path , 'rb') as current_file:
    reader = csv.reader(current_file, delimiter= ' ', 
                        skipinitialspace=True)
    with open('newdata.txt','wb') as output_file:
        out = csv.writer(output_file)
        for row in reader:
            out.writerow(row)
            print row
            #print row[0::1]

score 3 · Accepted Answer

Your file does not really appear to be a CSV file. I suggest using loadtxt() or genfromtxt() from the NumPy module or, if you cannot use NumPy, parsing the file yourself:

with open(file_path) as current_file:
    for line in current_file:
        data_row = map(float, line.split())
        # do whatever you want to do with the data
