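In my case, I have two CSV files (file1 and file2).
To simplify my problem, suppose I want to read the two files alternately: 3 rows at a time from file1 and 4 rows at a time from file2.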
file1.csv (9 rows)
1,2,3
3,5,8
7,2,9
10,111,12
13,14,155
31,2,3
3,15,82
8,4,91
12,111,13
file2.csv (12 rows)
55,12,17
3,6,13
72,1,91
10,0,12
1,1,73
31,2,3
3,15,61
18,6,91
13,33,13
7,1,15
9,17,42
41,8,18
In the output I would like to get:
1,2,3 (from 1. row of file1.csv)
3,5,8 (from 2. row of file1.csv)
7,2,9 (from 3. row of file1.csv)
55,12,17 (from 1. row of file2.csv)
3,6,13 (from 2. row of file2.csv)
72,1,91 (from 3. row of file2.csv)
10,0,12 (from 4. row of file2.csv)
10,111,12 (from 4. row of file1.csv)
13,14,155 (from 5. row of file1.csv)
31,2,3 (from 6. row of file1.csv)
1,1,73 (from 5. row of file2.csv)
31,2,3 (from 6. row of file2.csv)
3,15,61 (from 7. row of file2.csv)
18,6,91 (from 8. row of file2.csv)
3,15,82 (from 7. row of file1.csv)
8,4,91 (from 8. row of file1.csv)
12,111,13 (from 9. row of file1.csv)
13,33,13 (from 9. row of file2.csv)
7,1,15 (from 10. row of file2.csv)
9,17,42 (from 11. row of file2.csv)
41,8,18 (from 12. row of file2.csv)
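The alternating order above can be sketched with `itertools.islice`, which pulls a fixed number of lines from an open file without loading the rest of it. This is only a minimal sketch: the function name `interleave_windows` and the `(path, window_size)` argument format are assumptions made for illustration, and it assumes the files are visited in a fixed round-robin order until all of them are exhausted.

```python
from itertools import islice


def interleave_windows(specs):
    """Yield (path, rows) windows, cycling over the files in order.

    specs is a list of (path, window_size) pairs (a hypothetical format).
    Only one window of parsed rows is held in memory at a time.
    """
    handles = [(path, open(path, "r"), size) for path, size in specs]
    try:
        active = list(handles)
        while active:
            still_open = []
            for path, f, size in active:
                # islice reads at most `size` lines from the current position;
                # blank lines are dropped rather than parsed.
                rows = [
                    [int(x) for x in line.strip().split(",")]
                    for line in islice(f, size)
                    if line.strip()
                ]
                if rows:
                    yield path, rows
                    still_open.append((path, f, size))
            # Files that returned nothing are exhausted and leave the cycle.
            active = still_open
    finally:
        for _, f, _ in handles:
            f.close()
```

Calling `interleave_windows([("file1.csv", 3), ("file2.csv", 4)])` and printing each window would, under these assumptions, visit the rows in the order listed above.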
My real data files are very large (about 1.6 GB each), and I want to use as little memory as possible. To that end, I wrote this script:
import numpy as np

# pathInput1 and pathInput2 hold the paths to file1.csv and file2.csv
f1, f2 = open(pathInput1, 'r'), open(pathInput2, 'r')
position1, position2 = 0, 0
for i in range(6):
    if i % 2 == 0:
        # Even step: read a window of 3 rows from file1.csv
        sizeOfWindow = 3
        sizeOfWindowInactive = 4
        f1.seek(position1)
        data = []
        for l in range(sizeOfWindow):
            line = f1.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        # Advance the inactive file (file2.csv) without parsing its rows
        [next(f2) for i in range(sizeOfWindowInactive)]
        position1 = f1.tell()
    else:
        # Odd step: read a window of 4 rows from file2.csv
        sizeOfWindow = 4
        sizeOfWindowInactive = 3
        f2.seek(position2)
        data = []
        for l in range(sizeOfWindow):
            line = f2.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        # Advance the inactive file (file1.csv) without parsing its rows
        [next(f1) for i in range(sizeOfWindowInactive)]
        position2 = f2.tell()
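After writing this script, I noticed that I cannot use readline() and next() together on the same file. So my question is: how can I restructure my script so that it produces the output shown above without using much memory?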
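The conflict between the two calls can be reproduced in isolation. In CPython, calling `next()` on a text-mode file object disables `tell()` on that same object, which then raises `OSError`; reading with `readline()` alone keeps `tell()` and `seek()` usable. The following is a minimal sketch, assuming CPython 3 and a throwaway temporary file created only for the demonstration (the variable names are illustrative).

```python
import os
import tempfile

# Create a small throwaway CSV file (demo data only).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2,3\n3,5,8\n7,2,9\n")
    demo_path = tmp.name

# Case 1: next() followed by tell() on the same file object.
with open(demo_path, "r") as f:
    next(f)
    try:
        f.tell()
        tell_failed_after_next = False
    except OSError:
        # CPython reports that telling was disabled by the next() call.
        tell_failed_after_next = True

# Case 2: readline() only, so tell() and seek() keep working.
with open(demo_path, "r") as f:
    f.readline()              # consume row 1
    bookmark = f.tell()       # remember where row 2 starts
    f.readline()              # consume row 2
    f.seek(bookmark)          # jump back to the bookmark
    row_again = f.readline()  # row 2 is read a second time

os.remove(demo_path)
print(tell_failed_after_next, repr(row_again))
```

One way to read this is that skipping lines with `readline()` instead of `next()` would let the rest of the script keep its `tell()`/`seek()` bookkeeping.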
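Edit: In my real case, I have 5 files, and each file has its own sizeOfWindow, so sizeOfWindow is fixed per file. I do not read the files in a regular order: I use if statements on the last chunk of data I read to decide which file to jump to next. While I read one file, I need to move the cursors of the other files forward without reading their data.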
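The multi-file case described in this edit could be sketched as follows. This is a hedged illustration, not a definitive solution: the names `read_window`, `skip_window`, `run`, and the `choose_next` callback are all hypothetical, and the sketch assumes that each read of the active file should advance every other file by exactly one of its own windows. It uses `readline()` only, so `tell()` stays available if positions need to be saved. Skipped lines are still read from disk, but they are discarded immediately and never parsed.

```python
from typing import Callable, Dict, Iterator, List, Optional, TextIO, Tuple

Rows = List[List[int]]


def read_window(f: TextIO, n: int) -> Rows:
    """Read and parse up to n rows using readline() only."""
    rows: Rows = []
    for _ in range(n):
        line = f.readline()
        if not line:  # end of file
            break
        if line.strip():
            rows.append([int(x) for x in line.strip().split(",")])
    return rows


def skip_window(f: TextIO, n: int) -> None:
    """Advance the cursor by up to n lines without parsing them."""
    for _ in range(n):
        if not f.readline():
            break


def run(
    files: Dict[str, TextIO],
    windows: Dict[str, int],
    start: str,
    choose_next: Callable[[str, Rows], Optional[str]],
) -> Iterator[Tuple[str, Rows]]:
    """Yield (name, rows); choose_next picks the next file from the last rows.

    Assumption: after each read, every inactive file skips one of its windows.
    Stops when choose_next returns None or the active file is exhausted.
    """
    active: Optional[str] = start
    while active is not None:
        rows = read_window(files[active], windows[active])
        if not rows:
            break
        for name, f in files.items():
            if name != active:
                skip_window(f, windows[name])
        yield active, rows
        active = choose_next(active, rows)
```

Here `files` would map a label to each of the 5 open file objects and `windows` would map the same labels to their sizeOfWindow values; the if statements that inspect the last chunk would live inside `choose_next`.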