In my case, I have two csv files (file1 and file2).

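To simplify my problem, suppose I want to read the rows of file1 three at a time and the rows of file2 four at a time, alternating between the two files.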

file1.csv (9 rows)

1,2,3
3,5,8
7,2,9
10,111,12
13,14,155
31,2,3
3,15,82
8,4,91
12,111,13

file2.csv (12 rows)

55,12,17
3,6,13
72,1,91
10,0,12
1,1,73
31,2,3
3,15,61
18,6,91
13,33,13
7,1,15
9,17,42
41,8,18

In the output I want to get:

1,2,3 (from 1. row of file1.csv)
3,5,8 (from 2. row of file1.csv)
7,2,9 (from 3. row of file1.csv)
55,12,17  (from 1. row of file2.csv)
3,6,13  (from 2. row of file2.csv)
72,1,91  (from 3. row of file2.csv)
10,0,12  (from 4. row of file2.csv)
10,111,12  (from 4. row of file1.csv)
13,14,155  (from 5. row of file1.csv)
31,2,3  (from 6. row of file1.csv)
1,1,73  (from 5. row of file2.csv)
31,2,3  (from 6. row of file2.csv)
3,15,61  (from 7. row of file2.csv)
18,6,91  (from 8. row of file2.csv)
3,15,82  (from 7. row of file1.csv)
8,4,91  (from 8. row of file1.csv)
12,111,13  (from 9. row of file1.csv)
13,33,13  (from 9. row of file2.csv)
7,1,15  (from 10. row of file2.csv)
9,17,42  (from 11. row of file2.csv)
41,8,18  (from 12. row of file2.csv)

My real data files are very large (about 1.6 GB each), and I want to use as little memory as possible. To that end, I wrote this script:

f1, f2, = open(pathInput1, 'r'), open(pathInput2, 'r')
position1, position2 = 0, 0

for i in range(6):
    if i%2 == 0:
        #print("file1.csv")
        sizeOfWindow = 3
        sizeOfWindowInactive = 4
        f1.seek(position1)
        data = []
        for l in range(sizeOfWindow):
            line = f1.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        [next(f2) for i in range(sizeOfWindowInactive)]
        position1 = f1.tell()
    else:
        #print("file2.csv")
        sizeOfWindow = 4
        sizeOfWindowInactive = 3
        f2.seek(position2)
        data = []
        for l in range(sizeOfWindow):
            line = f2.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        [next(f1) for i in range(sizeOfWindowInactive)]
        position2 = f2.tell()

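After writing this script, I noticed that I cannot use readline() and next() together. So my question is: how can I arrange my script to get the same output without using too much memory?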

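Edit: In my real case, I have 5 files, and each file has its own sizeOfWindow, so sizeOfWindow is fixed per file. I do not read the files in a regular pattern: based on the last chunk of data I read, I decide with an if statement which file to jump to next. While I read one file, I need to move the cursors of the other files forward without reading their data.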

1 Answer

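Since you only need to read the files sequentially, you can use next(f1) and next(f2) to fetch lines as you need them. The itertools module contains helpers that make this easier. itertools.islice grabs several lines at once, so you do not need your own for loop around next. And itertools.cycle alternates between the items of a sequence, so you do not need to keep track of which file comes next. Putting it together: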

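import itertools
import numpy as np

with open(pathInput1) as f1, open(pathInput2) as f2:
    # (window size, file) pairs, visited in turn by itertools.cycle
    grab_this = ((3, f1), (4, f2))
    for num, fp in itertools.cycle(grab_this):
        # islice pulls at most `num` lines; the rest of the file stays on disk
        rows = [list(map(int, line.split(","))) for line in itertools.islice(fp, num)]
        if not rows:
            # the current file is exhausted, so stop
            break
        # build the array from parsed rows; np.array cannot consume an islice directly
        data = np.array(rows)
        print(data)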
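
For the case described in the edit to the question, where the next file is chosen from the data just read and the other files only need their cursors moved, the same islice idea can be wrapped in two small helpers. This is only a sketch: read_window and skip_lines are hypothetical names, not part of any library, and the io.StringIO objects stand in for the real files with made-up contents.

```python
import io
import itertools
from collections import deque


def read_window(fp, size):
    """Read up to `size` lines from `fp` and parse each one as a row of ints."""
    return [[int(x) for x in line.split(",")] for line in itertools.islice(fp, size)]


def skip_lines(fp, count):
    """Advance `fp` by up to `count` lines without keeping them in memory."""
    # A zero-length deque consumes the iterator and discards every item.
    deque(itertools.islice(fp, count), maxlen=0)


# Small in-memory stand-ins for the real CSV files (hypothetical data).
f1 = io.StringIO("1,2,3\n3,5,8\n7,2,9\n10,111,12\n")
f2 = io.StringIO("55,12,17\n3,6,13\n72,1,91\n10,0,12\n1,1,73\n")

first = read_window(f1, 3)   # rows 1-3 of the first file
skip_lines(f2, 4)            # move the second file past its first window
second = read_window(f2, 4)  # only one row is left, so only one row comes back
print(first)
print(second)
```

Both helpers touch only the lines they step over, so memory use stays at roughly one window regardless of file size. Skipping still has to scan the skipped lines, because a line's byte offset is not known in advance.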
answered 2018-05-03T17:08:02.290