python - Read random lines from a huge CSV file

Question

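I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows iterating sequentially through the file.

Reading the whole file into memory in order to pick lines at random is very memory-consuming, and going through the whole file, discarding some values and choosing others, is very time-consuming. So is there any way to choose some random line from the CSV file and read only that line?

I tried without success: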

import csv

with open('linear_e_LAN2A_F_0_435keV.csv') as file:
    reader = csv.reader(file)
    print reader[someRandomInteger]

Example of the CSV file:

331.093,329.735
251.188,249.994
374.468,373.782
295.643,295.159
83.9058,0
380.709,116.221
352.238,351.891
183.809,182.615
257.277,201.302
61.4598,40.7106

score 32 · Accepted Answer

import os
import random

filesize = os.path.getsize('really_big_file')  # size of the really big file, in bytes
offset = random.randrange(filesize)

f = open('really_big_file', 'rb')  # binary mode, so seek() accepts any byte offset
f.seek(offset)                  # go to random position
f.readline()                    # discard - bound to be partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead

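As @AndreBoos pointed out, this approach will lead to biased selection: a line that follows a long line is more likely to be picked. If you know the minimum and maximum line lengths, you can remove this bias by doing the following:

Let's assume (in this case) we have min=3 and max=15.

1) Find the length (Lp) of the previous line.

If Lp = 3, the current line is the one most biased against, so we should keep it 100% of the time. If Lp = 15, the line is the one most biased towards, so we should keep it only 20% of the time, since it is 5 times more likely to be selected.

We accomplish this by randomly keeping the line X% of the time, where:

X = min / Lp

If we don't keep the line, we do another random pick until our dice roll comes good. :-)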
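
The rejection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's code: the function name `unbiased_random_line`, the `max_len` look-back used to find where the landed line starts, and the small demo file are all made up for the example, and both lengths are taken to be in bytes with the newline included.

```python
import os
import random
import tempfile
from collections import Counter


def unbiased_random_line(path, min_len, max_len):
    """Pick one line uniformly at random using seek plus rejection.

    min_len and max_len are the assumed shortest and longest line
    lengths in bytes, newline included.
    """
    filesize = os.path.getsize(path)
    with open(path, "rb") as f:
        while True:
            offset = random.randrange(filesize)
            # Find the start of the line we landed in by looking back
            # at most max_len bytes for the preceding newline.
            window_start = max(0, offset - max_len)
            f.seek(window_start)
            window = f.read(offset - window_start)
            line_start = window_start + window.rfind(b"\n") + 1
            f.seek(line_start)
            landed = f.readline()      # the "previous" line, length Lp
            candidate = f.readline()   # the line we might return
            if not candidate:          # landed in the last line: wrap around
                f.seek(0)
                candidate = f.readline()
            # Keep the candidate with probability min / Lp, else retry.
            if random.random() < min_len / len(landed):
                return candidate.decode().rstrip("\r\n")


# Demo on a throwaway file whose lines are 3, 15 and 5 bytes long.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as out:
    out.write(b"aa\n" + b"b" * 14 + b"\n" + b"cccc\n")

counts = Counter(
    unbiased_random_line(demo_path, min_len=3, max_len=15) for _ in range(3000)
)
```

Without the rejection step, the line after the 15-byte line would be returned far more often than the others; with it, the three counts should come out roughly equal.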

score 10 · Accepted Answer

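I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it

Assuming you don't need exactly 1 million lines and you know the number of lines in your CSV file beforehand, you can use reservoir sampling to retrieve your random subset. Simply iterate through your data and, for each line, determine the chance of that line being selected. That way you only need a single pass over your data.

This method works well if you need to extract random samples often but the actual dataset changes infrequently (since you only need to track the number of entries each time the dataset changes).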

import csv
from random import random

chances_selected = float(desired_num_results) / total_entries
result = []
for line in csv.reader(file):
    if random() < chances_selected:
        result.append(line)
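
A self-contained version of the loop above might look like the sketch below. Strictly speaking, keeping each row independently with a fixed probability is Bernoulli sampling rather than classic reservoir sampling, so the sample size is only approximately the desired one. The function name `bernoulli_sample` and the in-memory test data are assumptions made for the example.

```python
import csv
import io
import random


def bernoulli_sample(rows, desired, total):
    """Keep each row independently with probability desired / total."""
    p = desired / total
    return [row for row in rows if random.random() < p]


# Small in-memory stand-in for the real 15 Gb file.
data = io.StringIO("".join("%d,%d\n" % (i, i * 2) for i in range(10000)))
sample = bernoulli_sample(csv.reader(data), desired=1000, total=10000)
print(len(sample))  # close to 1000, but usually not exactly 1000
```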

score 7 · Accepted Answer

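You can use a variation of the probabilistic method for choosing a random line in a file.

Instead of just keeping a single chosen item, you can keep a buffer of size C. For each line number n, in a file with N lines, you want to choose that line with probability C/n (rather than the original 1/n). If the line is selected, you then choose a random location in the C-length buffer to evict.

Here's how it works: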

import random

C = 2
fpath = 'somelines.txt'
buffer = []

f = open(fpath, 'r')
for line_num, line in enumerate(f):
    n = line_num + 1.0
    r = random.random()
    if n <= C:
        buffer.append(line.strip())
    elif r < C/n:
        loc = random.randint(0, C-1)
        buffer[loc] = line.strip()

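This requires a single pass through the file (so it's linear time) and returns exactly C lines from the file. Each line has probability C/N of being selected.

To verify that the above works, I created a file with 5 lines containing a, b, c, d, e. I ran the code 10,000 times with C=2. This should produce a roughly even distribution over the 5 choose 2 (so 10) possible choices. The results: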

a,b: 1046
b,c: 1018
b,e: 1014
a,c: 1003
c,d: 1002
d,e: 1000
c,e: 993
a,e: 992
a,d: 985
b,d: 947
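
The same experiment can be reproduced with the buffer logic wrapped in a function. This is a sketch only: the name `reservoir_sample` is made up here, and the five "lines" are simulated with the string "abcde" instead of a real file.

```python
import random
from collections import Counter


def reservoir_sample(lines, c):
    """Return c items chosen uniformly at random from an iterable."""
    buffer = []
    for n, line in enumerate(lines, start=1):
        if n <= c:
            buffer.append(line)                 # fill the buffer first
        elif random.random() < c / n:
            buffer[random.randrange(c)] = line  # evict a random slot
    return buffer


# Repeat the 10,000-run check with C=2 on five one-letter "lines".
counts = Counter()
for _ in range(10000):
    pair = reservoir_sample("abcde", 2)
    counts[",".join(sorted(pair))] += 1

for pair, count in counts.most_common():
    print(pair, count)
```

Each of the 10 possible pairs should show up about 1,000 times, in line with the figures reported above.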

score 4 · Accepted Answer

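If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don't mind scanning through the huge file once (without loading it into memory), then you can build a list of line offsets and use seek to grab the lines quickly (based on Maria Zverina's answer).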

import random

# Overhead:
# Read the line locations into memory once.  (If the lines are long,
# this should take substantially less memory than the file itself.)
fname = 'big_file'
linelocs = []
offset = 0
with open(fname, 'rb') as f:  # binary mode, so len(line) is a byte count
    for line in f:
        linelocs.append(offset)
        offset += len(line)
f = open(fname, 'rb') # Reopen the file.

# Each subsequent iteration uses only the code below:
# Grab a 1,000,000 line sample
# I sorted these because I assume the seeks are faster that way.
chosen = sorted(random.sample(linelocs, 1000000))
sampleLines = []
for offset in chosen:
    f.seek(offset)
    sampleLines.append(f.readline())  # bytes; decode if text is needed
# Now we can randomize if need be.
random.shuffle(sampleLines)

score 2 · Accepted Answer

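If the lines are truly .csv format and NOT fixed-width fields, then no, there isn't. You can crawl through the file once, indexing the byte offset of each line, and later use only that index when needed, but there is no way to predict a priori the exact location of the line-terminating \n character in an arbitrary csv file.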

score 2 · Accepted Answer

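If you know the total number of lines, another solution is to generate 1 million random line numbers below the total line count, as a set, and then use them as a filter:

lines_to_grab = set(random.sample(range(n), 1000000))

def grab_lines(csvfile):
    for i, line in enumerate(csvfile):
        if i in lines_to_grab:
            yield line

This will get you exactly 1 million lines in an unbiased way, but you need to have the number of lines beforehand.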
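
If the line count is not known in advance, one extra counting pass can supply it. The sketch below assumes that two sequential passes over the file are acceptable; the function name `sample_exact` and the small demo file are made up for the example.

```python
import os
import random
import tempfile


def sample_exact(path, k):
    """Return exactly k distinct lines, in file order."""
    with open(path, "rb") as f:
        n = sum(1 for _ in f)                  # pass 1: count the lines
    wanted = set(random.sample(range(n), k))   # set gives O(1) lookups
    with open(path, "rb") as f:                # pass 2: keep chosen lines
        return [line for i, line in enumerate(f) if i in wanted]


# Demo on a small throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w") as out:
    out.writelines("%d,%d\n" % (i, i * i) for i in range(1000))

picked = sample_exact(demo_path, 25)
print(len(picked))
```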

score 1 · Accepted Answer

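If you can place this data in a sqlite3 database, selecting some number of random rows is trivial. You do not need to pre-read or pad the lines in the file. Since sqlite data files are binary, your data file will be 1/3 to 1/2 smaller than the CSV text.

You can use a script to import the CSV file or, better still, just write your data to a database table in the first place. sqlite3 is part of the Python distribution.

Then use these statements to get 1,000,000 random rows: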

import sqlite3

mydb='csv.db'
con=sqlite3.connect(mydb)

with con:
    cur=con.cursor()
    cur.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 1000000;")

    for row in cur.fetchall():
        # now you have random rows...
        print(row)
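
For completeness, here is a minimal sketch of the import step followed by the random query. It uses an in-memory database and a few rows from the question's sample data; the column names `x` and `y` are assumptions, since the original file has no header.

```python
import csv
import io
import sqlite3

# A few rows from the question, standing in for the real CSV file.
raw = io.StringIO(
    "331.093,329.735\n"
    "251.188,249.994\n"
    "374.468,373.782\n"
    "295.643,295.159\n"
    "83.9058,0\n"
)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE csv (x REAL, y REAL)")  # hypothetical column names
con.executemany("INSERT INTO csv VALUES (?, ?)", csv.reader(raw))
con.commit()

# Pick 3 distinct rows at random.
rows = con.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 3").fetchall()
for row in rows:
    print(row)
```

Note that ORDER BY RANDOM() still has to visit every row of the table, so this trades the Python loop for work done inside SQLite rather than avoiding a full scan.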

score 0 · Accepted Answer

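You can rewrite the file with fixed-length records, and then perform random access on the intermediate file:

ifile = open("inputfile.csv")
ofile = open("intermediatefile.csv", 'w', newline='\n')
for line in ifile:
    ofile.write(line.rstrip('\n').ljust(15)+'\n')
ofile.close()

Then, you can do:

import random
RECORD_LEN = 16                      # 15 padded characters plus '\n'
ifile = open("intermediatefile.csv", 'rb')
lines = []
samples = random.sample(range(nlines), nsamples)  # nlines records, nsamples wanted
for sample in samples:
    ifile.seek(sample * RECORD_LEN)  # byte offset of the chosen record
    lines.append(ifile.readline())

This requires more disk space, and the first program may take some time to run, but it allows unlimited random access to the records later with the second program.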
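
Both steps can be tried end to end on a tiny file. In this sketch the record width of 16 bytes matches the ljust(15) padding above, while the helper `read_record`, the sample rows and the temporary path are assumptions made for the example.

```python
import os
import random
import tempfile

RECORD_LEN = 16  # 15 padded characters plus the newline

src_lines = ["331.093,329.735", "83.9058,0", "61.4598,40.7106", "380.709,116.221"]

# Step 1: write every line padded to the same width.
fixed_path = os.path.join(tempfile.mkdtemp(), "intermediatefile.csv")
with open(fixed_path, "wb") as out:
    for line in src_lines:
        out.write(line.ljust(RECORD_LEN - 1).encode("ascii") + b"\n")


def read_record(f, i):
    """Jump straight to record i and return it without the padding."""
    f.seek(i * RECORD_LEN)
    return f.read(RECORD_LEN).decode("ascii").rstrip()


# Step 2: the record count follows from the file size, so no scan is needed.
nlines = os.path.getsize(fixed_path) // RECORD_LEN
with open(fixed_path, "rb") as f:
    picks = random.sample(range(nlines), 2)
    records = [read_record(f, i) for i in picks]
    third = read_record(f, 2)
print(records)
```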

score 0 · Accepted Answer

import csv
import random

# pass 1, count the number of rows in the file
rowcount = sum(1 for line in file)
# pass 2, select random lines
file.seek(0)
remaining = 1000000
for row in csv.reader(file):
    if random.randrange(rowcount) < remaining:
        print(row)
        remaining -= 1
    rowcount -= 1
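
The second pass above is a form of selection sampling: each row is kept with probability (rows still needed) / (rows still unseen), which yields exactly the requested number of rows. A sketch of the same loop as a reusable generator follows; the name `selection_sample` is made up, and a range of integers stands in for the CSV rows.

```python
import random


def selection_sample(rows, rowcount, k):
    """Yield exactly k of the rowcount rows, in their original order."""
    remaining = k
    for row in rows:
        # Keep this row with probability remaining / rowcount.
        if random.randrange(rowcount) < remaining:
            yield row
            remaining -= 1
        rowcount -= 1


picked = list(selection_sample(range(1000), 1000, 50))
print(len(picked))
```

The row count passed in has to match the real number of rows, otherwise the guarantee of exactly k results no longer holds.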

score 0 · Accepted Answer

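In this method, we generate a set of random numbers whose size equals the number of lines to be read and whose range is the number of rows present in the data. It is then sorted from smallest to largest and stored.

The csv file is then read line by line, with a line_counter denoting the row number. This line_counter is checked against the first element of the sorted list of random numbers. If they are the same, that row is written to the new csv file and the first element is removed from the list, so that the former second element takes its place, and the loop continues.

import csv
import random
from collections import deque
k=random.sample(range(No_of_rows_in_data),No_of_lines_to_be_read)
Num=deque(sorted(k))    # deque, so removing the first element is cheap
line_counter = 0

with open(input_file,newline='') as file_handle:
    reader = csv.reader(file_handle)
    with open(output_file,'w',newline='') as outfile:
        a=csv.writer(outfile)
        for line in reader:
            if line_counter == Num[0]:    # sampled row numbers are 0-based
                a.writerow(line)
                Num.popleft()
                if len(Num)==0:
                    break
            line_counter += 1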

score 0 · Accepted Answer

If you can use pandas and numpy, I have posted a solution in another question that is pandas-specific but very efficient:

import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 1000000
batch_size = 5000

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk

For more details, please see the other answer.

score 0 · Accepted Answer

def random_line(path, hint=1):
    with open(path, mode='rb') as file:
        import random
        while file.seek(random.randrange(file.seek(-2, 2))) and not file.readline(hint).endswith(b'\n'):
            pass
        return file.readline().decode().strip()

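This is what I wrote to read a random line from a very large file.

The time complexity is O(k), where k is the average length of the lines in the text file.

The hint argument is the minimum length of the lines in the text file; if you know it beforehand, use it to speed up the function.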

score 0 · Accepted Answer

Always works for me

import csv
import random
randomINT = set(random.sample(range(1, 72655), 40000))  # set for fast membership tests
with open("file.csv", newline="") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"')
    data_read = [row for idx, row in enumerate(reader) if idx in randomINT]
    for idx, line in enumerate(data_read):
        pass
