这是一个可用于拆分大文件的 python 脚本subprocess
:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
您可以在外部调用它:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
您也可以subprocess
直接在程序中导入并运行它。
这种方法的问题是高内存使用:subprocess
创建一个内存占用与您的进程相同大小的分叉,如果您的进程内存已经很重,它会在运行时加倍。与os.system
.
这是另一种纯 python 方法,虽然我没有在大文件上测试过,但它会更慢但更节省内存:
CHUNK_SIZE = 5000
def yield_csv_rows(reader, chunk_size):
"""
Opens file to ingest, reads each line to return list of rows
Expects the header is already removed
Replacement for ingest_csv
:param reader: dictReader
:param chunk_size: int, chunk size
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
with open(local_file_path, 'rb') as f:
f.readline().strip().replace('"', '')
reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
chunks = yield_csv_rows(reader, CHUNK_SIZE)
for chunk in chunks:
if not chunk:
break
# Do something with your chunk here
这是另一个使用的示例readlines()
:
"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5
def yield_rows(reader, chunk_size):
"""
Yield row chunks
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
def batch_operation(data):
for item in data:
print(item)
with open('file', 'r') as f:
chunks = yield_rows(f.readlines(), CHUNK_SIZE)
for _chunk in chunks:
batch_operation(_chunk)
readlines 示例演示了如何对数据进行分块以将块传递给需要块的函数。不幸的是 readlines 在内存中打开整个文件,最好使用阅读器示例来提高性能。尽管如果您可以轻松地将所需的内容放入内存并需要分块处理它就足够了。