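I'm using an ETL tool that has Python 2.6 as its built-in scripting language, so when I needed to split a large file into chunks for downstream processing, it seemed like the obvious choice. I initially wrote and tested the script on my MacBook (OS X 10.8) with a Python 2.6 install.
When I moved it to Windows, I was amazed that it ran 10x slower, even on an enterprise-class server (32 cores, 64 GB, Fibre Channel SAN, etc.).
When trying to narrow down where the difference lies: on Mac OS X, commenting out the write makes virtually no difference, whereas on Windows the script speeds up by more than 5x.
Is there some fundamental file I/O difference between OS X and Windows?
Any help gratefully accepted :)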
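To make the expected input concrete, here is a minimal sketch of how each line is split into an output filename and the text that gets written. The helper name `split_line` and the sample lines are made up for illustration; only the `re.split` pattern is taken from the script.

```python
import re


def split_line(line, is_leading_sort):
    """Return (output filename, text to write) for one input line."""
    # One split drops only the filename; two splits also drop the sort key.
    max_splits = 2 if is_leading_sort else 1
    parts = re.split('[ \n]+', line, maxsplit=max_splits)
    return parts[0], parts[-1]


# Hypothetical sample lines, not taken from the real data.
print(split_line('fileA some payload text\n', 0))
print(split_line('fileA 0001 some payload text\n', 1))
```

Both calls yield the same filename and payload; in the second, the extra sort key `0001` is dropped from the output.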
import os
import sys
import re
from time import time
t = time()
"""
# Split a pre-sorted text file into multiple outputs based on the leftmost element,
# delimited by spaces.
# The second element can be used for an additional sort and will be stripped from the
# output when 'isLeadingSort=1'.
#
# parameters:
# path:          char  path for the input file
# outPath:       char  path for the output files
# isLeadingSort: int   0: output starts at the 2nd element; 1: output starts at the 3rd
# isdbg:         int   enable debug prints
"""
# Just use the cmd at the moment for test
path= sys.argv[1]
outPath = sys.argv[2]
isLeadingSort = int(sys.argv[3])
isdbg = int(sys.argv[4])
#outPath = os.getcwd()
#isLeadingSort = 0
#isdbg = 0
# define all the functions up front
def printStr(msg):
    """ print when the debug option is set """
    if isdbg:
        print (msg)

def testPath(path):
    """raise an exception if we cant find the path or file"""
    if not os.path.exists(path):
        raise Exception ('File not found: ' + path )
#
# This is where we start
#
# check that the paths exist or raise an exception
testPath(path)
testPath(outPath)
printStr ('paths ok')
#init
arLine = []
fnameOut = chr(1) # init the output filename
line=object()
fOut=object()
# open the input file for reading and process though in a loop
with open(path,'r') as f:
    for line in f:
        printStr( 'for line in f: ' )
        # split off the filename (and the sort key when isLeadingSort is set)
        if isLeadingSort:
            wrds=2
        else:
            wrds=1
        arLine = re.split('[ \n]+',line,wrds)
        newFname = arLine[0]
        outLine = arLine[-1]
        if newFname == fnameOut:
            printStr ('writing to open file: ' + fnameOut)
        else:
            fnameOut = newFname
            printStr ('opennextfile: ' + fnameOut + '- closing: ' + str(fOut) )
            try:
                fOut.close()
            except AttributeError:
                pass # fOut is still the placeholder object() before the first open
            if fnameOut in ('' , '\n'):
                raise Exception ('Filename is not the first element of the data: ' )
            fOut = open(os.path.join(outPath,fnameOut),'w') # open new
        #write
        fOut.write(outLine)
# close the last output file
try:
    fOut.close()
except AttributeError:
    pass # no output file was ever opened (empty input)
print ( 'timediff : ' + str(time() - t))
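
For anyone wanting to reproduce the with/without-write comparison in isolation, a minimal sketch follows. It assumes the same grouping logic as the script with `isLeadingSort=0`; the names `split_file` and `make_sample` and the generated sample data are hypothetical, and the timings it prints will depend entirely on the machine and filesystem.

```python
import os
import re
import shutil
import tempfile
from time import time


def split_file(path, out_dir, do_write=True):
    """Split a pre-sorted file by its first element; return elapsed seconds.

    With do_write=False the output files are still opened and closed,
    but nothing is written, mimicking the commented-out write.
    """
    start = time()
    current_name = None
    f_out = None
    with open(path, 'r') as f_in:
        for line in f_in:
            parts = re.split('[ \n]+', line, maxsplit=1)
            name, out_line = parts[0], parts[-1]
            if name == '':
                raise ValueError('Filename is not the first element of the data')
            if name != current_name:
                # key changed: close the previous output and open the next
                if f_out is not None:
                    f_out.close()
                current_name = name
                f_out = open(os.path.join(out_dir, name), 'w')
            if do_write:
                f_out.write(out_line)
    if f_out is not None:
        f_out.close()
    return time() - start


def make_sample(path, keys=('a', 'b', 'c'), lines_per_key=1000):
    """Write a small pre-sorted test input (made-up data)."""
    with open(path, 'w') as f:
        for key in keys:
            for i in range(lines_per_key):
                f.write('%s row %d\n' % (key, i))


if __name__ == '__main__':
    work_dir = tempfile.mkdtemp()
    try:
        sample = os.path.join(work_dir, 'sample.txt')
        make_sample(sample)
        for do_write in (True, False):
            out_dir = tempfile.mkdtemp(dir=work_dir)
            elapsed = split_file(sample, out_dir, do_write)
            print('write=%s timediff: %s' % (do_write, elapsed))
    finally:
        shutil.rmtree(work_dir)
```

Running the same file on both machines and comparing the two `timediff` values per platform should show whether the gap really sits in the `write` call or in opening and closing the output files.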