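I am using an ETL tool that has Python 2.6 as its built-in scripting language, so when I needed to split a large file into chunks for downstream processing, it seemed like the obvious choice. I initially wrote and tested the script on my MacBook (OS X 10.8) using a Python 2.6 install.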

When I moved it to Windows, I was surprised to find that it ran 10x slower, even on an enterprise-grade server (32 cores, 64 GB, fibre channel SAN, etc.).

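When I tried to narrow down where the difference lies, commenting out the write made almost no difference on Mac OS X, whereas on Windows the run time with the write enabled is more than 5x higher.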
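
The with-write versus without-write comparison can be reproduced outside the ETL tool with something like the following. This is only a minimal sketch, not the script itself: `time_writes`, the `chunk_%d.txt` file names, and the file and line counts are all hypothetical values chosen for illustration.

```python
import os
import shutil
import tempfile
from time import time


def time_writes(out_dir, n_files, lines_per_file, do_write=True):
    """Open n_files one after another, optionally write lines, return elapsed seconds.

    Hypothetical helper: it mimics the open/write/close pattern of the
    splitter without reading any input file.
    """
    line = 'x' * 80 + '\n'
    start = time()
    for i in range(n_files):
        f_out = open(os.path.join(out_dir, 'chunk_%d.txt' % i), 'w')
        try:
            if do_write:
                for _ in range(lines_per_file):
                    f_out.write(line)
        finally:
            f_out.close()
    return time() - start


if __name__ == '__main__':
    tmp_dir = tempfile.mkdtemp()
    try:
        with_writes = time_writes(tmp_dir, 200, 1000, do_write=True)
        without_writes = time_writes(tmp_dir, 200, 1000, do_write=False)
        print('with writes    : %.3f s' % with_writes)
        print('without writes : %.3f s' % without_writes)
    finally:
        shutil.rmtree(tmp_dir)
```

Running the same file on both machines should show whether the gap comes from the writes themselves or from opening and closing many small files.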

Are there some fundamental file I/O differences between OS X and Windows?

Any help gratefully received :)

import os
import sys
import re
from time import time

t = time()


"""
# Split a pre-sorted text file into multiple outputs based on the leftmost element,
# delimited by spaces.
# The second element can be used for an additional sort and will be stripped from the
# output when 'isLeadingSort=1'.
#
# parameters:
#       path:           char    path for the input file
#       outPath:        char    path for the output files
#       isLeadingSort   int     1: output data starts at the 3rd element, 0: at the 2nd
#       isdbg           int     enable debug prints
"""

# Just use the cmd at the moment for test
path= sys.argv[1]
outPath = sys.argv[2]
isLeadingSort = int(sys.argv[3])
isdbg = int(sys.argv[4])

#outPath = os.getcwd()
#isLeadingSort = 0
#isdbg = 0

# define all the functions up front

def printStr(msg):
    """Print msg when the debug option is set."""
    if isdbg:
        print (msg)

def testPath(path):
    """Raise an exception if the path or file cannot be found."""
    if not os.path.exists(path):
        raise Exception ('File not found: ' + path )


#
# This is where we start
#
# check that the paths exist or raise an exception

testPath(path)
testPath(outPath)

printStr ('paths ok')

#init
fnameOut = chr(1) # init the output filename to a value no real key will match
fOut = None       # no output file is open yet

# the split limit depends only on the command-line option, so compute it once
if isLeadingSort:
    wrds=2
else:
    wrds=1

# open the input file for reading and process through in a loop
with open(path,'r') as f:
    for line in f:
        printStr( 'for line in f: ' )

        arLine = re.split('[ \n]+',line,wrds)
        newFname = arLine[0]
        outLine = arLine[-1]

        if newFname == fnameOut:
            printStr ('writing to open file: ' + fnameOut)
        else:
            fnameOut = newFname
            printStr ('opennextfile: ' + fnameOut + '- closing: ' + str(fOut) )
            # close the previous output file, if there is one
            if fOut is not None:
                fOut.close()

            if fnameOut in ('' , '\n'):
                raise Exception ('Filename is not the first element of the data: '  )
            fOut = open(os.path.join(outPath,fnameOut),'w') # open new

        #write
        fOut.write(outLine)

    # close the last output file
    if fOut is not None:
        fOut.close()


print ( 'timediff : ' + str(time() - t))
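
For reference, here is how the split inside the loop behaves on a single line. This is a small sketch only: `split_line` is a hypothetical helper and the sample line is made up, but the `re.split` call uses the same pattern and split limits as the script.

```python
import re


def split_line(line, is_leading_sort):
    """Return (output file name, text written to that file) for one input line.

    Hypothetical helper mirroring the loop body: split on runs of spaces or
    newlines, at most twice when the second element is a sort key.
    """
    max_splits = 2 if is_leading_sort else 1
    parts = re.split('[ \n]+', line, maxsplit=max_splits)
    return parts[0], parts[-1]


sample = 'fileA 0001 some data here\n'

# With the leading sort key, the second element is dropped from the output.
print(split_line(sample, True))   # ('fileA', 'some data here\n')

# Without it, everything after the file name is written out.
print(split_line(sample, False))  # ('fileA', '0001 some data here\n')
```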