
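I am trying to load multiple files at once. They are all partitioned files. It works when I try it with 1 file, but when I list 24 files it gives me the error below. I can't find any documentation of a limit, or any workaround other than loading the files separately and doing a union afterwards. Are there any alternatives?

The code below reproduces the problem: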

basePath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc', ]

df = sqlContext.read.format('orc') \
               .options(header='true', inferschema='true', basePath=basePath) \
               .load(*paths)

Error received:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-43-7fb8fade5e19> in <module>()

---> 37 df = sqlContext.read.format('orc')                .options(header='true', inferschema='true',basePath=basePath)                .load(*paths)
     38 

TypeError: load() takes at most 4 arguments (24 given)

1 Answer


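As stated in the official documentation, to read multiple files you should pass a list:

path -- optional string or a list of string for file-system backed data sources.

So in your case: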

(sqlContext.read
    .format('orc') 
    .options(basePath=basePath)
    .load(path=paths))

Argument unpacking (*) would only make sense if load were defined with variadic arguments, for example:

def load(this, *paths):
    ...
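
To see why the unpacked call fails, here is a small sketch that runs without Spark. FakeReader is a hypothetical stand-in, not part of PySpark. Its load method only copies the parameter list that, as far as I know, DataFrameReader.load uses (path, format, schema, then keyword options), and its body is made up for illustration.

```python
# Stand-in that mirrors the parameter list of DataFrameReader.load.
# It is NOT Spark code, only an illustration of the call semantics.
class FakeReader:
    def load(self, path=None, format=None, schema=None, **options):
        # Normalize a single string to a one-element list.
        if isinstance(path, str):
            return [path]
        return list(path or [])


paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',
         '/file/df201604.orc', '/file/df201605.orc']
reader = FakeReader()

# Passing the list itself binds every path to the single `path` parameter.
loaded = reader.load(path=paths)
print(len(loaded))

# Unpacking spreads the paths over path, format and schema, then runs out
# of positional parameters, so Python raises TypeError.
try:
    reader.load(*paths)
    unpack_error = None
except TypeError as exc:
    unpack_error = exc
print(type(unpack_error).__name__)
```

Note that with three or fewer paths the unpacked call would not raise here. The second and third paths would be bound to format and schema instead, which is arguably worse than an error.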
answered 2018-01-19T17:17:30.820