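I am working through the Hive tutorial in Tom White's O'Reilly Hadoop book. I am trying to create a bucketed table, but I cannot get Hive to create the buckets. I can create the table and load data into it, but all of the data then ends up in a single file.
I am running a pseudo-distributed Hadoop cluster, using Hadoop 1.2.1 and Hive 0.10.0 with a MySQL metastore.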
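The data (shown below) initially sits in the 'users' table. It is to go into a table with 4 buckets, which I expected to give one user per bucket.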
select * from users;
OK
id name
0 Nat
2 Joe
3 Kay
4 Ann
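As a rough illustration of how these four ids might map onto 4 buckets, the sketch below assumes Hive picks a bucket by taking the hash of the CLUSTERED BY column modulo the bucket count, and that an INT hashes to its own value. The names `bucket_for` and `assign_buckets` are hypothetical helpers for this example, not Hive functions.

```python
from collections import defaultdict

NUM_BUCKETS = 4
# Rows copied from the users table above.
USERS = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the bucket index for an id.

    Assumption: an INT column hashes to its own value, the hash is masked
    to be non-negative, and the bucket is that hash modulo the bucket count.
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


def assign_buckets(rows, num_buckets: int = NUM_BUCKETS) -> dict:
    """Group user names by the bucket their id would fall into."""
    buckets = defaultdict(list)
    for user_id, name in rows:
        buckets[bucket_for(user_id, num_buckets)].append(name)
    return dict(buckets)


buckets = assign_buckets(USERS)
for index in range(NUM_BUCKETS):
    print(index, buckets.get(index, []))
```

Under that assumption the rows would not land exactly one per bucket: ids 0 and 4 share bucket 0 and bucket 1 receives no rows.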
I set the properties below to try to enforce bucketing (I don't think setting mapred.reduce.tasks explicitly is necessary, but I included it just in case).
set hive.enforce.bucketing=true;
set mapred.reduce.tasks=4;
Then I create the table 'bucketed_users' and load the data into it.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id)
SORTED BY (id ASC) INTO 4 BUCKETS;
INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;
Output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
Ended Job = job_local1250355097_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table records.bucketed_users
Deleted hdfs://localhost/user/hive/warehouse/records/bucketed_users
Table records.bucketed_users stats: [num_partitions: 0, num_files: 1, num_rows: 4, total_size: 24, raw_data_size: 20]
OK
id name
Time taken: 8.527 seconds
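The data is loaded into 'bucketed_users' correctly (SELECT * FROM bucketed_users shows all the users), but only 1 file is created (num_files: 1 above) instead of the expected 4. Listing the bucketed_users directory in HDFS (dfs -ls /user/hive/warehouse/records/bucketed_users;) shows only one file, 000000_0. How can I enforce bucketing?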
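To make the mismatch easier to see, here is a small sketch that pulls three facts out of the console output above: the reducer count Hive planned, the number of files it reported, and whether the job ran in-process. `summarize` is a hypothetical helper, and the sample text is copied from the output in this question.

```python
import re

# Console lines copied from the INSERT OVERWRITE output in this question.
CONSOLE_OUTPUT = """\
Number of reduce tasks determined at compile time: 4
Job running in-process (local Hadoop)
Ended Job = job_local1250355097_0001
Table records.bucketed_users stats: [num_partitions: 0, num_files: 1, num_rows: 4, total_size: 24, raw_data_size: 20]
"""


def summarize(output: str) -> dict:
    """Extract the planned reducer count, files written, and local-mode hints."""
    planned = re.search(r"reduce tasks determined at compile time: (\d+)", output)
    files = re.search(r"num_files: (\d+)", output)
    # Either marker hints that the job did not go to a JobTracker.
    ran_locally = (
        "Job running in-process (local Hadoop)" in output
        or "job_local" in output
    )
    return {
        "planned_reducers": int(planned.group(1)) if planned else None,
        "num_files": int(files.group(1)) if files else None,
        "ran_locally": ran_locally,
    }


summary = summarize(CONSOLE_OUTPUT)
print(summary)
```

Hive planned 4 reducers but reported a single file, and the job id starts with job_local. That pattern is consistent with the job having gone through Hadoop's local job runner, which in Hadoop 1.x runs at most one reduce task; whether that is what limits the bucket count here is an assumption, not something the output states outright.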
The full log follows:
2013-10-03 20:49:30,769 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Execution log at: /tmp/katrina/katrina_20131003204949_a56048f5-ab2f-421b-af45-9ec3ff85731c.log
2013-10-03 20:49:31,139 INFO exec.ExecDriver (ExecDriver.java:execute(328)) - Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
2013-10-03 20:49:31,144 INFO exec.ExecDriver (ExecDriver.java:execute(350)) - adding libjars: file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:31,144 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(852)) - Processing alias users
2013-10-03 20:49:31,145 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(870)) - Adding input file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,145 INFO exec.Utilities (Utilities.java:isEmptyPath(1900)) - Content Summary not cached for hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:31,365 WARN util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(52)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-03 20:49:32,410 INFO exec.ExecDriver (ExecDriver.java:createTmpDirs(219)) - Making Temp Directory: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000
2013-10-03 20:49:32,420 WARN mapred.JobClient (JobClient.java:copyAndConfigureFiles(746)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-03 20:49:32,648 WARN snappy.LoadSnappy (LoadSnappy.java:<clinit>(46)) - Snappy native library not loaded
2013-10-03 20:49:32,655 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(370)) - CombineHiveInputSplit creating pool for hdfs://localhost/user/hive/warehouse/records/users; using filter path hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:32,661 INFO mapred.FileInputFormat (FileInputFormat.java:listStatus(199)) - Total input paths to process : 1
2013-10-03 20:49:32,716 INFO io.CombineHiveInputFormat (CombineHiveInputFormat.java:getSplits(411)) - number of splits 1
2013-10-03 20:49:32,847 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(423)) - Creating hive-builtins-0.10.0.jar in /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632 with rwxr-xr-x
2013-10-03 20:49:32,850 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(435)) - Extracting /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632/hive-builtins-0.10.0.jar to /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar-work--7485859847513724632
2013-10-03 20:49:32,870 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:downloadCacheObject(463)) - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,880 INFO filecache.TrackerDistributedCacheManager (TrackerDistributedCacheManager.java:localizePublicCacheObject(486)) - Cached file:///Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar as /tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar
2013-10-03 20:49:32,987 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Job running in-process (local Hadoop)
2013-10-03 20:49:33,034 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(340)) - Waiting for map tasks
2013-10-03 20:49:33,037 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(204)) - Starting task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,073 INFO mapred.Task (Task.java:initialize(534)) - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,077 INFO mapred.MapTask (MapTask.java:updateJobWithSplit(455)) - Processing split: Paths:/user/hive/warehouse/records/users/users.txt:0+24InputFormatClass: org.apache.hadoop.mapred.TextInputFormat
2013-10-03 20:49:33,093 INFO io.HiveContextAwareRecordReader (HiveContextAwareRecordReader.java:initIOContext(154)) - Processing file hdfs://localhost/user/hive/warehouse/records/users/users.txt
2013-10-03 20:49:33,093 INFO mapred.MapTask (MapTask.java:runOldMapper(419)) - numReduceTasks: 1
2013-10-03 20:49:33,099 INFO mapred.MapTask (MapTask.java:<init>(949)) - io.sort.mb = 100
2013-10-03 20:49:33,541 INFO mapred.MapTask (MapTask.java:<init>(961)) - data buffer = 79691776/99614720
2013-10-03 20:49:33,542 INFO mapred.MapTask (MapTask.java:<init>(962)) - record buffer = 262144/327680
2013-10-03 20:49:33,550 INFO ExecMapper (ExecMapper.java:configure(69)) - maximum memory = 2088435712
2013-10-03 20:49:33,551 INFO ExecMapper (ExecMapper.java:configure(74)) - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,551 INFO ExecMapper (ExecMapper.java:configure(76)) - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,585 INFO exec.MapOperator (MapOperator.java:setChildren(387)) - Adding alias users to work list for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,587 INFO exec.MapOperator (MapOperator.java:setChildren(402)) - dump TS struct<id:int,name:string>
2013-10-03 20:49:33,588 INFO ExecMapper (ExecMapper.java:configure(91)) -
<MAP>Id =10
<Children>
<TS>Id =0
<Children>
<SEL>Id =1
<Children>
<RS>Id =2
<Parent>Id = 1 null<\Parent>
<\RS>
<\Children>
<Parent>Id = 0 null<\Parent>
<\SEL>
<\Children>
<Parent>Id = 10 null<\Parent>
<\TS>
<\Children>
<\MAP>
2013-10-03 20:49:33,588 INFO exec.MapOperator (Operator.java:initialize(321)) - Initializing Self 10 MAP
2013-10-03 20:49:33,588 INFO exec.TableScanOperator (Operator.java:initialize(321)) - Initializing Self 0 TS
2013-10-03 20:49:33,588 INFO exec.TableScanOperator (Operator.java:initializeChildren(386)) - Operator 0 TS initialized
2013-10-03 20:49:33,589 INFO exec.TableScanOperator (Operator.java:initializeChildren(390)) - Initializing children of 0 TS
2013-10-03 20:49:33,589 INFO exec.SelectOperator (Operator.java:initialize(425)) - Initializing child 1 SEL
2013-10-03 20:49:33,589 INFO exec.SelectOperator (Operator.java:initialize(321)) - Initializing Self 1 SEL
2013-10-03 20:49:33,592 INFO exec.SelectOperator (SelectOperator.java:initializeOp(58)) - SELECT struct<id:int,name:string>
2013-10-03 20:49:33,594 INFO exec.SelectOperator (Operator.java:initializeChildren(386)) - Operator 1 SEL initialized
2013-10-03 20:49:33,595 INFO exec.SelectOperator (Operator.java:initializeChildren(390)) - Initializing children of 1 SEL
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (Operator.java:initialize(425)) - Initializing child 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (Operator.java:initialize(321)) - Initializing Self 2 RS
2013-10-03 20:49:33,595 INFO exec.ReduceSinkOperator (ReduceSinkOperator.java:initializeOp(112)) - Using tag = -1
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator (Operator.java:initializeChildren(386)) - Operator 2 RS initialized
2013-10-03 20:49:33,606 INFO exec.ReduceSinkOperator (Operator.java:initialize(361)) - Initialization Done 2 RS
2013-10-03 20:49:33,606 INFO exec.SelectOperator (Operator.java:initialize(361)) - Initialization Done 1 SEL
2013-10-03 20:49:33,606 INFO exec.TableScanOperator (Operator.java:initialize(361)) - Initialization Done 0 TS
2013-10-03 20:49:33,607 INFO exec.MapOperator (Operator.java:initialize(361)) - Initialization Done 10 MAP
2013-10-03 20:49:33,637 INFO exec.MapOperator (MapOperator.java:cleanUpInputFileChangedOp(494)) - Processing alias users for file hdfs://localhost/user/hive/warehouse/records/users
2013-10-03 20:49:33,638 INFO exec.MapOperator (Operator.java:forward(774)) - 10 forwarding 1 rows
2013-10-03 20:49:33,638 INFO exec.TableScanOperator (Operator.java:forward(774)) - 0 forwarding 1 rows
2013-10-03 20:49:33,639 INFO exec.SelectOperator (Operator.java:forward(774)) - 1 forwarding 1 rows
2013-10-03 20:49:33,641 INFO ExecMapper (ExecMapper.java:map(148)) - ExecMapper: processing 1 rows: used memory = 114294872
2013-10-03 20:49:33,642 INFO exec.MapOperator (Operator.java:close(549)) - 10 finished. closing...
2013-10-03 20:49:33,643 INFO exec.MapOperator (Operator.java:close(555)) - 10 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.MapOperator (Operator.java:logStats(845)) - DESERIALIZE_ERRORS:0
2013-10-03 20:49:33,643 INFO exec.TableScanOperator (Operator.java:close(549)) - 0 finished. closing...
2013-10-03 20:49:33,643 INFO exec.TableScanOperator (Operator.java:close(555)) - 0 forwarded 4 rows
2013-10-03 20:49:33,643 INFO exec.SelectOperator (Operator.java:close(549)) - 1 finished. closing...
2013-10-03 20:49:33,644 INFO exec.SelectOperator (Operator.java:close(555)) - 1 forwarded 4 rows
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator (Operator.java:close(549)) - 2 finished. closing...
2013-10-03 20:49:33,644 INFO exec.ReduceSinkOperator (Operator.java:close(555)) - 2 forwarded 0 rows
2013-10-03 20:49:33,644 INFO exec.SelectOperator (Operator.java:close(570)) - 1 Close done
2013-10-03 20:49:33,644 INFO exec.TableScanOperator (Operator.java:close(570)) - 0 Close done
2013-10-03 20:49:33,644 INFO exec.MapOperator (Operator.java:close(570)) - 10 Close done
2013-10-03 20:49:33,645 INFO ExecMapper (ExecMapper.java:close(215)) - ExecMapper: processed 4 rows: used memory = 114767288
2013-10-03 20:49:33,647 INFO mapred.MapTask (MapTask.java:flush(1289)) - Starting flush of map output
2013-10-03 20:49:33,659 INFO mapred.MapTask (MapTask.java:sortAndSpill(1471)) - Finished spill 0
2013-10-03 20:49:33,661 INFO mapred.Task (Task.java:done(858)) - Task:attempt_local1250355097_0001_m_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) - hdfs://localhost/user/hive/warehouse/records/users/users.txt:0+24
2013-10-03 20:49:33,668 INFO mapred.Task (Task.java:sendDone(970)) - Task 'attempt_local1250355097_0001_m_000000_0' done.
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(229)) - Finishing task: attempt_local1250355097_0001_m_000000_0
2013-10-03 20:49:33,668 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(348)) - Map task executor complete.
2013-10-03 20:49:33,680 INFO mapred.Task (Task.java:initialize(534)) - Using ResourceCalculatorPlugin : null
2013-10-03 20:49:33,680 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) -
2013-10-03 20:49:33,690 INFO mapred.Merger (Merger.java:merge(408)) - Merging 1 sorted segments
2013-10-03 20:49:33,695 INFO mapred.Merger (Merger.java:merge(491)) - Down to the last merge-pass, with 1 segments left of total size: 70 bytes
2013-10-03 20:49:33,695 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) -
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(100)) - maximum memory = 2088435712
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(105)) - conf classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,697 INFO ExecReducer (ExecReducer.java:configure(107)) - thread classpath = [file:/tmp/hadoop-katrina/mapred/local/76384558/archive/-2634153638864376244_689726567_810621743/file/Users/katrina/Code/hive/hive-0.10.0/lib/hive-builtins-0.10.0.jar/]
2013-10-03 20:49:33,698 INFO ExecReducer (ExecReducer.java:configure(149)) -
<OP>Id =3
<Children>
<FS>Id =4
<Parent>Id = 3 null<\Parent>
<\FS>
<\Children>
<\OP>
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initialize(321)) - Initializing Self 3 OP
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initializeChildren(386)) - Operator 3 OP initialized
2013-10-03 20:49:33,698 INFO exec.ExtractOperator (Operator.java:initializeChildren(390)) - Initializing children of 3 OP
2013-10-03 20:49:33,698 INFO exec.FileSinkOperator (Operator.java:initialize(425)) - Initializing child 4 FS
2013-10-03 20:49:33,699 INFO exec.FileSinkOperator (Operator.java:initialize(321)) - Initializing Self 4 FS
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator (Operator.java:initializeChildren(386)) - Operator 4 FS initialized
2013-10-03 20:49:33,701 INFO exec.FileSinkOperator (Operator.java:initialize(361)) - Initialization Done 4 FS
2013-10-03 20:49:33,701 INFO exec.ExtractOperator (Operator.java:initialize(361)) - Initialization Done 3 OP
2013-10-03 20:49:33,706 INFO ExecReducer (ExecReducer.java:reduce(243)) - ExecReducer: processing 1 rows: used memory = 117749816
2013-10-03 20:49:33,707 INFO exec.ExtractOperator (Operator.java:forward(774)) - 3 forwarding 1 rows
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(458)) - Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(460)) - Writing to temp file: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_task_tmp.-ext-10000/_tmp.000000_0
2013-10-03 20:49:33,707 INFO exec.FileSinkOperator (FileSinkOperator.java:createBucketFiles(481)) - New Final Path: FS hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000/000000_0
2013-10-03 20:49:33,737 INFO ExecReducer (ExecReducer.java:close(301)) - ExecReducer: processed 4 rows: used memory = 118477400
2013-10-03 20:49:33,737 INFO exec.ExtractOperator (Operator.java:close(549)) - 3 finished. closing...
2013-10-03 20:49:33,737 INFO exec.ExtractOperator (Operator.java:close(555)) - 3 forwarded 4 rows
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator (Operator.java:close(549)) - 4 finished. closing...
2013-10-03 20:49:33,737 INFO exec.FileSinkOperator (Operator.java:close(555)) - 4 forwarded 0 rows
2013-10-03 20:49:33,990 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-10-03 20:49:34,011 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - 2013-10-03 20:49:34,011 null map = 0%, reduce = 0%
2013-10-03 20:49:34,111 INFO jdbc.JDBCStatsPublisher (JDBCStatsPublisher.java:publishStat(137)) - Stats publishing for key hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000/000000
2013-10-03 20:49:34,143 INFO exec.FileSinkOperator (Operator.java:logStats(845)) - TABLE_ID_1_ROWCOUNT:4
2013-10-03 20:49:34,143 INFO exec.ExtractOperator (Operator.java:close(570)) - 3 Close done
2013-10-03 20:49:34,145 INFO mapred.Task (Task.java:done(858)) - Task:attempt_local1250355097_0001_r_000000_0 is done. And is in the process of commiting
2013-10-03 20:49:34,146 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(466)) - reduce > reduce
2013-10-03 20:49:34,147 INFO mapred.Task (Task.java:sendDone(970)) - Task 'attempt_local1250355097_0001_r_000000_0' done.
2013-10-03 20:49:35,026 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - 2013-10-03 20:49:35,026 null map = 0%, reduce = 100%
2013-10-03 20:49:35,030 INFO exec.ExecDriver (SessionState.java:printInfo(392)) - Ended Job = job_local1250355097_0001
2013-10-03 20:49:35,033 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1361)) - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000 to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate
2013-10-03 20:49:35,036 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1372)) - Moving tmp dir: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/_tmp.-ext-10000.intermediate to: hdfs://localhost/tmp/hive-katrina/hive_2013-10-03_20-49-28_110_131412476548383989/-ext-10000