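Don't worry about optimizing this. Renaming the fields will probably add a small overhead, but it won't trigger an extra Map/Reduce job. The field projection will be done in the reducer of your JOIN.
Consider the two pieces of code below and the Map Reduce plans that explain gives for them.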
Without renaming:
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------
With renaming:
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
C = foreach C generate A::f1 as f1, -- This
A::f2 as f2, -- section
B::id as id, -- is
B::g1 as g1, -- different
B::g2 as g2; --
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------
The difference is in the Reduce plan. Without renaming:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
versus with renaming:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
In short, there will be other things you can optimize in your script before worrying about renaming. Since you'll be going through every record anyway because of the join, renaming will just be a cheap extra step.
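If you want to check this against your own script, a minimal sketch along these lines should work. It reuses the same example inputs 'first' and 'second'; the intermediate alias J is my own naming for illustration and is not needed in a real script. describe shows the prefixed schema that the join produces, and explain prints the plans without running the job.

-- Same example inputs as above; 'first' and 'second' are placeholder paths.
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

-- J is a hypothetical intermediate alias, used so the joined schema can be inspected.
J = join A by id, B by id;

-- Show the schema of the joined relation, including the A:: and B:: prefixes.
describe J;

-- Rename by projecting each prefixed field to a plain name.
C = foreach J generate A::f1 as f1, A::f2 as f2, B::id as id,
                       B::g1 as g1, B::g2 as g2;

-- Print the plans for C; this does not execute the job.
explain C;

In the plan you should still see a single MapReduce node, with the renaming showing up only as the extra New For Each in the Reduce Plan. The Project indices there refer to positions in the joined tuple (f1, f2, id, g1, g2, id), so Project[bytearray][5] is B::id.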