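In the code below, how much does renaming fields after a join affect the script's run time? Is this optimized in Pig, or does it really go through every record?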

-- tables A: (f1, f2, id)  and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B by id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;

Does the FOREACH command go through every record of C? If so, is there any way to optimize it?

Thanks.

1 Answer

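Don't worry about optimizing this. Renaming the fields may add a little overhead, but it won't trigger an additional Map/Reduce job. The field projection happens in the reduce phase of your JOIN.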

Consider the two scripts below and the Map Reduce plan that explain produces for each.

Without renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------

With renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;
C = foreach C generate A::f1 as f1,  -- This
                       A::f2 as f2,  -- section
                       B::id as id,  -- is
                       B::g1 as g1,  -- different
                       B::g2 as g2;  --

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------

The difference is in the reduce plan. Without renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false

Versus with renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
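
To make the Project[...] entries in the second reduce plan easier to read, here is a small Python sketch that maps each projected column position back to its source field. It assumes the joined tuple lists A's fields followed by B's fields, which is how the indices 0, 1, 5, 3, 4 line up with the FOREACH above. The helper name describe_projection is made up for illustration and is not part of Pig.

```python
# Layout of a joined tuple after `C = join A by id, B by id;`
# (assumption: A's fields first, then B's fields).
JOINED_SCHEMA = ["A::f1", "A::f2", "A::id", "B::g1", "B::g2", "B::id"]

# Column positions copied from the Project[...] operators in the reduce plan.
PROJECT_INDICES = [0, 1, 5, 3, 4]

# New names taken from the `as` clauses of the FOREACH.
NEW_NAMES = ["f1", "f2", "id", "g1", "g2"]


def describe_projection(schema, indices, names):
    """Pair each projected source column with the name it is given (hypothetical helper)."""
    return [(schema[i], name) for i, name in zip(indices, names)]


if __name__ == "__main__":
    for source, renamed in describe_projection(JOINED_SCHEMA, PROJECT_INDICES, NEW_NAMES):
        print(f"{source} -> {renamed}")
```

Position 5 is B::id, which is why it appears third in the plan: the FOREACH lists B::id as its third output field.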

In short, there will be other things you can optimize in your script before worrying about renaming. Since you'll be going through every record anyway because of the join, renaming will just be a cheap extra step.
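
As a rough illustration of why the extra step is cheap, the Python sketch below models a reduce-side join that emits joined tuples one at a time, with the projection applied to each tuple as it is produced. This is a simplified model and not Pig's actual implementation; the function names reduce_side_join and project and the sample rows are invented for the example.

```python
from collections import defaultdict


def reduce_side_join(a_rows, b_rows, key_index=2):
    """Group both inputs by join key, then yield each joined tuple (inner join)."""
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[key_index]][0].append(row)
    for row in b_rows:
        groups[row[key_index]][1].append(row)
    for left, right in groups.values():
        for a in left:
            for b in right:
                yield a + b  # A's fields followed by B's fields


def project(rows, indices):
    """Pick the given column positions from each tuple as it streams past."""
    for row in rows:
        yield tuple(row[i] for i in indices)


if __name__ == "__main__":
    # Made-up sample data with the layouts (f1, f2, id) and (g1, g2, id).
    a = [("a1", "a2", 1), ("a3", "a4", 2)]
    b = [("b1", "b2", 1), ("b3", "b4", 3)]
    # The projection wraps the join generator, so each joined tuple is
    # handled once and no second pass over the data is needed.
    for record in project(reduce_side_join(a, b), [0, 1, 5, 3, 4]):
        print(record)
```

Because project consumes the join output lazily, the renaming costs one small tuple copy per joined record rather than another scan of the data.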

answered 2012-08-07T16:05:23.530