hadoop - 使用 Java 运行 EmbeddedPig 时，Pig 脚本中的 ORDER BY 作业失败

Question

我有以下猪脚本，它使用 grunt shell 完美运行（将结果存储到 HDFS 没有任何问题）；但是，如果我使用 Java EmbeddedPig 运行相同的脚本，最后一个作业 (ORDER BY) 会失败。如果我将 ORDER BY 作业替换为其他作业，例如 GROUP 或 FOREACH GENERATE，则整个脚本在 Java EmbeddedPig 中成功。所以我认为是 ORDER BY 导致了这个问题。有人有这方面的经验吗？任何帮助，将不胜感激！

猪脚本：

    REGISTER pig-udf-0.0.1-SNAPSHOT.jar;
    user_similarity = LOAD '/tmp/sample-sim-score-results-31/part-r-00000' USING PigStorage('\t') AS (user_id: chararray, sim_user_id: chararray, basic_sim_score: float, alt_sim_score: float);
    simplified_user_similarity = FOREACH user_similarity GENERATE $0 AS user_id, $1 AS sim_user_id, $2 AS sim_score;
    grouped_user_similarity = GROUP simplified_user_similarity BY user_id;
    ordered_user_similarity = FOREACH grouped_user_similarity {                           
        sorted = ORDER simplified_user_similarity BY sim_score DESC;
        top    = LIMIT sorted 10;
        GENERATE group, top;
    };
    top_influencers = FOREACH ordered_user_similarity GENERATE com.aol.grapevine.similarity.pig.udf.AssignPointsToTopInfluencer($1, 10);
    all_influence_scores = FOREACH top_influencers GENERATE FLATTEN($0);
    grouped_influence_scores = GROUP all_influence_scores BY bag_of_topSimUserTuples::user_id;
    influence_scores = FOREACH grouped_influence_scores GENERATE group AS user_id, SUM(all_influence_scores.bag_of_topSimUserTuples::points) AS influence_score;
    ordered_influence_scores = ORDER influence_scores BY influence_score DESC;
    STORE ordered_influence_scores INTO '/tmp/cc-test-results-1' USING PigStorage();

Pig的错误日志：

12/04/05 10:00:56 INFO pigstats.ScriptState: Pig script settings are added to the job
12/04/05 10:00:56 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/04/05 10:00:58 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/04/05 10:00:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/04/05 10:00:58 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/04/05 10:00:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/04/05 10:00:58 INFO input.FileInputFormat: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating tmp-1546565755 in /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134-work-6955502337234509704 with rwxr-xr-x
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
12/04/05 10:00:58 INFO mapred.TaskRunner: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/pigsample_854728855_1333645258470
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.jar.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.jar.crc
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.split.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.split.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.splitmetainfo.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.splitmetainfo.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.xml.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.xml.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.jar <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.jar
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.split <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.split
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.splitmetainfo <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.splitmetainfo
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.xml <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.xml
12/04/05 10:00:59 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/04/05 10:00:59 INFO mapred.MapTask: io.sort.mb = 100
12/04/05 10:00:59 INFO mapred.MapTask: data buffer = 79691776/99614720
12/04/05 10:00:59 INFO mapred.MapTask: record buffer = 262144/327680
12/04/05 10:00:59 WARN mapred.LocalJobRunner: job_local_0004
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:139)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153)
    at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:112)
    ... 6 more
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Deleted path /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:59 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0004
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: job job_local_0004 has failed! Stop running all dependent jobs
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/04/05 10:01:04 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed!
12/04/05 10:01:04 INFO pigstats.PigStats: Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
0.20.2-cdh3u3    0.8.1-cdh3u3    cchuang    2012-04-05 10:00:34    2012-04-05 10:01:04    GROUP_BY,ORDER_BY

Some jobs have failed! Stop running all dependent jobs

Job Stats (time in seconds):
JobId    Maps    Reduces    MaxMapTime    MinMapTIme    AvgMapTime    MaxReduceTime    MinReduceTime    AvgReduceTime    Alias    Feature    Outputs
job_local_0001    0    0    0    0    0    0    0    0    all_influence_scores,grouped_user_similarity,simplified_user_similarity,user_similarity    GROUP_BY   
job_local_0002    0    0    0    0    0    0    0    0    grouped_influence_scores,influence_scores    GROUP_BY,COMBINER   
job_local_0003    0    0    0    0    0    0    0    0    ordered_influence_scores    SAMPLER   

Failed Jobs:
JobId    Alias    Feature    Message    Outputs
job_local_0004    ordered_influence_scores    ORDER_BY    Message: Job failed! Error - NA    /tmp/cc-test-results-1,

Input(s):
Successfully read 0 records from: "/tmp/sample-sim-score-results-31/part-r-00000"

Output(s):
Failed to produce result in "/tmp/cc-test-results-1"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local_0001    ->    job_local_0002,
job_local_0002    ->    job_local_0003,
job_local_0003    ->    job_local_0004,
job_local_0004


12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: Some jobs have failed! Stop running all dependent jobs

score 0 · Accepted Answer

0

确保 PIG_HOME 环境变量设置为 pig 安装。

于 2013-08-03T18:12:16.083 回答

hadoop - 使用 Java 运行 EmbeddedPig 时，Pig 脚本中的 ORDER BY 作业失败

1 回答 1

Related

Reference