
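The CSV files contain boolean user preference data (userid, itemid). A preprocessor checks the files for any inconsistencies. I have also checked them manually, and the data appears to be consistent and correctly formatted. Two things to note:
- If the Hadoop job has only one input file, that is, if all preferences are exported to a single CSV with no duplicate (userid, itemid) entries, the job never fails.
- The job fails randomly when the Hadoop directory contains multiple CSV files: an initial dump of user preferences plus daily delta files of user preferences.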

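If the CSV data is always consistent and correctly formatted, the job should not fail with an ArrayIndexOutOfBounds exception. Could the job fail when the delta files contain duplicate (userid, itemid) entries? Because the preferences are boolean, many of these entries are repeated across multiple delta files.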

The logs do not seem to print the piece of data that caused the error. Here are the logs:

2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.JobInProgress: job_201208021510_0221: nMaps=2 nReduces=1 max=-1
2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.JobTracker: Job job_201208021510_0221 added successfully for user 'deploy' to queue 'default'
2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.AuditLogger: USER=deploy  IP=127.0.0.1    OPERATION=SUBMIT_JOB    TARGET=job_201208021510_0221    RESULT=SUCCESS
2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201208021510_0221
2012-08-09 15:03:22,653 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201208021510_0221
2012-08-09 15:03:23,023 INFO org.apache.hadoop.mapred.JobInProgress: jobToken generated and stored with users keys in /zenius/hadoop/tmp/mapred/system/job_201208021510_0221/jobToken
2012-08-09 15:03:23,027 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201208021510_0221 = 56518256. Number of splits = 2
2012-08-09 15:03:23,027 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208021510_0221_m_000000 has split on node:/default-rack/localhost
2012-08-09 15:03:23,028 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208021510_0221_m_000001 has split on node:/default-rack/localhost
2012-08-09 15:03:23,028 INFO org.apache.hadoop.mapred.JobInProgress: job_201208021510_0221 LOCALITY_WAIT_FACTOR=1.0
2012-08-09 15:03:23,028 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201208021510_0221 initialized successfully with 2 map tasks and 1 reduce tasks.
2012-08-09 15:03:25,787 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201208021510_0221_m_000003_0' to tip task_201208021510_0221_m_000003, for tracker 'tracker_localhost:localhost/127.0.0.1:50158'
2012-08-09 15:03:31,794 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201208021510_0221_m_000003_0' has completed task_201208021510_0221_m_000003 successfully.
2012-08-09 15:03:31,795 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_0' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158'
2012-08-09 15:03:31,796 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000
2012-08-09 15:03:31,796 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000001_0' to tip task_201208021510_0221_m_000001, for tracker 'tracker_localhost:localhost/127.0.0.1:50158'
2012-08-09 15:03:31,796 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000001
2012-08-09 15:03:37,800 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201208021510_0221_m_000001_0' has completed task_201208021510_0221_m_000001 successfully.
2012-08-09 15:03:37,801 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201208021510_0221_r_000000_0' to tip task_201208021510_0221_r_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158'
2012-08-09 15:03:49,807 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_0: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47)
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a failed task task_201208021510_0221_m_000000
2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_1' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158'
2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000
2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201208021510_0221_m_000000_0'
2012-08-09 15:04:14,603 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_1: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47)
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

2012-08-09 15:04:17,606 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a failed task task_201208021510_0221_m_000000
2012-08-09 15:04:17,607 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_2' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158'
2012-08-09 15:04:17,607 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000
2012-08-09 15:04:17,607 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201208021510_0221_m_000000_1'
2012-08-09 15:04:35,618 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_2: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47)
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a failed task task_201208021510_0221_m_000000
2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_3' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158'
2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000
2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201208021510_0221_m_000000_2'
2012-08-09 15:04:56,632 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_3: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47)
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201208021510_0221_m_000000 has failed 4 times.
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobInProgress: TaskTracker at 'localhost' turned 'flaky'
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201208021510_0221
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201208021510_0221'
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_CLEANUP) 'attempt_201208021510_0221_m_000002_0' to tip...

1 Answer


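No, this definitely means there is a bad line somewhere in the data. The most likely culprits are a stray blank line, a header line, a "comment" line, or some stray file in the same directory, such as _SUCCESS.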
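One way to track such a line down is to scan every file in the input directory and report lines that do not split into at least two numeric fields. The sketch below is only an illustration. It assumes the files have been copied to a local directory, that fields are separated by a comma or a tab, and that IDs are plain integers. The names `is_valid` and `find_bad_lines` are hypothetical helpers, not part of Mahout or Hadoop.

```python
import os
import re
import tempfile

# Assumption: fields are separated by a comma or a tab, and IDs are integers.
DELIMITER = re.compile(r"[,\t]")
INTEGER_ID = re.compile(r"-?\d+")


def is_valid(line):
    """Return True if the line has at least two fields and the first two are integers."""
    fields = DELIMITER.split(line)
    if len(fields) < 2:
        return False
    return all(INTEGER_ID.fullmatch(field.strip()) for field in fields[:2])


def find_bad_lines(input_dir):
    """Return (filename, line_number, raw_line) for every suspicious line.

    Every regular file in the directory is scanned, not only *.csv files,
    so stray files show up in the report as well.
    """
    bad = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                text = line.rstrip("\r\n")
                if not is_valid(text):
                    bad.append((name, number, text))
    return bad


# Small demonstration on made-up data: one clean file and one delta file
# containing a blank line, a header line, and a line with a single field.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "initial.csv"), "w", encoding="utf-8") as out:
        out.write("1,100\n2,200\n")
    with open(os.path.join(demo_dir, "delta.csv"), "w", encoding="utf-8") as out:
        out.write("1,100\n\nuserid,itemid\n3\n")
    report = find_bad_lines(demo_dir)

for name, number, text in report:
    print(f"{name}:{number}: {text!r}")
```

Duplicate (userid, itemid) pairs pass this check untouched, since each duplicate still splits into two fields. Blank lines and single-field lines are flagged, and those are the kind of input that would leave nothing to read at index 1.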

Answered 2012-08-09T14:53:49.883