I am trying to implement the Hortonworks data pipeline example on a real cluster. I have HDP 2.2 installed, but the UI shows the following error on the Processes and Datasets tabs:

Failed to load data. Error: 400 Bad Request

All of my services are running except HBase, Kafka, Knox, Ranger, Slider, and Spark.

I have read the Falcon entity specification, which describes the various tags for cluster, feed, and process definitions, and I modified the feed and process XML configuration files as shown below.

Cluster definition

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="Analytics1" colo="Bangalore" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://node3.com.analytics:50070" version="2.6.0"/>
        <interface type="write" endpoint="hdfs://node3.com.analytics:8020" version="2.6.0"/>
        <interface type="execute" endpoint="node1.com.analytics:8050" version="2.6.0"/>
        <interface type="workflow" endpoint="http://node1.com.analytics:11000/oozie/" version="4.1.0"/>
        <interface type="messaging" endpoint="tcp://node1.com.analytics:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/user/falcon/primaryCluster/staging"/>
        <location name="working" path="/user/falcon/primaryCluster/working"/>
    </locations>
    <ACL owner="falcon" group="hadoop"/>
</cluster>
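
For reference, the entities can be submitted and verified with the standard Falcon CLI; the local file name primaryCluster.xml is a placeholder for wherever the definition is saved:

# submit the cluster entity, then read back the stored definition
falcon entity -type cluster -submit -file primaryCluster.xml
falcon entity -type cluster -name primaryCluster -definition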

Feed definitions

Raw email feed

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers,classification=secure</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(3)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/falcon/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>

Cleansed email feed

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
    <tags>owner=USMarketing,classification=Secure,externalSource=USProdEmailServers,externalTarget=BITools</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(10)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/falcon/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>

Process definitions

Raw email ingest process

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
    <tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup,externalSystem=USWestEmailServers</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <outputs>
        <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailIngestWorkflow" version="2.0.0" engine="oozie" path="/user/falcon/apps/ingest/fs"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
    <ACL owner="falcon" group="hadoop"/>
</process>

Cleanse email process

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
    <tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <inputs>
        <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailCleanseWorkflow" version="5.0" engine="pig" path="/user/falcon/apps/pig/id.pig"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
    <ACL owner="falcon" group="hadoop"/>
</process>
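
For reference, the usual submission order follows the entity dependencies (cluster first, then feeds, then processes); with the standard Falcon CLI that looks roughly like this (local XML file names are placeholders):

# submit feeds, then processes, then schedule the processes
falcon entity -type feed -submit -file rawEmailFeed.xml
falcon entity -type feed -submit -file cleansedEmailFeed.xml
falcon entity -type process -submit -file rawEmailIngestProcess.xml
falcon entity -type process -submit -file cleanseEmailProcess.xml
falcon entity -type process -schedule -name rawEmailIngestProcess
falcon entity -type process -schedule -name cleanseEmailProcess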

I made no changes to the ingest.sh, workflow.xml, and id.pig files. They are present in HDFS at /user/falcon/apps/ingest/fs (ingest.sh and workflow.xml) and /user/falcon/apps/pig (id.pig). I was also unsure whether the hidden .DS_Store files were needed, so I did not copy them to those HDFS locations.
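
For what it's worth, the files can be checked with plain hadoop fs commands:

hadoop fs -ls /user/falcon/apps/ingest/fs
hadoop fs -ls /user/falcon/apps/pig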

ingest.sh

#!/bin/bash
# Original sandbox version, kept commented out for reference:
# curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put wiki-data/*.txt $1
# Download and extract the Enron sample data, then copy the text files
# into the HDFS directory passed as the first argument ($1):
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put enron_with_categories/*/*.txt $1

workflow.xml

<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>ingest.sh</exec>
            <argument>${feedInstancePaths}</argument>
            <file>${wf:appPath()}/ingest.sh#ingest.sh</file>
            <!-- <file>/tmp/ingest.sh#ingest.sh</file> -->
            <!-- <capture-output/> -->
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

id.pig

-- Load the ingested email files, keep only the first field as id,
-- and write the result to the output feed location.
A = load '$input' using PigStorage(',');
B = foreach A generate $0 as id;
store B into '$output' USING PigStorage();

I am not sure how the flow of this HDP example is supposed to work end to end, and I would appreciate it if someone could clarify it.

Specifically, I don't understand where the argument $1 that is passed to ingest.sh comes from. I believe it is the HDFS location where the incoming data is stored. I noticed that workflow.xml has the tag <argument>${feedInstancePaths}</argument>.

Where does feedInstancePaths get its value from? I suspect I am getting the error because the feed is not being stored in the correct location, but that may be a separate issue.
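
My current understanding (please correct me if I am wrong): Falcon resolves ${feedInstancePaths} from the output feed's data location template and the instance time, and Oozie then passes it to the shell action as $1. For example, the rawEmailFeed instance for 2016-03-01T05:00Z should resolve roughly like this (the timestamp is just an illustration):

# location template from the rawEmailFeed definition
/user/falcon/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
# resolved instance path handed to ingest.sh as $1
/user/falcon/input/enron/2016-03-01-05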

The user falcon also has 755 permissions on all HDFS directories under /user/falcon.

Any help and suggestions would be much appreciated.

1 Answer

You are running your own cluster, but the tutorial depends on a resource that is fetched inside the shell script (ingest.sh):

curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz

My guess is that sandbox.hortonworks.com cannot be resolved from your cluster, so you don't have the required resource wiki-data.tar.gz. The tutorial only works on the sandbox it ships with.
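
A quick way to check this from a cluster node (URLs taken from the posted script; the first is expected to fail outside the sandbox):

# sandbox-only resource referenced in the tutorial
curl -sSI http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz
# the Berkeley URL the posted ingest.sh actually uses
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar tz | head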

Answered 2017-02-15T10:58:23.113