I'm looking for a way to extract and save text patterns using Pig. Suppose I have the following input:

ae988852ed9eabe3b5298d8b4c3b652e    I Never In My Life Gave A Guy No Money For Gas Or Food besides That Simpson Guy SMH I Fault Myself Though

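From this data, I want to extract patterns of consecutive words and save them into a bag. For example, {i, never} would be the first, {never, in} the second, and so on. I know I would start the script along these lines: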

myinput = LOAD '/user/hive/warehouse/twitter_raw/$date' USING PigStorage('\t') AS (id,  mess);
strings = FOREACH myinput GENERATE $0 AS id, LOWER($1) AS mess;

But what would the next step be?

1 Answer

There may be some tricky way to get this result using only built-in functions, but a simple UDF also does the job:

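package com.example;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Takes a bag of single-field tuples and returns a bag holding one
// (previous, current) tuple for each pair of adjacent input tuples.
public class SlidingTuple extends EvalFunc<DataBag> {

    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag inputBag = (DataBag) input.get(0);
            // A null bag (for example, a null message) yields a null result.
            if (inputBag == null) {
                return null;
            }
            DataBag result = bagFactory.newDefaultBag();
            Iterator<Tuple> it = inputBag.iterator();
            // An empty bag has no first element, so return the empty result
            // instead of letting it.next() throw.
            if (!it.hasNext()) {
                return result;
            }
            Tuple previous = it.next();
            while (it.hasNext()) {
                Tuple current = it.next();
                Tuple tuple = tupleFactory.newTuple(2);
                tuple.set(0, previous.get(0));
                tuple.set(1, current.get(0));
                result.add(tuple);
                previous = current;
            }
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("SlidingTuple error", e);
        }
    }
}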
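
To check the pairing logic without a Pig installation, here is a minimal plain-Java sketch of the same sliding-window idea. The class name SlidingPairsSketch and the method slidingPairs are hypothetical names made up for this illustration and are not part of Pig; the sketch assumes words are lowercased and split on single spaces, as LOWER and TOKENIZE(mess, ' ') would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone illustration; it does not use any Pig classes.
class SlidingPairsSketch {

    // Returns every pair of adjacent words, in order.
    // Fewer than two words yields an empty list.
    static List<String[]> slidingPairs(List<String> words) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            pairs.add(new String[] { words.get(i), words.get(i + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Lowercase, then split on a single space (assumed tokenization).
        List<String> words =
                Arrays.asList("I Never In My Life".toLowerCase().split(" "));
        for (String[] pair : slidingPairs(words)) {
            System.out.println("(" + pair[0] + "," + pair[1] + ")");
        }
    }
}
```

For the five words above this prints (i,never), (never,in), (in,my) and (my,life), one pair per line, which is the shape of the bag the UDF is meant to produce for each message.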

Then load the data and tokenize each message:

A = LOAD '/user/hive/warehouse/twitter_raw/$date' USING PigStorage('\t') 
      AS (id:chararray,  mess:chararray);

B = foreach A generate TOKENIZE(mess, ' ') as words;

Then apply the custom UDF (the jar that contains it has to be loaded with REGISTER first):

C = foreach B generate com.example.SlidingTuple(words);
answered 2012-09-22T12:35:41.847