
I have a fully functional UIMA job that does simple annotation. I can successfully launch it via my local CAS GUI.

I have been trying to run the UIMA job on Hadoop using Apache Behemoth. Has anyone worked on this? The job runs successfully, but there is no output from the UIMA job in the Hadoop output directory. The Hadoop job tracker shows that the job completed successfully and that it copied its input data to the final output directory.

Can someone point me to what could be going on here, and are there any additional changes we need to make in our UIMA code?

Thanks


2 Answers


Here are the steps I put together for a small pipeline:

  • Export the UIMA pipeline as a PEAR (Your-pipeline.pear)
  • Copy it to HDFS
  • Generate the Behemoth corpus (**remember that all the paths below are HDFS paths**)
    hadoop jar tika/target/behemoth-tika-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i /user/blah/ -o /user/blah/
    
  • Process with your pipeline:
    hadoop jar uima/target/behemoth-uima-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.uima.UIMADriver /user/blah/ /user/blah/ /apps/Your-pipeline.pear
  • List the annotations:
    hadoop jar uima/target/behemoth-uima-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i -a /user/blah/
    
  • Convert the annotations to text:
    hadoop jar uima/target/behemoth-uima-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.uima.UIMABin2TxtConverter -a -i /user/blah/ -o /user/blah/
    
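The steps above can be chained in a small driver script. The input/output paths, jar versions, and PEAR location are placeholders taken from the answer, not canonical values; the script defaults to a dry run that only prints each hadoop invocation so you can review it before running on a real cluster.

```shell
#!/bin/sh
# Sketch of the Behemoth/UIMA pipeline from the steps above.
# All corpus paths are HDFS paths (placeholders); adjust to your cluster.
set -e

DRY_RUN=${DRY_RUN:-1}           # default: print commands instead of running them
CORPUS_IN=/user/blah/input      # raw documents already copied to HDFS (placeholder)
TIKA_OUT=/user/blah/tika        # Behemoth corpus produced by Tika (placeholder)
UIMA_OUT=/user/blah/uima        # corpus enriched with UIMA annotations (placeholder)
TEXT_OUT=/user/blah/text        # plain-text annotation dump (placeholder)
PEAR=/apps/Your-pipeline.pear   # UIMA pipeline packaged as a PEAR, on HDFS

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

# 1. Generate the Behemoth corpus with Tika
run hadoop jar tika/target/behemoth-tika-1.1-SNAPSHOT-job.jar \
    com.digitalpebble.behemoth.tika.TikaDriver -i "$CORPUS_IN" -o "$TIKA_OUT"

# 2. Run the UIMA pipeline over the Tika corpus
run hadoop jar uima/target/behemoth-uima-1.1-SNAPSHOT-job.jar \
    com.digitalpebble.behemoth.uima.UIMADriver "$TIKA_OUT" "$UIMA_OUT" "$PEAR"

# 3. List the annotations
run hadoop jar uima/target/behemoth-uima-1.1-SNAPSHOT-job.jar \
    com.digitalpebble.behemoth.util.CorpusReader -a -i "$UIMA_OUT"

# 4. Convert the annotations to text
run hadoop jar uima/target/behemoth-uima-1.1-SNAPSHOT-job.jar \
    com.digitalpebble.behemoth.uima.UIMABin2TxtConverter -a -i "$UIMA_OUT" -o "$TEXT_OUT"
```

Run with `DRY_RUN=0` once the printed commands look right for your environment.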
answered 2013-12-19T16:25:35.797

Try this scenario:

1) Generate a Behemoth corpus

2) Run the Tika job on the Behemoth corpus => Tika corpus

3) Run the UIMA job on the Tika corpus => UIMA corpus

4) Inspect the UIMA output corpus with Behemoth's CorpusReader using the -a option; it shows the UIMA annotations you defined in behemoth-site.xml under /hadoop/conf.

However, I have not figured out how to extract the generated annotations from the Behemoth (UIMA) corpus.

I also have a CAS consumer (in the PEAR file) that is supposed to write the UIMA annotations to a file on the local file system (not in HDFS), but I cannot find that file anywhere on my file system.

answered 2012-10-05T18:54:15.857