amazon-ec2 - tesseract-ocr works on EC2, not lambda

Question

My goal is to run tesseract-ocr in AWS Lambda.

I've built an EC2 instance that attempts to mirror the Lambda environment. Executing tesseract without parameters succeeds in both environments. However, any attempt at substantive image processing, e.g. this code:

tess = child_process.exec('tesseract input.tif output -l eng -psm 1 hocr', function(error, stdout, stderr) {
...

runs successfully on my EC2 box, but fails in Lambda with this error:

Error: Command failed: Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error during processing.

 at ChildProcess.exithandler (child_process.js:648:15)
 at ChildProcess.emit (events.js:98:17)
 at maybeClose (child_process.js:756:16)
 at Process.ChildProcess._handle.onexit (child_process.js:823:5)
Error code: 1
Signal received: null

Lambda is assuming an IAM role with administrative privileges ({ "Effect": "Allow", "Action": "*", "Resource": "*" }).

The "Error during processing" message is emitted by tesseract as a top-level catch-all. I'm going to instrument tesseract and try to narrow the problem down further.

How I got here:

My EC2 machine is a t2.micro running Amazon Linux in us-east-1 (amzn-ami-hvm-2014.09.2.x86_64-ebs (ami-146e2a7c)).
I installed node 0.10.33 and aws-sdk@2.0.23, which match the Lambda versions.
I compiled tesseract and leptonica from source, added an rpath, and ran ldd to confirm that all dependencies are found.
The tesseract binaries and liblept.so are all in my root directory (/var/task).

I'd like to know what's going wrong - or how to diagnose it.

Thank you, Dave

score 3 · Accepted Answer

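Short answer: the output must be placed in /tmp, e.g.

tesseract input.tif /tmp/output -l eng -psm 1 hocr

Slightly longer answer: tesseract calls fopen with mode wb under the hood, which is apparently forbidden in /var/task.

I would have noticed this days ago, but Lambda wasn't propagating my deployment packages. On one occasion I tried putting the output in /tmp and it had no effect, but that was because Lambda was executing a stale version of my function. The solution was to always delete the function before calling update function.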
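For reference, here is a minimal Node.js sketch of the /tmp fix, not a definitive implementation. The helper names buildTesseractArgs and runTesseract are hypothetical, and the sketch assumes that tesseract is reachable on the PATH as in the question's exec call. It forces the output base into /tmp regardless of the name passed in, so tesseract never tries to write under /var/task.

```javascript
var childProcess = require('child_process');
var path = require('path');

// Directory the answer reports as writable in Lambda; /var/task is not.
var WRITABLE_DIR = '/tmp';

// Hypothetical helper: build the tesseract argument list so that the
// output base always lands in /tmp, whatever path the caller passes.
function buildTesseractArgs(inputPath, outputName) {
  var outputBase = WRITABLE_DIR + '/' + path.basename(outputName);
  return [inputPath, outputBase, '-l', 'eng', '-psm', '1', 'hocr'];
}

// Hypothetical wrapper: run tesseract and pass the output base to the
// callback. Assumes a `tesseract` binary can be found on the PATH.
function runTesseract(inputPath, outputName, callback) {
  var args = buildTesseractArgs(inputPath, outputName);
  childProcess.execFile('tesseract', args, function (error, stdout, stderr) {
    if (error) {
      // stderr carries tesseract's own message, such as
      // "Error during processing."
      return callback(new Error('tesseract failed: ' + stderr));
    }
    // tesseract appends its own file extension to this base name.
    callback(null, args[1]);
  });
}
```

Calling runTesseract('input.tif', 'output', cb) runs the same command as the short answer above. Using execFile with an argument array instead of exec with a single string also avoids shell quoting problems if the file names ever come from user input.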
