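I'm currently using Tika to extract text from files uploaded to my Rails app, which runs on AWS Elastic Beanstalk (64bit Amazon Linux 2016.03 v2.1.2 running Ruby 2.2). I also want to index scanned images, so I need to install Tesseract.
I can get it working by installing it from source as shown here, but that adds 10 minutes to every deployment to a new instance. Is there a faster way to do this?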
.ebextensions/02-tesseract.config
packages:
  yum:
    autoconf: []
    automake: []
    libtool: []
    libpng-devel: []
    libtiff-devel: []
    zlib-devel: []
container_commands:
  01-command:
    command: mkdir -p install
    cwd: /home/ec2-user
  02-command:
    command: cp .ebextensions/scripts/install_tesseract.sh /home/ec2-user/install/
  03-command:
    command: bash install/install_tesseract.sh
    cwd: /home/ec2-user
.ebextensions/scripts/install_tesseract.sh
#!/usr/bin/env bash

# Change into the shared install directory
cd_to_install () {
  cd /home/ec2-user/install
}

# Change into a subdirectory of the install directory
cd_to () {
  cd "/home/ec2-user/install/$1"
}
# Only build if tesseract is not already on the PATH
if ! [ -x "$(command -v tesseract)" ]; then
  # Add `/usr/local/bin` to PATH
  echo 'pathmunge /usr/local/bin' > /etc/profile.d/usr_local.sh
  chmod +x /etc/profile.d/usr_local.sh

  # Install leptonica
  cd_to_install
  wget http://www.leptonica.org/source/leptonica-1.73.tar.gz
  tar -zxvf leptonica-1.73.tar.gz
  cd_to leptonica-1.73
  ./configure
  make
  make install
  rm -rf /home/ec2-user/install/leptonica-1.73.tar.gz
  rm -rf /home/ec2-user/install/leptonica-1.73

  # Install tesseract ~ the jewel of Odin's treasure room
  cd_to_install
  wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
  tar -zxvf 3.04.01.tar.gz
  cd_to tesseract-3.04.01
  ./autogen.sh
  ./configure
  make
  make install
  ldconfig
  rm -rf /home/ec2-user/install/3.04.01.tar.gz
  rm -rf /home/ec2-user/install/tesseract-3.04.01

  # Install tessdata
  cd_to_install
  wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
  tar -zxvf 3.04.00.tar.gz
  cp /home/ec2-user/install/tessdata-3.04.00/eng.* /usr/local/share/tessdata/
  rm -rf /home/ec2-user/install/3.04.00.tar.gz
  rm -rf /home/ec2-user/install/tessdata-3.04.00
fi
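
I haven't measured which step takes up most of the 10 minutes. A rough way to find out would be to wrap each step in a small timing helper, sketched here. `time_step` is a hypothetical name and is not part of the script above; the sketch assumes `date +%s` is available on the instance.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the original script): run a command,
# then print how many whole seconds it took and its exit status.
time_step () {
  label=$1
  shift
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "${label}: $((end - start))s (exit ${status})"
  return "$status"
}

# Example call with a trivial command
time_step "noop" true
```

In the install script, calls such as `time_step "leptonica make" make` could replace the bare `make` lines, so the deployment log would show how long each build step takes.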