I'm looking for a PDF-to-HTML and OCR solution, either as a cloud service or as an SDK. From my searches, I can see there are a bunch of services out there. I tried some of them and got a rough idea of what they offer. I'd like to know whether any of you use such a service.
My biggest concern is automation: I need HTML output that I can use for information extraction, and I'd like structured data such as tables to be preserved. Most of the services produce HTML in either a character format (a CSS/HTML tag for each character) or a paragraph format (a CSS/HTML tag for each line).
What I have checked so far:
- ABBYY Cloud SDK: they don't have a PDF-to-HTML service, but they do have PDF-to-XML, which may be convertible to HTML with XSLT. Their OCR service with text output is also quite good.
- cloudconvert.org: they give the same results as the Ubuntu pdftohtml command, which is based on poppler (Xpdf 3.0).
- pdftohtml command (tested on Ubuntu): the result is full of < p > tags.
- Aspose.PDF: they don't have a PDF-to-HTML service in the cloud, but they have good integration with Google Drive, Dropbox and Amazon S3.
- PDFNet by PDFTron: the result has a complex CSS and HTML structure, with almost one tag per character.
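To make "structured output" concrete, here is a minimal sketch of the kind of post-processing I mean. It assumes the XML flavour produced by `pdftohtml -xml`, where each `<text>` element carries `top` and `left` position attributes. The function name `group_rows`, the `tolerance` parameter and the `SAMPLE` input are hypothetical and hand-written for illustration; this is not the output of any of the services above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of `pdftohtml -xml` output (an assumption,
# not real tool output): one <text> element per positioned fragment.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="101" left="200" width="30" height="12" font="0">Price</text>
<text top="120" left="200" width="30" height="12" font="0">1.50</text>
<text top="120" left="50" width="40" height="12" font="0"><b>Apple</b></text>
</page>
</pdf2xml>"""


def group_rows(xml_text, tolerance=2):
    """Group text fragments into rows by their vertical position.

    Fragments whose `top` is within `tolerance` pixels of the first
    fragment in a row are treated as belonging to that row; within a
    row, cells are ordered left to right.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        fragments = []
        for node in page.iter("text"):
            # itertext() also picks up text nested in <b>/<i> children.
            content = "".join(node.itertext()).strip()
            if content:
                fragments.append((int(node.get("top")), int(node.get("left")), content))
        fragments.sort()  # top to bottom, then left to right

        current, current_top = [], None
        for top, left, content in fragments:
            if current_top is None or abs(top - current_top) > tolerance:
                if current:
                    rows.append([cell for _, cell in sorted(current)])
                current, current_top = [], top
            current.append((left, content))
        if current:
            rows.append([cell for _, cell in sorted(current)])
    return rows
```

Calling `group_rows(SAMPLE)` on the sample above yields one list of cells per visual row. Real documents would need more than this (column detection, multi-line cells, merged cells), which is why I would prefer a service that already emits table structure.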
My question: do you know of any other service worth trying that produces structured HTML output suitable for data extraction?
Thanks in advance.