
I have a large number of PDFs in different formats. Among other things, I need to extract their titles (not the document name, but a title in the text). Because of the range of formats, the titles are not in the same location in every PDF. Further, some of the PDFs are actually scanned images, so I need to use OCR (Optical Character Recognition) on them. The titles are sometimes one line, sometimes two, and they do not tend to share the same set of words. In the range of physical locations where titles usually show up there are often other words (i.e. if doc 1 has title 1 at x1, y1, doc 2 might have title 2 at x2, y2 but have other non-title text at x1, y1). There are also some very rare cases where the PDFs don't have a title.

So far I can use pdftotext to extract the text within a given bounding box and convert it to a text file. If there's a title, this lets me capture it, but often with extraneous words included. It also only works on non-image PDFs. I'm wondering a) whether there's a good way to identify the title from among all the words I extract for a document (because there are often extraneous words), ideally with a good way to identify that no title exists, and b) whether there are any tools equivalent to pdftotext that also work on scanned images (I do have an OCR script working, but it runs OCR over an entire image rather than a section of one).

One method that somewhat answers the title dilemma is to extract the words in the bounding box, use the rest of the document to identify which of the bounding box words are keywords for the document, and construct the title from the keywords. This wouldn't extract the actual title, but may give words that could construct a reasonable alternative. I'm already extracting keywords for other parts of the project, but I would definitely prefer to extract the actual title as people may be using the verbatim title for lookup purposes.

Further note, in case it wasn't clear: I'm trying to do this programmatically with open-source/free tools, ideally in Python, and I will have a large number of documents (10,000+).


2 Answers


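You can use the font size of the words to pick out the title words. Based on my understanding of your question, here is what I would suggest for extracting them: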

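Convert the PDF document to an image using any open-source module (e.g. pdf2image), then run OCR on it with tesseract. From the OCR output you get the text along with its dimension information, i.e. the width and height of each individual word.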

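Do some statistical analysis (a histogram) on the word heights and see whether the height distribution can be used to identify the title words. You can use a fixed threshold based on heuristics, or an adaptive threshold derived from the height distribution, and use that threshold to identify the title words.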
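A minimal sketch of the adaptive-threshold idea, assuming the words arrive as `(text, height)` pairs. The helper names `words_from_ocr` and `pick_title_words`, and the `ratio=1.5` cutoff, are illustrative assumptions rather than tested values. The OCR helper relies on `pytesseract.image_to_data`, which reports a bounding-box height per word; that height depends on ascenders and descenders, so it is only a noisy proxy for font size.

```python
from statistics import median


def words_from_ocr(image):
    """Run Tesseract on a PIL image and return (text, height) pairs.

    Needs pytesseract and a Tesseract install; imported lazily so the
    thresholding logic below can be used without them.
    """
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    # Tesseract also emits empty entries for blocks and lines; skip those.
    return [(t, h) for t, h in zip(data["text"], data["height"]) if t.strip()]


def pick_title_words(words, ratio=1.5):
    """Return the words whose height is at least `ratio` times the median.

    `ratio` is an assumed starting point; inspect a histogram of heights
    on your own documents before settling on a value.
    """
    if not words:
        return []
    typical = median(h for _, h in words)
    return [t for t, h in words if h >= ratio * typical]


# Hypothetical page: three tall words followed by body text.
page = [("Deep", 40), ("Title", 42), ("Extraction", 41),
        ("this", 20), ("is", 18), ("body", 22), ("text", 20),
        ("here", 19), ("and", 21), ("more", 20)]
title_words = pick_title_words(page)
```

An empty result can be read as "no title found", which covers the rare title-less documents mentioned in the question, though a page set entirely in large type would also come back empty.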

Answered 2019-03-27T13:30:53.813

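For anyone who comes across this question later, here is a quick update on what I decided to do (although I haven't tested the accuracy yet, so I don't know whether this approach is actually useful).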

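The overall approach I'll be using is machine learning with a neural network (I'll report the accuracy once I have it). I take the first 200 words of each document and generate n-grams of 4-20 consecutive words (so ~16*200 n-grams; 4 because none of my titles are shorter, 20 because none are longer). I then generate a feature vector from each n-gram. Some of the features I chose are specific to my texts, but others are more general, such as "Is the first letter of the n-gram's first word capitalized?". Since I know the correct titles, I can turn them into equivalent vectors. So if vec(n_gram) = vec(correct_title) the output is 1, otherwise 0. I'm using this to train the ML model.

For now this doesn't solve the problem of my scanned-image PDFs unless they are first converted to text files. It also assumes that word order within the title is preserved when the PDF is converted to n-grams. I have noticed that the order of non-title words isn't always preserved by the conversion, but this is a very rare problem and seems to happen only when there is a line break and an entire line then doesn't fit (so hopefully it shouldn't affect titles).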
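A rough sketch of how the candidate n-grams and training labels could be built. The function names and the three features are illustrative assumptions, not the feature set actually used. For simplicity this sketch labels a candidate by comparing its words to the known title directly, rather than comparing feature vectors as described above.

```python
def candidate_ngrams(words, min_n=4, max_n=20, limit=200):
    """All runs of min_n..max_n consecutive words from the first `limit` words."""
    words = words[:limit]
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(words[start:start + n])
    return grams


def features(ngram):
    """A few generic features; real features would be tuned to the corpus."""
    return [
        len(ngram),                                          # length in words
        int(ngram[0][:1].isupper()),                         # first word capitalized?
        sum(w[:1].isupper() for w in ngram) / len(ngram),    # share of capitalized words
    ]


def labelled_examples(words, title):
    """Pair each candidate's feature vector with 1 if it is the title, else 0."""
    title_words = title.split()
    return [(features(g), int(g == title_words)) for g in candidate_ngrams(words)]


# Hypothetical document text and its known title.
doc_words = "Annual Report A Study of Title Extraction in PDFs by someone".split()
examples = labelled_examples(doc_words, "A Study of Title Extraction")
```

Each document yields at most one positive example among many negatives, so the classes are heavily imbalanced; that is worth keeping in mind when choosing a loss function or sampling scheme for the model.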

Answered 2019-03-22T18:07:39.513