I've used a number of PDF-to-text methods to extract text from PDF documents. For one particular type of PDF I have, neither pyPDF nor pdfMiner does a good job of extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.

I discovered that the pdf I'm using has some transparent text in it, and it is getting merged into the other text.

Some examples of the blocks of text I get back are:

12324  35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12          Corrective             Object of Corrective                                                                                                                   
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position  
            C HAActRionT    N  Y  -NJ   - S A  N  D Y    H OO    K  ATcO tionLI T TLE EGG HARBOR.  Page/Side: N/A 
(Temp) indicates that the chart correction action is temporary in nature.  Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.       
 Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d  B Theuoy  5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W 
to     40-24-48.585N 074-00-05.967W 

and

12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W

and I have discovered that the "ghost text" is ALWAYS the following:

 Corrective             Object of Corrective              Position
    Action                         Action

(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.

In the 2nd example I posted, the text I want (with the ghost text removed) is:

12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A 
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W

This problem occurs just once per document, and does not appear to be totally consistent (as seen above). I am wondering if one of you wizards could think of a way to remove the ghosted text (I don't need/want it) using python. If I had been using pyPDF, I would have used a regex to rip it out during the conversion to text. Unfortunately, since I'm starting out with a text file from the website listed above, the damage has already been done. I'm at a bit of a loss.

Thanks for reading.

EDIT: The solution to this problem looks like it will be more complex than the rest of the application, so I'm going to withdraw my request for help.

I very much appreciate the thought put into it by those who have contributed.

2 Answers

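Given that the ghost text can be split up in seemingly unpredictable ways, I don't think there is a simple automatic way of removing it that would not produce false positives. What you need is almost human-level pattern recognition. :-)

What you could try is to exploit the format of these messages. Roughly: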

<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
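
As a rough illustration, the first line of that format could be matched with a regular expression. This is only a sketch: the group names (`chart`, `edition`, `date`, `last_lnm`, `datum`, `notice`) are made up here, and accepting `st`/`th` as well as `rd`/`nd` is an assumption based on the "35th Ed." sample in the question.

```python
import re

# Hypothetical pattern for the header line; group names are illustrative only.
HEADER_RE = re.compile(
    r"(?P<chart>\d+)\s+"
    r"(?P<edition>\d+(?:st|nd|rd|th))\s+Ed\.\s+"
    r"(?P<date>\d{2}-[A-Za-z]{3}-\d{2})\s+"
    r"Last LNM:\s*(?P<last_lnm>\d{2}/\d{2})\s+"
    r"NAD\s+(?P<datum>\d+)\s+(?P<notice>\d{2}/\d{2})"
)

def parse_header(text):
    """Return the header fields as a dict, or None if no header is found."""
    match = HEADER_RE.search(text)
    return match.groupdict() if match else None

sample = "12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . ."
print(parse_header(sample))
```

Because the header fields appear intact in both samples from the question, they can be pulled out before dealing with the interleaved part of the text.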

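Using this format, you can separate the garbage from the predictable elements. Then, if you have a list of chart names (e.g. "Shinnecock Bay to East Rockaway Inlet") and of descriptive words (like "State", "Boat", "Daybeacon"), you may be able to reconstruct the original words by finding the minimum Levenshtein distance between the damaged words in the two text blocks and the words in your lists.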
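
A minimal sketch of the Levenshtein idea, using only the standard library. The function names and the word list are hypothetical; a real word list would have to come from your own chart data.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def closest_word(damaged, vocabulary):
    """Return the vocabulary word with the smallest edit distance to damaged."""
    return min(vocabulary, key=lambda w: levenshtein(damaged.lower(), w.lower()))

# Illustrative word list only.
words = ["State", "Boat", "Channel", "Daybeacon"]
print(closest_word("Daybeseacon", words))
```

This only helps when the damaged token is still closer to the right word than to any other entry, so short words and heavily interleaved tokens will need manual checking.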

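If you can install the poppler software, you could try using pdftotext with the -layout option to keep the formatting of the original PDF as far as possible. That might make your problem disappear.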
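
If you want to drive pdftotext from Python, something along these lines should work. It is a sketch that assumes the `pdftotext` binary from poppler is on your PATH; the helper names are made up for illustration.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path):
    """Build the command line; the trailing "-" sends the text to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]

def pdf_to_text_layout(pdf_path):
    """Run pdftotext -layout on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler) was not found on PATH")
    result = subprocess.run(pdftotext_command(pdf_path),
                            check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Whether the layout mode actually keeps the ghost text apart from the real text depends on how the PDF positions it, so try it on one document first.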

answered 2012-04-22T17:14:32.510

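You could recursively find all possible ways in which your pattern "Corrective Object of Corrective Position Action..." can be contained within your mangled text.

Then you can untangle the text for each possible path, run some kind of spell check on the results, and choose the one with the fewest spelling mistakes. Alternatively, since you know roughly where each substring should appear, you can use that as a heuristic. Or you could simply use the first path.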
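
The "fewest spelling mistakes" selection could look roughly like this. It is a sketch in which a plain set of known words stands in for a real spell checker; the function names and the vocabulary are assumptions.

```python
import re

def count_unknown_words(text, vocabulary):
    """Count alphabetic tokens in text that are not in vocabulary (lowercase set)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return sum(1 for token in tokens if token.lower() not in vocabulary)

def pick_best(candidates, vocabulary):
    """Return the candidate text with the fewest unknown words."""
    return min(candidates, key=lambda c: count_unknown_words(c, vocabulary))

# Illustrative vocabulary and candidates only.
known = {"delete", "state", "boat", "channel", "daybeacon"}
candidates = ["DELETE Stat Boat Chanel Daybeacon",
              "DELETE State Boat Channel Daybeacon"]
print(pick_best(candidates, known))
```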

Some pseudocode (untested):

 def findAllOccurrences(text, letter, start=0):
     # all indices >= start in text that contain letter
     return [i for i in range(start, len(text)) if text[i] == letter]

 def findPaths(mangledText, pattern, path, start=0):
     if len(pattern) == 0:  # end of pattern
         return [path]
     nextLetter = pattern[0]
     locations = findAllOccurrences(mangledText, nextLetter, start)
     allPaths = []
     for loc in locations:
         # search on after loc; indices stay relative to the full mangledText
         paths = findPaths(mangledText, pattern[1:], path + (loc,), loc + 1)
         allPaths.extend(paths)
     return allPaths  # empty if no location for the next letter exists

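You would then call it like this (optionally removing all spaces from the search pattern, unless you are sure they are all contained in the mangled text):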

  allPossiblePaths = findPaths ( YourMangledText, "Corrective Object...", () )

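allPossiblePaths should then contain a list of all possible ways in which your pattern can be contained in your mangled text. Each entry is a tuple with the same length as the pattern, holding the index in the search text at which the corresponding letter of the pattern occurs.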
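
Once you have picked a path, untangling amounts to dropping the characters at those indices. A small sketch, where `remove_indices` is a made-up helper and the path is written by hand for a fragment of the question's sample:

```python
def remove_indices(text, path):
    """Return text without the characters at the positions listed in path."""
    drop = set(path)
    return "".join(ch for i, ch in enumerate(text) if i not in drop)

mangled = "givCGenD 0in 1"
# Hand-picked positions of the ghost letters g, i, v, e, n, i, n.
ghost_path = (0, 1, 2, 5, 6, 10, 11)
print(remove_indices(mangled, ghost_path))
```

Keep in mind that the number of candidate paths can grow very quickly with the length of the pattern, so in practice you may want to search for short pieces of the ghost text one at a time.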

answered 2012-04-22T20:02:54.290