
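For several days I have been trying to get Camelot to work on a specific area of a PDF page, and it keeps baffling me. I have read and tried the documentation's suggestions, some bug reports, and this SO question, all to no avail. I could use some help.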

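I took an example PDF from the documentation (12s0324.pdf) because it contains more than one table. I changed the original command so that it extracts only one of the two tables, from: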

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text=' .\n')

to:

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')

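Notes:

  • I changed the strip_text pattern because the original one removed the spaces between words,
  • I used table_area rather than the documented table_areas, because the former triggered verbose output while the latter raised an error (the error is explained in a linked report; the documentation still seems to be wrong),
  • I tried extracting both tables and checked the respective areas with Camelot's plotting feature, as described in the documentation, so the coordinates should be correct,
  • I also tried table_regions; at least it pulls out one table instead of two, but it is still quite inaccurate (see the comments below).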
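For reference, the Camelot documentation describes each area as a string of the form x1,y1,x2,y2, where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page). A minimal sketch of a sanity check for the coordinates used below; area_string is a hypothetical helper, not part of Camelot:

```python
def area_string(x1, y1, x2, y2):
    """Format one region as an 'x1,y1,x2,y2' string.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    in PDF coordinates where y grows upward from the bottom of the page.
    """
    if not x1 < x2:
        raise ValueError("x1 (left) must be smaller than x2 (right)")
    if not y1 > y2:
        raise ValueError("y1 (top) must be larger than y2 (bottom)")
    return f"{x1},{y1},{x2},{y2}"


# The two areas tried in this question.
top_table = area_string(35, 591, 385, 343)
bottom_table = area_string(33, 297, 386, 65)
print(top_table)     # 35,591,385,343
print(bottom_table)  # 33,297,386,65
```

Both areas pass the check, so the ordering of the coordinates does not look like the problem.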

So here are the results of my trials on the PDF mentioned above:

First: using table_area on the '35,591,385,343' PDF area (top table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,...

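Note that two tables are returned, and that the first one includes unwanted text at both the top and the bottom, which should not be inside the area I selected using plot().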
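The plot() check mentioned above can be sketched roughly as follows. It assumes Camelot and matplotlib are installed; plot_detected_tables is a hypothetical wrapper name, while camelot.read_pdf and camelot.plot with kind='contour' are the calls shown in Camelot's visual debugging documentation.

```python
def plot_detected_tables(pdf_path, page="1"):
    """Return one matplotlib figure per table Camelot detects on the page."""
    # Imported lazily so that defining this function does not require Camelot.
    import camelot

    tables = camelot.read_pdf(pdf_path, flavor="stream", pages=page)
    # kind="contour" draws the boundary of each detected table over the page,
    # which makes it possible to read off coordinates for a selected area.
    return [camelot.plot(table, kind="contour") for table in tables]


# Usage (needs the PDF on disk):
# for figure in plot_detected_tables("12s0324.pdf"):
#     figure.show()
```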

Second: using table_regions on the same '35,591,385,343' PDF area (top table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,... 

Only one table this time, but unwanted text from outside the selected area still shows up.

Third: using table_area on the '33,297,386,65' PDF area (bottom table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,...

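It picked up two tables, and the first one is evidently still the top table. The unwanted text is there as well, but by now that is expected.

Fourth: using table_regions on the '33,297,386,65' PDF area (bottom table)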

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
                                                    0           1          2          3               4              5
0                    Table 325. Arrests by Race: 2009                                                                 
1   [Based on Uniform Crime Reporting (UCR) Progra...                                                                 
2   with a total population of 239,839,971 as esti...                                                                 
3                                                                                              American               
4                                     Offense charged                                    Indian/Alaskan  Asian Pacific
5                                                           Total      White      Black          Native       Islander
6   Total  . . . . .  . .  .  . .  .  . . .  .  . ...  10,690,561  7,389,208  3,027,153         150,544        123,656
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...     456,965    268,346    177,766           5,608          5,245
8     Murder and nonnegligent manslaughter . .. ... .       9,739      4,741      4,801             100             97
9   Forcible rape . . . . . . . .. .. .. .. .... ....      16,362     10,644      5,319             169            230
10  Robbery . . . . .. . . . ... . ... . .... .......     100,496     43,039     55,742             726            989
11  Aggravated assault  . . . . . . . .. .. .........     330,368    209,922    111,904           4,613          3,929
....
34  All other offenses (except traffic) . .. .. .....   2,929,217  1,937,221    911,670          43,880         36,446
35  Suspicion . . .. . . . .. .. .. .. .. .. .. .....       1,513        677        828               1              7
36  Curfew and loitering law violations  . .. ... ...      89,578     54,439     33,207             872          1,060
37  Runaways  . . . . . . . .. .. .. .. .. .. .......      73,616     48,343     19,670           1,653          3,950
38           1 Except forcible rape and prostitution.

Better, but it picks up unwanted text just as above.

I would really appreciate any suggestions or pointers. Thanks in advance!


1 Answer


The table_areas keyword argument (not table_area) works fine and is the one to use (I am on Camelot 0.7.3).

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_areas=['35,591,385,343'], pages = '1')

This returns:

[screenshot of the extracted table, not reproduced here]

which looks correct.
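A plausible explanation for why the misspelled table_area did not raise an error, which I have not verified against Camelot's source: a function that accepts **kwargs silently swallows keywords it does not know. The toy function below is a hypothetical stand-in, not Camelot's code. It assumes unknown keywords are simply dropped, in which case the whole page is parsed and both tables come back.

```python
def read_tables(path, table_areas=None, **kwargs):
    """Hypothetical stand-in for a reader that tolerates unknown keywords."""
    # Anything not named in the signature lands in kwargs and is never used.
    if table_areas is None:
        return f"{path}: parsing the whole page"
    return f"{path}: parsing only {table_areas}"


# Misspelled keyword: accepted without complaint, but ignored.
print(read_tables("12s0324.pdf", table_area=["35,591,385,343"]))
# Correct keyword: the area is actually applied.
print(read_tables("12s0324.pdf", table_areas=["35,591,385,343"]))
```

If that is what happens, it would match the results in the question, where every table_area call returned the same two tables regardless of the coordinates passed.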

Answered 2020-03-02T09:05:10.270