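For several days I have been trying to get Camelot to work on a specific area of a PDF page, and it keeps baffling me. I have looked at and tried the documentation's suggestions, several bug reports, and this SO question, all to no avail. I could use some help.
I took an example from the documentation because it contains more than one table, this one. I modified the original command to extract only one of the two tables, from: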
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text=' .\n')
to:
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
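However:
- I changed the strip_text pattern, because the original one (' .\n') also removed the spaces between words,
- I used table_area instead of the documented table_areas, because the former only triggered a verbose message whereas the latter raised an error (the error is explained here; the documentation still seems to be wrong),
- I tried extracting both tables and checked each area with Camelot's plotting feature, as described in the documentation here, so the areas should be correct,
- I also tried table_regions; at least it pulls out one table instead of two, but the result is still rather inaccurate (see the comments below).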
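For reference, this is how I build the area strings. It is only a minimal sketch: area_string is a hypothetical helper of mine, not part of Camelot, and it assumes the convention I understand the documentation to describe, namely that (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinates, with y growing upward from the bottom of the page.

```python
def area_string(x1, y1, x2, y2):
    """Format a bounding box as the 'x1,y1,x2,y2' string used in the calls here.

    Assumption: (x1, y1) is the top-left corner and (x2, y2) the bottom-right
    corner in PDF coordinates, where y grows upward from the page bottom.
    """
    if not x1 < x2:
        raise ValueError(f"expected x1 < x2, got {x1} and {x2}")
    if not y1 > y2:
        raise ValueError(f"expected y1 > y2 (top above bottom), got {y1} and {y2}")
    return f"{x1},{y1},{x2},{y2}"


# The two areas used in the trials in this post
top_table = area_string(35, 591, 385, 343)
bottom_table = area_string(33, 297, 386, 65)
print(top_table)     # 35,591,385,343
print(bottom_table)  # 33,297,386,65
```

Both strings pass the ordering check, so if the assumed convention is right, the corners themselves are not swapped.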
So here are the results of my trials on the PDF mentioned above:
First: using table_area on the '35,591,385,343' PDF area (top table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
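Note that two tables are returned, and that the first one includes unwanted text at both the top and the bottom, text that should not be inside the area selected using plot().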
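To make "should not be inside the area" concrete, here is a rough sketch of the containment check I have in mind. parse_area and inside_area are hypothetical helpers, not Camelot functions, the corner convention (top-left, bottom-right, y growing upward) is my assumption, and the sample points are invented for illustration rather than measured from the PDF.

```python
def parse_area(area):
    """Split an 'x1,y1,x2,y2' string into four floats."""
    x1, y1, x2, y2 = (float(v) for v in area.split(","))
    return x1, y1, x2, y2


def inside_area(point, area):
    """Return True if the point (x, y) lies within the area string.

    Assumption: the area is given as top-left / bottom-right corners in
    PDF coordinates, so y1 is the upper edge and y2 the lower edge.
    """
    x, y = point
    x1, y1, x2, y2 = parse_area(area)
    return x1 <= x <= x2 and y2 <= y <= y1


top_area = "35,591,385,343"
# Invented coordinates, for illustration only
print(inside_area((200, 450), top_area))  # True
print(inside_area((200, 620), top_area))  # False, above the upper edge
```

Text positioned like the second point is what I would expect to be excluded from the result.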
Second: using table_regions on the same '35,591,385,343' PDF area (top table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
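Now there is clearly only one table, but the problem of unwanted text from outside the selected area remains.
Third: using table_area on the '33,297,386,65' PDF area (bottom table)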
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
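It picked up two tables, and the first one is evidently still the top table. The unwanted text is there as well, but by now that is expected.
Fourth: using table_regions on the '33,297,386,65' PDF area (bottom table)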
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
0 1 2 3 4 5
0 Table 325. Arrests by Race: 2009
1 [Based on Uniform Crime Reporting (UCR) Progra...
2 with a total population of 239,839,971 as esti...
3 American
4 Offense charged Indian/Alaskan Asian Pacific
5 Total White Black Native Islander
6 Total . . . . . . . . . . . . . . . . ... 10,690,561 7,389,208 3,027,153 150,544 123,656
7 Violent crime . . . . . . . . . . . ... 456,965 268,346 177,766 5,608 5,245
8 Murder and nonnegligent manslaughter . .. ... . 9,739 4,741 4,801 100 97
9 Forcible rape . . . . . . . .. .. .. .. .... .... 16,362 10,644 5,319 169 230
10 Robbery . . . . .. . . . ... . ... . .... ....... 100,496 43,039 55,742 726 989
11 Aggravated assault . . . . . . . .. .. ......... 330,368 209,922 111,904 4,613 3,929
....
34 All other offenses (except traffic) . .. .. ..... 2,929,217 1,937,221 911,670 43,880 36,446
35 Suspicion . . .. . . . .. .. .. .. .. .. .. ..... 1,513 677 828 1 7
36 Curfew and loitering law violations . .. ... ... 89,578 54,439 33,207 872 1,060
37 Runaways . . . . . . . .. .. .. .. .. .. ....... 73,616 48,343 19,670 1,653 3,950
38 1 Except forcible rape and prostitution.
Better, but it still picks up unwanted text, as above.
I would really appreciate any suggestions or pointers. Thanks in advance!
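As a fallback I could strip the stray rows after extraction. The sketch below is a workaround under assumptions, not a fix for the area selection: trim_edge_text_rows is a hypothetical helper, it expects the rows as plain lists of cell strings (for example from tables[0].df.values.tolist()), and the sample rows are made up to mimic the shape of the output above.

```python
def trim_edge_text_rows(rows, min_cells=2):
    """Drop leading and trailing rows with fewer than min_cells non-empty cells.

    Rows in the middle of the table are kept even when they have a single
    filled cell, since wrapped row labels look exactly like that.
    """
    def is_table_row(row):
        return sum(1 for cell in row if str(cell).strip()) >= min_cells

    start = 0
    while start < len(rows) and not is_table_row(rows[start]):
        start += 1
    end = len(rows)
    while end > start and not is_table_row(rows[end - 1]):
        end -= 1
    return rows[start:end]


# Made-up rows that mimic the shape of the extracted output
rows = [
    ["Program. Represents arrests reported", "", ""],
    ["Offense charged", "Total", "Under 18"],
    ["Murder and nonnegligent", "", ""],
    ["manslaughter", "10.0", "0.9"],
    ["– Represents zero.", "", ""],
]
trimmed = trim_edge_text_rows(rows)
print(len(trimmed))  # 3
```

The obvious drawback is that a header fragment occupying a single cell at the edge of the table would be dropped as well, so I would much rather get the area selection itself to work.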