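I'm using Tesseract OCR to generate a text file from an image. When it finishes, the output file contains all the text Tesseract could extract from the image, along with the x, y coordinates of that text.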

Here is a small sample of the output file:

  <p class='ocr_par' dir='ltr' id='par_14' title="bbox 65 1198 2904 1245">
 <span class='ocr_line' id='line_17' title="bbox 65 1198 2904 1245"><span class='ocrx_word' id='word_286' title="bbox 65 1200 287 1237">Tormented</span> <span class='ocrx_word' id='word_287' title="bbox 307 1200 391 1237">Soul</span> <span class='ocrx_word' id='word_288' title="bbox 659 1203 682 1237">2</span> <span class='ocrx_word' id='word_289' title="bbox 805 1203 828 1237">2</span> <span class='ocrx_word' id='word_290' title="bbox 953 1203 1133 1237">Common</span> <span class='ocrx_word' id='word_291' title="bbox 1247 1203 1331 1237"><strong>M13</strong></span> <span class='ocrx_word' id='word_292' title="bbox 1484 1200 1651 1245"><strong>111/249</strong></span> <span class='ocrx_word' id='word_293' title="bbox 1690 1198 1729 1237">9</span> <span class='ocrx_word' id='word_294' title="bbox 1922 1200 2143 1237">Tormented</span> <span class='ocrx_word' id='word_295' title="bbox 2164 1200 2248 1237">Soul</span> <span class='ocrx_word' id='word_296' title="bbox 2268 1200 2365 1237">can&#39;t</span> <span class='ocrx_word' id='word_297' title="bbox 2383 1200 2489 1237">block</span> <span class='ocrx_word' id='word_298' title="bbox 2508 1200 2581 1237">and</span> <span class='ocrx_word' id='word_299' title="bbox 2604 1203 2630 1237">is</span> <span class='ocrx_word' id='word_300' title="bbox 2651 1200 2904 1237">unblockable.</span> 
 </span>
</p>

<p class='ocr_par' dir='ltr' id='par_15' title="bbox 65 1263 4323 1312">
 <span class='ocr_line' id='line_18' title="bbox 65 1263 4323 1312"><span class='ocrx_word' id='word_301' title="bbox 65 1265 229 1302">Veilborn</span> <span class='ocrx_word' id='word_302' title="bbox 250 1265 365 1302">Ghoul</span> <span class='ocrx_word' id='word_303' title="bbox 659 1268 682 1302">2</span> <span class='ocrx_word' id='word_304' title="bbox 805 1268 828 1302">2</span> <span class='ocrx_word' id='word_305' title="bbox 953 1268 1182 1302">Uncommon</span> <span class='ocrx_word' id='word_306' title="bbox 1247 1268 1331 1302"><strong>M13</strong></span> <span class='ocrx_word' id='word_307' title="bbox 1484 1265 1651 1310"><strong>114/249</strong></span> <span class='ocrx_word' id='word_308' title="bbox 1690 1263 1771 1302">09</span> <span class='ocrx_word' id='word_309' title="bbox 1922 1265 2086 1302">Veilborn</span> <span class='ocrx_word' id='word_310' title="bbox 2107 1265 2222 1302">Ghoul</span> <span class='ocrx_word' id='word_311' title="bbox 2242 1265 2339 1302">can&#39;t</span> <span class='ocrx_word' id='word_312' title="bbox 2357 1265 2677 1302">b|ock.Whenever</span> <span class='ocrx_word' id='word_313' title="bbox 2698 1276 2719 1302">a</span> <span class='ocrx_word' id='word_314' title="bbox 2742 1268 2886 1312">Swamp</span> <span class='ocrx_word' id='word_315' title="bbox 2906 1268 3029 1302">enters</span> <span class='ocrx_word' id='word_316' title="bbox 3047 1265 3110 1302">the</span> <span class='ocrx_word' id='word_317' title="bbox 3130 1265 3328 1302">battlefield</span> <span class='ocrx_word' id='word_318' title="bbox 3349 1265 3464 1302">under</span> <span class='ocrx_word' id='word_319' title="bbox 3484 1276 3573 1312">your</span> <span class='ocrx_word' id='word_320' title="bbox 3594 1265 3747 1310">control,</span> <span class='ocrx_word' id='word_321' title="bbox 3766 1276 3839 1312">you</span> <span class='ocrx_word' id='word_322' title="bbox 3857 1276 3940 1312">may</span> <span class='ocrx_word' id='word_323' title="bbox 
3961 1268 4081 1302">return</span> <span class='ocrx_word' id='word_324' title="bbox 4102 1265 4266 1302">Veilborn</span> <span class='ocrx_word' id='word_325' title="bbox 4289 1265 4323 1302">GI</span> 
 </span>
</p>

There are two keywords here: one is Tormented Soul and the other is Veilborn Ghoul.

I'm trying to get Python to open the output file, search for Tormented Soul, and return its x, y coordinates, which in this case are 65 1200 287 1237.

Thanks in advance for your help. I'm new to Python.

1 Answer

I suggest using pyquery for the parsing:

from pyquery import PyQuery as pq

html = '''  <p class='ocr_par' dir='ltr' id='par_14' title="bbox 65 1198 2904 1245">
 <span class='ocr_line' id='line_17' title="bbox 65 1198 2904 1245"><span class='ocrx_word' id='word_286' title="bbox 65 1200 287 1237">Tormented</span> <span class='ocrx_word' id='word_287' title="bbox 307 1200 391 1237">Soul</span> <span class='ocrx_word' id='word_288' title="bbox 659 1203 682 1237">2</span> <span class='ocrx_word' id='word_289' title="bbox 805 1203 828 1237">2</span> <span class='ocrx_word' id='word_290' title="bbox 953 1203 1133 1237">Common</span> <span class='ocrx_word' id='word_291' title="bbox 1247 1203 1331 1237"><strong>M13</strong></span> <span class='ocrx_word' id='word_292' title="bbox 1484 1200 1651 1245"><strong>111/249</strong></span> <span class='ocrx_word' id='word_293' title="bbox 1690 1198 1729 1237">9</span> <span class='ocrx_word' id='word_294' title="bbox 1922 1200 2143 1237">Tormented</span> <span class='ocrx_word' id='word_295' title="bbox 2164 1200 2248 1237">Soul</span> <span class='ocrx_word' id='word_296' title="bbox 2268 1200 2365 1237">can&#39;t</span> <span class='ocrx_word' id='word_297' title="bbox 2383 1200 2489 1237">block</span> <span class='ocrx_word' id='word_298' title="bbox 2508 1200 2581 1237">and</span> <span class='ocrx_word' id='word_299' title="bbox 2604 1203 2630 1237">is</span> <span class='ocrx_word' id='word_300' title="bbox 2651 1200 2904 1237">unblockable.</span> 
 </span>
</p>

<p class='ocr_par' dir='ltr' id='par_15' title="bbox 65 1263 4323 1312">
 <span class='ocr_line' id='line_18' title="bbox 65 1263 4323 1312"><span class='ocrx_word' id='word_301' title="bbox 65 1265 229 1302">Veilborn</span> <span class='ocrx_word' id='word_302' title="bbox 250 1265 365 1302">Ghoul</span> <span class='ocrx_word' id='word_303' title="bbox 659 1268 682 1302">2</span> <span class='ocrx_word' id='word_304' title="bbox 805 1268 828 1302">2</span> <span class='ocrx_word' id='word_305' title="bbox 953 1268 1182 1302">Uncommon</span> <span class='ocrx_word' id='word_306' title="bbox 1247 1268 1331 1302"><strong>M13</strong></span> <span class='ocrx_word' id='word_307' title="bbox 1484 1265 1651 1310"><strong>114/249</strong></span> <span class='ocrx_word' id='word_308' title="bbox 1690 1263 1771 1302">09</span> <span class='ocrx_word' id='word_309' title="bbox 1922 1265 2086 1302">Veilborn</span> <span class='ocrx_word' id='word_310' title="bbox 2107 1265 2222 1302">Ghoul</span> <span class='ocrx_word' id='word_311' title="bbox 2242 1265 2339 1302">can&#39;t</span> <span class='ocrx_word' id='word_312' title="bbox 2357 1265 2677 1302">b|ock.Whenever</span> <span class='ocrx_word' id='word_313' title="bbox 2698 1276 2719 1302">a</span> <span class='ocrx_word' id='word_314' title="bbox 2742 1268 2886 1312">Swamp</span> <span class='ocrx_word' id='word_315' title="bbox 2906 1268 3029 1302">enters</span> <span class='ocrx_word' id='word_316' title="bbox 3047 1265 3110 1302">the</span> <span class='ocrx_word' id='word_317' title="bbox 3130 1265 3328 1302">battlefield</span> <span class='ocrx_word' id='word_318' title="bbox 3349 1265 3464 1302">under</span> <span class='ocrx_word' id='word_319' title="bbox 3484 1276 3573 1312">your</span> <span class='ocrx_word' id='word_320' title="bbox 3594 1265 3747 1310">control,</span> <span class='ocrx_word' id='word_321' title="bbox 3766 1276 3839 1312">you</span> <span class='ocrx_word' id='word_322' title="bbox 3857 1276 3940 1312">may</span> <span class='ocrx_word' id='word_323' title="bbox 
3961 1268 4081 1302">return</span> <span class='ocrx_word' id='word_324' title="bbox 4102 1265 4266 1302">Veilborn</span> <span class='ocrx_word' id='word_325' title="bbox 4289 1265 4323 1302">GI</span> 
 </span>
</p>'''
d = pq(html)

Then you can do this:

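keyword = 'Tormented Soul'
# The title looks like "bbox x0 y0 x1 y1"; drop "bbox" and convert the rest to ints.
# list(...) is needed on Python 3, where map() returns a lazy iterator.
coords = lambda line: list(map(int, line('.ocrx_word').eq(0).attr('title').split()[1:]))
# Take the first OCR line whose text starts with the keyword.
result = [coords(line) for line in d('.ocr_line').items() if line.text().startswith(keyword)][0]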

Result:

[65, 1200, 287, 1237]
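
Note that the result above is the box of the first word only ("Tormented"). If you want one box covering the whole keyword, here is a minimal sketch. It assumes that some hOCR files append extra properties after a semicolon (for example `bbox 65 1200 287 1237; x_wconf 93`), so it parses the `title` string with a regular expression instead of a plain `split()`. The names `parse_bbox` and `merge_boxes` are hypothetical helpers, not part of pyquery, and the `x_wconf` value is made up for illustration:

```python
import re

# Matches "bbox x0 y0 x1 y1" anywhere in an hOCR title attribute.
# \s+ also tolerates a line break inside the attribute value.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

def parse_bbox(title):
    """Return [x0, y0, x1, y1] from an hOCR title string, or None if absent."""
    match = BBOX_RE.search(title)
    return [int(n) for n in match.groups()] if match else None

def merge_boxes(boxes):
    """Return the smallest box that contains every box in `boxes`."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return [min(x0s), min(y0s), max(x1s), max(y1s)]

first = parse_bbox("bbox 65 1200 287 1237")                 # "Tormented"
second = parse_bbox("bbox 307 1200 391 1237; x_wconf 93")   # "Soul", confidence value invented
print(merge_boxes([first, second]))  # [65, 1200, 391, 1237]
```

The merged box simply takes the minimum of the top-left corners and the maximum of the bottom-right corners, which is enough when the words sit on the same line.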

Edit: you can install pyquery with the following command:

pip install pyquery

If you need to read from a file, use:

d = pq(filename=path_to_html_file)
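
If installing pyquery is not an option, roughly the same lookup can be sketched with only the standard library. This is an alternative technique, not the pyquery approach above: it uses `html.parser.HTMLParser` to collect every `ocrx_word` span and then looks for the first run of consecutive words that spells the keyword. `WordCollector` and `find_keyword` are hypothetical names, and the sketch assumes word spans are never nested inside one another:

```python
from html.parser import HTMLParser

class WordCollector(HTMLParser):
    """Collect (text, title) pairs for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []      # finished (text, title) pairs, in document order
        self._title = None   # title of the word span currently open, if any
        self._text = []      # text chunks seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            self._title, self._text = attrs.get("title", ""), []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Nested tags such as <strong> close first; only </span> ends a word.
        if tag == "span" and self._title is not None:
            self.words.append(("".join(self._text), self._title))
            self._title = None

def find_keyword(html, keyword):
    """Return the title strings of the first run of words spelling `keyword`."""
    parser = WordCollector()
    parser.feed(html)
    parser.close()
    target = keyword.split()
    texts = [text for text, _ in parser.words]
    for i in range(len(texts) - len(target) + 1):
        if texts[i:i + len(target)] == target:
            return [title for _, title in parser.words[i:i + len(target)]]
    return None

sample = ("<span class='ocr_line' title=\"bbox 65 1198 2904 1245\">"
          "<span class='ocrx_word' title=\"bbox 65 1200 287 1237\">Tormented</span> "
          "<span class='ocrx_word' title=\"bbox 307 1200 391 1237\">Soul</span> "
          "<span class='ocrx_word' title=\"bbox 659 1203 682 1237\">2</span></span>")
print(find_keyword(sample, "Tormented Soul"))
# ['bbox 65 1200 287 1237', 'bbox 307 1200 391 1237']
```

To run it on the OCR output, read the file into a string first, for example `open(path_to_html_file, encoding="utf-8").read()`, and pass that as `html`. Unlike the `startswith` check, this also finds the keyword when it appears in the middle of a line.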
answered 2013-04-22T01:08:35.173