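I've been trying to search and replace some <img> attributes in an HTML file, using information taken from a second HTML file. I'm using lxml through BeautifulSoup, but I'm clearly doing something wrong and can't figure out what.
I tried three different versions. The uncommented one at least doesn't throw an error, but the resulting HTML file is unchanged. I guess it's not making the changes to the soup?
Help, or just some pointers, would be much appreciated.
Edit: thanks to @Tserenjamts I realized that the code does modify the HTML correctly, but instead of changing the original it appends the new version after it. I can live with that if need be (by writing to a different file, or by erasing the original before writing), but it feels unexpected. What am I missing? Should I use replace instead of assigning the new value directly, as in this SO answer?
My code: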
'''
get <img> attribute from ocrd file and apply them to final file
'''
from bs4 import BeautifulSoup as bs

clean = 'ver1.html'
ocrd = 'ver1-before.html'

with open(ocrd, 'r') as source:
    hocr = source.read()
    soup2 = bs(hocr, 'lxml')

with open(clean, 'r+') as final:
    html = final.read()
    soup = bs(html, 'lxml')
    images = soup.find_all('img')
    for i in images:
        img = soup2.find('img', src=i['src'])
        style = img['style']
        path = i['src'][i['src'].find('crops/'):]
        print('new path:', path)  # [-20:]
        print('style:', style)
        print('i', i)
        # third trial
        i['src'] = path
        i['style'] = style
        # second trial
        # i.set('src', path)
        # i.set('style', style)
        # first trial
        # el = soup.find('img', src=i['src'])
        # el.set('src', path)
        # el.set('style', style)
        # print('el:', el)
    final.write(str(soup))
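The appending described in the edit is consistent with how a file opened in 'r+' mode behaves: after read() the file position sits at the end, so a following write() adds text there instead of overwriting. Below is a minimal sketch of that behaviour and of one possible way around it (seek(0), write, truncate()). The temporary file and its contents are made up for illustration and merely stand in for 'ver1.html'.

```python
import os
import tempfile

# Hypothetical stand-in for 'ver1.html'; the content is invented for this demo.
path = os.path.join(tempfile.mkdtemp(), 'demo.html')
with open(path, 'w') as f:
    f.write('<p>old</p>')

# Same pattern as the code above: read, then write on the same handle.
# read() leaves the position at end of file, so the new text is appended.
with open(path, 'r+') as f:
    f.read()
    f.write('<p>new</p>')
with open(path) as f:
    appended = f.read()
print(appended)  # <p>old</p><p>new</p>

# One way to overwrite in place: rewind, write, then cut off any leftover tail.
with open(path, 'w') as f:
    f.write('<p>old and longer</p>')
with open(path, 'r+') as f:
    f.read()
    f.seek(0)
    f.write('<p>new</p>')
    f.truncate()
with open(path) as f:
    replaced = f.read()
print(replaced)  # <p>new</p>
```

The truncate() call matters whenever the new text is shorter than the old one; without it the end of the old content would remain in the file. Reading with mode 'r' and then reopening the file with mode 'w' for the write should work as well.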
Sample of the 'ver1-before.html' file:
<p class="dropzone hidden_finalize" style="display: none;"></p><p class="ocr_par" id="op-0" title="bbox 157 589 579 720" draggable="true" lang="eng">
<span class="ocr_line" id="line_1_1" title="bbox 157 589 557 617; baseline -0.003 -5; x_size 27; x_descenders 5; x_ascenders 6">
<span class="ocrx_word" id="word_1_1" title="bbox 157 589 289 617; x_wconf 96">According</span>
<span class="ocrx_word" id="word_1_2" title="bbox 300 590 324 612; x_wconf 96">to</span>
<span class="ocrx_word" id="word_1_3" title="bbox 335 589 459 612; x_wconf 95">electronic</span>
<span class="ocrx_word" id="word_1_4" title="bbox 470 589 557 612; x_wconf 96">control</span>
</span>
<span class="ocr_line" id="line_1_2" title="bbox 159 625 579 653; baseline 0 -6; x_size 28; x_descenders 6; x_ascenders 6">
<span class="ocrx_word" id="word_1_5" title="bbox 159 625 218 653; x_wconf 96">logic</span>
<span class="ocrx_word" id="word_1_6" title="bbox 228 625 275 648; x_wconf 96">and</span>
<span class="ocrx_word" id="word_1_7" title="bbox 286 625 316 648; x_wconf 96">air</span>
<span class="ocrx_word" id="word_1_8" title="bbox 327 625 439 653; x_wconf 96">recycling</span>
<span class="ocrx_word" id="word_1_9" title="bbox 451 625 579 653; x_wconf 96">principles,</span>
</span>
<span class="ocr_line" id="line_1_3" title="bbox 157 661 568 689; baseline 0 -6; x_size 28; x_descenders 6; x_ascenders 6">
<span class="ocrx_word" id="word_1_10" title="bbox 157 661 198 684; x_wconf 96">the</span>
<span class="ocrx_word" id="word_1_11" title="bbox 207 661 320 689; x_wconf 96">following</span>
<span class="ocrx_word" id="word_1_12" title="bbox 332 661 396 684; x_wconf 96">chart</span>
<span class="ocrx_word" id="word_1_13" title="bbox 407 661 518 689; x_wconf 96">analyses</span>
<span class="ocrx_word" id="word_1_14" title="bbox 528 661 568 683; x_wconf 96">the</span>
</span>
<span class="ocr_line" id="line_1_4" title="bbox 159 698 317 720; baseline 0 0; x_size 28.32962; x_descenders 6.3296204; x_ascenders 5">
<span class="ocrx_word" id="word_1_15" title="bbox 159 698 209 720; x_wconf 96">root</span>
<span class="ocrx_word" id="word_1_16" title="bbox 219 703 317 720; x_wconf 96">causes.</span>
</span>
</p>
<p class="dropzone hidden_finalize" style="display: none;"></p>
<img src="http://127.0.0.1:8000/static/media/giampaolo.ferradini/PDF_version_2.0_12_Pages_wlT4Fq0/crops/1-2.png" style="width: 745px; height: 415px;" class="show_finalize"><p class="image_p hidden_finalize" style="width: 745px; height: 415px; background-image: url("/static/media/giampaolo.ferradini/PDF_version_2.0_12_Pages_wlT4Fq0/crops/1-2.png"); background-size: contain; display: none;" id="op-1" draggable="true"> </p>
<p class="dropzone hidden_finalize" style="display: none;"></p>
<p class="ocr_par" id="op-2" title="bbox 159 1361 527 1389" draggable="true" lang="eng">
<span class="ocr_line" id="line_1_1" title="bbox 159 1361 527 1389; baseline 0 -6; x_size 28; x_descenders 6; x_ascenders 6">
<span class="ocrx_word" id="word_1_1" title="bbox 159 1361 263 1389; x_wconf 96">Priority</span>
<span class="ocrx_word" id="word_1_2" title="bbox 272 1361 388 1389; x_wconf 96">analysis</span>
<span class="ocrx_word" id="word_1_3" title="bbox 398 1367 430 1383; x_wconf 96">as</span>
<span class="ocrx_word" id="word_1_4" title="bbox 441 1361 527 1383; x_wconf 96">below</span>
</span>
</p>
<p class="dropzone hidden_finalize" style="display: none;"></p>
<img src="http://127.0.0.1:8000/static/media/giampaolo.ferradini/PDF_version_2.0_12_Pages_wlT4Fq0/crops/1-4.png" style="width: 746px; height: 449px;" class="show_finalize"><p class="image_p hidden_finalize" style="width: 746px; height: 449px; background-image: url("/static/media/giampaolo.ferradini/PDF_version_2.0_12_Pages_wlT4Fq0/crops/1-4.png"); background-size: contain; display: none;" id="op-3" draggable="true"> </p>
And the 'ver1.html' file:
<p>According to electronic control logic and air recycling principles, the following chart analyses the root causes.</p>
<p><img src="http://127.0.0.1:8000/static/media/giampaolo.ferradini/PDF_version_2.0_12_Pages_wlT4Fq0/crops/1-2.png"></p>
<p>Priority analysis as below</p>
<p><img src="http://127.0.0.1:8000/static/media/giampaolo.ferradini/PDF_version_2.0_12_Pages_wlT4Fq0/crops/1-4.png"></p>
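To check the in-memory part separately from the file handling, here is a small sketch that runs the same lookup on inline strings modeled on the two samples above. The URLs are shortened placeholders, 'html.parser' is used only so the sketch does not need lxml installed, and the two guards (no matching <img>, no 'crops/' in the src) are additions that the original code does not have. As far as I can tell, the .set(...) calls in the commented-out trials belong to the lxml element API rather than to BeautifulSoup tags, which would explain why those trials raised errors.

```python
from bs4 import BeautifulSoup

# Shortened, made-up versions of the two samples; host and paths are placeholders.
ocrd_html = '<img src="http://host/media/crops/1-2.png" style="width: 745px; height: 415px;">'
clean_html = '<p><img src="http://host/media/crops/1-2.png"></p>'

# 'html.parser' is an assumption made so the sketch runs without lxml.
soup2 = BeautifulSoup(ocrd_html, 'html.parser')
soup = BeautifulSoup(clean_html, 'html.parser')

for i in soup.find_all('img'):
    # Look up the <img> with the same src in the OCR'd document.
    match = soup2.find('img', src=i['src'])
    if match is None:
        continue  # no counterpart found: leave this tag untouched
    start = i['src'].find('crops/')
    if start != -1:  # str.find returns -1 when 'crops/' is absent
        i['src'] = i['src'][start:]
    style = match.get('style')
    if style is not None:
        i['style'] = style  # item assignment changes the tag inside the soup

result = str(soup)
print(result)
```

If the printed markup shows the shortened src and the copied style, the soup is being modified as intended, and what remains is how the result is written back to disk.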