Example of fragments with the same hierarchical structure:
(1)
    <div>
      <span>It's a message</span>
    </div>
(2)
    <div>
      <span class='bold'>This is a new text</span>
    </div>
Example of fragments with different structures:
(1)
    <div>
      <span><b>It's a message</b></span>
    </div>
(2)
    <div>
      <span>This is a new text</span>
    </div>
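So fragments with the same structure correspond to the same hierarchical tree: the same tag names in the same nesting. Attributes and text content do not matter.
How can I detect whether 2 elements (HTML fragments) have the same structure, using only lxml?
For some harder cases (harder than the examples above), I have a function that does not work correctly: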
    def _is_equal( el1, el2 ):
        # input: 2 elements with possible equal structure and tag names
        # e.g. root = lxml.html.fromstring( buf )
        #      el1 = root[ 0 ]
        #      el2 = root[ 1 ]
        # move from top to bottom, compare elements
        result = False
        if el1.tag == el2.tag:
            # has no children
            if len( el1 ) == len( el2 ):
                if len( el1 ) == 0:
                    return True
                else:
                    # iterate one of them, for example el1
                    i = 0
                    for child1 in el1:
                        child2 = el2[ i ]
                        is_equal2 = _is_equal( child1, child2 )
                        if not is_equal2:
                            return False
                    return True
            else:
                return False
        else:
            return False
The _is_equal() function above fails to detect that the 2 divs with class='tovar2' have the same structure:
    <body>
      <div class="tovar2">
        <h2 class="new">
          <a href="http://modnyedeti-krsk.ru/magazin/product/333193003">
            Куртка д/д
          </a>
        </h2>
        <ul class="art">
          <li>
            Артикул: <span>1759</span>
          </li>
        </ul>
        <div>
          <div class="wrap" style="width:180px;">
            <div class="new">
              <img src="shop_files/new-t.png" alt="">
            </div>
            <a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)">
              <img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="Куртка д/д" height="160" width="180">
            </a>
          </div>
        </div>
        <form action="" onsubmit="return addProductForm(17094601,333193003,3150.00,this,false);">
          <ul class="bott ">
            <li class="price">Цена:<br>
              <span>
                <b>
                  3 150
                </b> руб.
              </span>
            </li>
            <li class="amount">Кол-во:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
            </li>
            <li class="buy"><input value="" type="submit">
            </li>
          </ul>
        </form>
      </div>
      <div class="tovar2">
        <h2 class="new">
          <a href="http://modnyedeti-krsk.ru/magazin/product/333124803">Куртка д/д</a>
        </h2>
        <ul class="art">
          <li>
            Артикул: <span>1759</span>
          </li>
        </ul>
        <div>
          <div class="wrap" style="width:180px;">
            <div class="new">
              <img src="shop_files/new-t.png" alt="">
            </div>
            <a class="highslide" href="http://modnyedeti-krsk.ru/d/459730/d/820.jpg" onclick="return hs.expand(this)">
              <img src="shop_files/fr_5.gif" style="background:url(/d/459730/d/548470803_5.jpg) 50% 50% no-repeat scroll;" alt="Куртка д/д" height="160" width="180">
            </a>
          </div>
        </div>
        <form action="" onsubmit="return addProductForm(17094601,333124803,3150.00,this,false);">
          <ul class="bott ">
            <li class="price">Цена:<br>
              <span>
                <b>3 150</b> руб.
              </span>
            </li>
            <li class="amount">Кол-во:<br><input class="number" onclick="this.select()" value="1" name="product_amount" type="text">
            </li>
            <li class="buy">
              <input value="" type="submit">
            </li>
          </ul>
        </form>
      </div>
    </body>
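A likely cause of the failure: in _is_equal() the index i is set to 0 but never incremented, so every child of el1 is compared against el2[0]. For the tovar2 divs this means the second child (ul) is compared with the first child of the other div (h2), and the tags differ. Below is a minimal sketch of one possible fix, assuming that "same structure" means the same tag names in the same order and that attributes and text are ignored. The names same_structure and first_difference are hypothetical helpers, not part of lxml; they rely only on .tag, len() and iteration over children, which lxml elements support.

```python
def same_structure(el1, el2):
    """Return True if both elements have the same tag and, recursively,
    the same children in the same order (attributes and text are ignored)."""
    if el1.tag != el2.tag or len(el1) != len(el2):
        return False
    # zip() pairs the i-th child of el1 with the i-th child of el2,
    # so no manual index is needed.
    return all(same_structure(c1, c2) for c1, c2 in zip(el1, el2))


def first_difference(el1, el2, path=None):
    """Return a short description of the first structural mismatch,
    or None if the two trees have the same structure (debugging aid)."""
    here = path or str(el1.tag)
    if el1.tag != el2.tag:
        return f"{here}: tag {el1.tag!r} != {el2.tag!r}"
    if len(el1) != len(el2):
        return f"{here}: {len(el1)} children != {len(el2)} children"
    for index, (c1, c2) in enumerate(zip(el1, el2)):
        diff = first_difference(c1, c2, f"{here}/{c1.tag}[{index}]")
        if diff is not None:
            return diff
    return None
```

With the setup from the question (root = lxml.html.fromstring( buf ), el1 = root[ 0 ], el2 = root[ 1 ]), same_structure( el1, el2 ) should return True for the two tovar2 divs, and first_difference( el1, el2 ) should return None. Note that lxml also yields comments as children, so a comment present in only one fragment counts as a structural difference under this sketch.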