string - How to find repeated element structures in XML/HTML

Question

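I'm currently trying to solve a programming problem. I want to find repeated structures in an arbitrary HTML page and retrieve the values of those elements.

For example, I have an HTML page with repeated elements like this: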

<html>
<body>
  <ul>
     <li>green</li>
     <li>orange</li>
     <li>red</li>
  </ul>
</body>
</html>

In this code I want to detect that there is a repeated block (the "li" items) and extract their values. Another HTML example:

<table>
   <tr>
      <td>1</td>
      <td>John</td>
   </tr>
   <tr>
      <td>2</td>
      <td>Simon</td>
   </tr>
</table>

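In this example I want to detect that the structure repeats and get the values [1,John] and [2,Simon] from it.

My question is: is there a simple algorithm for doing this or, if not, how would you approach it?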

score 2 · Accepted Answer

Shown below is a fairly basic Python program that detects repeated tr-td-td tag sequences and repeated td tags. With your second HTML example stored in the file xml.html, the program prints:

tr.td.td 

td 1
td John
tr.td.td 

td 2
td Simon
Counter({'td': 4, 'tr.td.td': 2, 'table.tr.tr': 1})

#!/usr/bin/env python3
from xml.etree import ElementTree as ET
from collections import Counter

def sot(r, depth):
    # Build a signature from this tag plus the tags of its direct children,
    # store it in r.tail (reused here as scratch space) and count it.
    tags = r.tag
    for e in r:
        tags += '.' + sot(e, depth+1)
    r.tail = tags
    cc[r.tail] += 1
    return r.tag

def tot(r, depth):
    # Print the signature and text of every element whose signature repeats.
    if cc[r.tail] > 1:
        print(r.tail, r.text)
    for e in r:
        tot(e, depth+1)

cc = Counter()
p = ET.parse("xml.html")
sot(p.getroot(), 0)
tot(p.getroot(), 0)
print(cc)
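
The program above prints matching elements one at a time. To get the values grouped per repeated block, as in the [1,John] and [2,Simon] form the question asks for, one possible variation is sketched below. It is not part of the original answer: the helper names `signature`, `text_values` and `repeated_groups` are made up for illustration, the signature covers the whole subtree rather than only direct children, and the input is assumed to be well-formed enough for ElementTree to parse.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET


def signature(elem):
    """Full recursive tag shape, e.g. 'tr(td,td)' for a row with two cells."""
    children = ",".join(signature(child) for child in elem)
    return f"{elem.tag}({children})" if children else elem.tag


def text_values(elem):
    """Stripped, non-empty text found in elem and its descendants, in document order."""
    return [t.strip() for t in elem.itertext() if t.strip()]


def repeated_groups(root):
    """For every parent, group its children by signature and keep groups of two or more."""
    results = []
    for parent in root.iter():
        groups = defaultdict(list)
        for child in parent:
            groups[signature(child)].append(child)
        for sig, members in groups.items():
            if len(members) > 1:
                results.append((sig, [text_values(m) for m in members]))
    return results


# The table from the question, inlined so the sketch runs without a file.
table = ET.fromstring(
    "<table><tr><td>1</td><td>John</td></tr>"
    "<tr><td>2</td><td>Simon</td></tr></table>"
)
for sig, rows in repeated_groups(table):
    print(sig, rows)
```

For the table this reports the `tr(td,td)` group with the rows `['1', 'John']` and `['2', 'Simon']`, followed by the repeated `td` cells inside each row. Only siblings under the same parent are compared, so identical structures in unrelated parts of a page are not merged into one group.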
