html - Algorithm to identify marked similarities in html

Question

I would like to find similarities (exactly the patterns of it) in one or more HTML pages without knowing thier structure.

Lets talk about a really simplified example where the content and all attributes are removed.

01 <div>
02  <div><table>.*</table></div>
03  <div><table>.*</table></div>
04
05  <div><p></p><img/></div>
06  <div><p></p><img/></div>
07  <div><p></p><img/></div>    
08
09  <div><table>.*</table></div>
10  <div><table>.*</table></div>
11 </div>

We (humans) can see that there are two differnt types of patterns. The first one (with the table) occurs four times. And there is an other one with an image tag, three times. That is easy (for humans).

The perfect module, I would like to write, would return a resultset like:

$VAR = [ { reduced_pattern => '<div><table>.*</table>div>',
           real_pattern => '<!-- the real pattern -->',
           hits => [{ line => 02,
                      content => "<div><table>foo 1</table></div>",
                      relevance => 0,9,
                    },
                    { line => 03,
                      content => "<div><table>foo 2</table></div>",
                      relevance => 0,95,
                    },   
                    { line => 09,
                      content => "<div><table>foo 3</table></div>",
                      relevance => 0,87
                    },
                    { line => 10,
                      content => "<div><table>foo 4</table></div>",
                      relevance => 0,80
                    }
                   ]
         }, 
         { real_pattern => '<!-- the real pattern -->',
                 hits => [{ line => 05,
                      content => "<div><p>bar 1</p><img/></div>",
                      relevance => 0,79,
                    },
                    { line => 06,
                      content => "<div><p>bar 2</p><img/></div>",
                      relevance => 0,95,
                    },   
                    { line => 07,
                      content => "<div><p>bar 3</p><img/></div>",
                      relevance => 0,80
                    }
                   ],
         }
        ];

Something like that.

The question is about the algorithm. I searched for 'Algorithm to identify marked similarities.' and alike sentences on the web, here on SO and on CPAN, but did not find something that matched well. (I know there are a lot and I read a lot of them.)

RegExp does not come in consideration, because you have to know what you are searching for. I assume that it can be done with neuronal networks, but the learning is may be difficult. Also Fuzzy Hashes (like in sssdeep) may be a solution. Or should I better start in the direction of k-means or mahout?

Thanks for your answers and comments!

score 0 · Accepted Answer

我也很难完全理解您的问题，但是如果您希望将页面分为两种不同的类型，那么要研究的一种算法可能是支持向量机。如果您能够生成预先分类的数据集，那么隐马尔可夫模型可能是要走的路。正如 xhudik 建议的那样，甚至可以使用决策树。

抱歉，如果您正在寻找具体的答案，但我认为需要更多信息（例如 2-3 个 html 示例以及如何对这些示例进行分类）来确定您想要实现的目标。

编辑：另外，你有没有研究过集群？如果您想要的是智能统计分组， Weka和Orange等产品可以为您提供帮助。

html - Algorithm to identify marked similarities in html

1 回答 1

Related

Reference