I would like to find similarities (more precisely, the recurring patterns) in one or more HTML pages without knowing their structure in advance.
Let's look at a heavily simplified example in which the content and all attributes have been removed:
01 <div>
02 <div><table>.*</table></div>
03 <div><table>.*</table></div>
04
05 <div><p></p><img/></div>
06 <div><p></p><img/></div>
07 <div><p></p><img/></div>
08
09 <div><table>.*</table></div>
10 <div><table>.*</table></div>
11 </div>
We (humans) can see that there are two different types of patterns. The first one (with the table) occurs four times, and the other one (with the image tag) occurs three times. That is easy for humans.
The perfect module, which I would like to write, would return a result set like this:
$VAR = [ { reduced_pattern => '<div><table>.*</table></div>',
           real_pattern    => '<!-- the real pattern -->',
           hits => [ { line      => 2,
                       content   => "<div><table>foo 1</table></div>",
                       relevance => 0.9,
                     },
                     { line      => 3,
                       content   => "<div><table>foo 2</table></div>",
                       relevance => 0.95,
                     },
                     { line      => 9,
                       content   => "<div><table>foo 3</table></div>",
                       relevance => 0.87,
                     },
                     { line      => 10,
                       content   => "<div><table>foo 4</table></div>",
                       relevance => 0.80,
                     },
                   ],
         },
         { reduced_pattern => '<div><p></p><img/></div>',
           real_pattern    => '<!-- the real pattern -->',
           hits => [ { line      => 5,
                       content   => "<div><p>bar 1</p><img/></div>",
                       relevance => 0.79,
                     },
                     { line      => 6,
                       content   => "<div><p>bar 2</p><img/></div>",
                       relevance => 0.95,
                     },
                     { line      => 7,
                       content   => "<div><p>bar 3</p><img/></div>",
                       relevance => 0.80,
                     },
                   ],
         },
       ];
Something like that.
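To make the target more concrete, here is a minimal sketch of a naive baseline, not the algorithm I am looking for. It assumes one candidate element per line, reduces each line to its tag skeleton, and groups lines whose skeletons are exactly identical, so it cannot produce a relevance score or match "almost equal" structures. The helper names `skeleton` and `group_by_skeleton` are made up for this example, and the regex-based tag extraction is only adequate for this simplified input (a real page would need an HTML parser).

```perl
use strict;
use warnings;

# Reduce one line of HTML to its tag skeleton: text content and
# attributes are dropped, only the tags themselves are kept.
# (Hypothetical helper; regex parsing is an assumption for toy input.)
sub skeleton {
    my ($html) = @_;
    my @tags;
    while ($html =~ m{<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)\s*>}g) {
        push @tags, "<$1$2$3>";
    }
    return join '', @tags;
}

# Group lines by identical skeleton and keep only skeletons that
# occur more than once. Line numbers are 1-based.
sub group_by_skeleton {
    my (@lines) = @_;
    my (%group, @order);
    my $line_no = 0;
    for my $content (@lines) {
        $line_no++;
        my $key = skeleton($content);
        next if $key eq '';                       # line without any tag
        push @order, $key unless exists $group{$key};
        push @{ $group{$key} }, { line => $line_no, content => $content };
    }
    return [
        map  { +{ reduced_pattern => $_, hits => $group{$_} } }
        grep { @{ $group{$_} } > 1 } @order
    ];
}

my @page = (
    '<div>',
    '<div><table>foo 1</table></div>',
    '<div><table>foo 2</table></div>',
    '',
    '<div><p>bar 1</p><img/></div>',
    '<div><p>bar 2</p><img/></div>',
    '<div><p>bar 3</p><img/></div>',
    '',
    '<div><table>foo 3</table></div>',
    '<div><table>foo 4</table></div>',
    '</div>',
);

my $result = group_by_skeleton(@page);
for my $g (@$result) {
    my @hit_lines = map { $_->{line} } @{ $g->{hits} };
    print "$g->{reduced_pattern}: lines @hit_lines\n";
}
```

For the example page this finds the table group on lines 2, 3, 9, 10 and the image group on lines 5, 6, 7. It breaks down as soon as two tables differ in their inner structure, which is exactly the part I do not know how to solve.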
The question is about the algorithm. I searched the web, SO, and CPAN for 'Algorithm to identify marked similarities' and similar phrases, but did not find anything that matched well. (I know there is a lot of material, and I have read much of it.)
Regular expressions are out of the question, because you have to know in advance what you are searching for. I assume it can be done with neural networks, but the training may be difficult. Fuzzy hashes (as in ssdeep) may also be a solution. Or would it be better to start in the direction of k-means or Mahout?
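As a point of comparison for these options, the kind of relevance number I have in mind could perhaps come from a plain set similarity. The sketch below is neither a fuzzy hash nor k-means: it computes the Jaccard similarity over pairs of adjacent tags ("shingles" of size 2). The helper names `tag_shingles` and `jaccard`, the shingle size, and the regex-based tag extraction are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Extract the tag names of a line (closing tags keep their slash) and
# build the set of adjacent tag pairs, i.e. shingles of size 2.
sub tag_shingles {
    my ($html) = @_;
    my @tags = $html =~ m{<\s*(/?[A-Za-z][A-Za-z0-9]*)}g;
    my %set;
    $set{"$tags[$_] $tags[$_ + 1]"} = 1 for 0 .. $#tags - 1;
    return \%set;
}

# Jaccard similarity: size of the intersection divided by the size of
# the union, a value between 0 (disjoint) and 1 (identical sets).
sub jaccard {
    my ($set_a, $set_b) = @_;
    my %union = (%$set_a, %$set_b);
    return 0 unless %union;
    my $common = grep { exists $set_b->{$_} } keys %$set_a;
    return $common / scalar(keys %union);
}

my $flat   = tag_shingles('<div><table>foo 1</table></div>');
my $same   = tag_shingles('<div><table>foo 2</table></div>');
my $nested = tag_shingles('<div><table><tr><td>x</td></tr></table></div>');
my $image  = tag_shingles('<div><p>bar 1</p><img/></div>');

printf "%.2f\n", jaccard($flat, $same);    # 1.00, identical structure
printf "%.2f\n", jaccard($flat, $nested);  # 0.25, same wrapper, different inner rows
printf "%.2f\n", jaccard($flat, $image);   # 0.00, no shared tag pairs
```

Such a score could feed a threshold or a clustering step, but I do not know whether it holds up on real pages, which is why I am asking about better-founded approaches.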
Thanks for your answers and comments!