I am trying to come up with a strategy for detecting the main content div of a site. By "main content div" I mean the div that contains the site's header, body, and footer.
Detecting it reliably is proving to be very difficult and slow.
For instance, on http://www.goo.ne.jp/, I would want to detect the div with id="bodyWrapper" or id="minWidthInbox", because these divs contain the site's main content.
I have tried many algorithms, but because sites have odd and inconsistent structures, no single algorithm works for every site.
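To make the goal concrete, here is a minimal sketch of one simple heuristic of this kind. It is an illustration under stated assumptions, not a general solution: pick the deepest div whose visible text still covers most of the page's text. The names `DivTextCounter` and `guess_wrapper_div`, the 0.9 coverage threshold, and the sample markup (apart from the id "bodyWrapper") are all hypothetical. It uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class DivTextCounter(HTMLParser):
    """Record each <div>'s nesting depth and the amount of visible text inside it."""

    def __init__(self):
        super().__init__()
        self.open_divs = []   # stack of records for divs that are currently open
        self.divs = []        # every div seen, in document order
        self.total_text = 0   # visible text length of the whole page
        self.skip = 0         # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "div":
            record = {
                "id": dict(attrs).get("id"),
                "depth": len(self.open_divs) + 1,
                "text": 0,
            }
            self.open_divs.append(record)
            self.divs.append(record)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        if self.skip:
            return  # ignore script and style contents
        n = len(data.strip())
        self.total_text += n
        # Text counts toward every div that currently encloses it.
        for record in self.open_divs:
            record["text"] += n


def guess_wrapper_div(html, coverage=0.9):
    """Return the id of the deepest div holding at least `coverage` of the page text.

    Returns None if no div qualifies (or the chosen div has no id attribute).
    The threshold is an assumed value and would need tuning per site.
    """
    parser = DivTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_text == 0:
        return None
    candidates = [d for d in parser.divs
                  if d["text"] >= coverage * parser.total_text]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["depth"])["id"]


# Hypothetical sample page; only the id "bodyWrapper" comes from the example above.
sample = """
<html><body>
<div id="bodyWrapper">
  <div id="header">Site name</div>
  <div id="content">Lots of article text goes here, much longer than the rest.</div>
  <div id="footer">Copyright</div>
</div>
</body></html>
"""
print(guess_wrapper_div(sample))  # prints "bodyWrapper"
```

Because the sketch only looks at div elements, it returns None for a page laid out purely with tables, and it can be misled by malformed or inconsistently nested markup.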
Table-based layouts are especially hard to detect. :-(
How should I approach this problem?