EDITED, Please Read Again, As I added some work of mine
My task is to compare templates of two URLS. I am ready with my algorithm. But it takes too much time to give final answer.
I wrote my code in Java using Jsoup and Selenium
Here, Templates means the way any page presents its contents.
Example:-
Any shopping website have page of any Shoes, that contains,
Images in the left.
Price and Size in the right.
Reviews in the bottom.
If two URLS are of any specific product , then it return "Both are from same templates". Example , this link and this link have same template.
If one URL shows any product and another URL shows any category ,then it shows "No match". Example, this link and this link are from different template.
I think that this algorithm requires some optimization, that's why I am posting this question in this forum.
My algorithm
- Fetch, parse two input URLS and make their DOM trees.
- Then if any page contains , UL and TABLE , then remove that tag. I done this because, may be two pages contains different number of items.
- Then, I count number of tags in both URLS. say, initial_tag1, initial_tag2.
- Then, I start removing tags that have same position on corresponding pages and same Id and their below subtree, if that tree has number of nodes less than 10.
- 然后,如果该树的节点数少于 10,我开始删除在对应页面上具有相同位置和相同类名称及其下面子树的标签。
- 然后,如果该树的节点数少于 10,我开始删除没有 Id 和 No Class name 的标签及其下面的子树。
- 步骤 4、5、6 具有 (N*N) 复杂度。这里,N 是标签的数量。[这样,每一步DOM树都会收缩]
- 当它从这个递归中出来时,我检查 final_tag1 和 final_tag2。
- 如果 final_tag1 和 final_tag2 小于 initial_tag1*(0.2) 和 initial_tag2*(0.2) 那么我可以这么说
Two URL matched
,否则not
。
我对这个算法想了很多,我发现从 DOM 树中删除节点是一个非常缓慢的过程。这可能是减慢该算法的罪魁祸首。
我从一些极客那里讨论过,并且
they said that use a score for every tag instead of removing them, and add them , and > at the end return (score I Got)/(accumulatedPoints) or something similar, and on the basis of that you decide two URLS are either similar or not.
But I didn't understand this. So can you explain this saying of some geek, or can you give any other optimized algorithm, that solve this problem efficiently.
Thanks in advance. Looking for your kind response.