1

EDITED, Please Read Again, As I added some work of mine

My task is to compare templates of two URLS. I am ready with my algorithm. But it takes too much time to give final answer.

I wrote my code in Java using Jsoup and Selenium

Here, Templates means the way any page presents its contents.

Example:-

Any shopping website have page of any Shoes, that contains,

Images in the left.
Price and Size in the right.
Reviews in the bottom.

If two URLS are of any specific product , then it return "Both are from same templates". Example , this link and this link have same template.

If one URL shows any product and another URL shows any category ,then it shows "No match". Example, this link and this link are from different template.

I think that this algorithm requires some optimization, that's why I am posting this question in this forum.

My algorithm

  1. Fetch, parse two input URLS and make their DOM trees.
  2. Then if any page contains , UL and TABLE , then remove that tag. I done this because, may be two pages contains different number of items.
  3. Then, I count number of tags in both URLS. say, initial_tag1, initial_tag2.
  4. Then, I start removing tags that have same position on corresponding pages and same Id and their below subtree, if that tree has number of nodes less than 10.
  5. 然后,如果该树的节点数少于 10,我开始删除在对应页面上具有相同位置和相同类名称及其下面子树的标签。
  6. 然后,如果该树的节点数少于 10,我开始删除没有 Id 和 No Class name 的标签及其下面的子树。
  7. 步骤 4、5、6 具有 (N*N) 复杂度。这里,N 是标签的数量。[这样,每一步DOM树都会收缩]
  8. 当它从这个递归中出来时,我检查 final_tag1 和 final_tag2。
  9. 如果 final_tag1 和 final_tag2 小于 initial_tag1*(0.2) 和 initial_tag2*(0.2) 那么我可以这么说Two URL matched,否则not

我对这个算法想了很多,我发现从 DOM 树中删除节点是一个非常缓慢的过程。这可能是减慢该算法的罪魁祸首。

我从一些极客那里讨论过,并且

they said that use a score for every tag instead of removing them, and add them , and > at the end return (score I Got)/(accumulatedPoints) or something similar, and on the basis of that you decide two URLS are either similar or not.

But I didn't understand this. So can you explain this saying of some geek, or can you give any other optimized algorithm, that solve this problem efficiently.

Thanks in advance. Looking for your kind response.

4

2 回答 2

2

For comparing webpages there basically two ways, the fast and the slow one :

  1. Compare URLS : fast
  2. Compare DOM : slow (and complicated)

In your case, it appears that the first two items match a similar regular expression and the categories match another regexp.

Here is a short JAVA solution

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegexp {
public static void main(String[] args) {
    String URL_ITEM_1 = "http://www.jabong.com/Puma-Flash-Ind-Black-Running-Shoes-187831.html";
    String URL_ITEM_2 = "http://www.jabong.com/Lara-Karen-Full-Sleeve-Black-Polyester-Top-With-Cotton-Lace-196636.html";
    String URL_CATEGORY_1 = "http://www.jabong.com/kids/shoes/floaters/";
    String URL_CATEGORY_2 = "http://www.jabong.com/women/clothing/womens-tops/";

    Pattern itemPattern = Pattern.compile("http://www\\.jabong.com/([\\w\\p{Punct}\\d]+)\\.html");
    Pattern categoryPattern = Pattern.compile("http://www\\.jabong.com/([\\w\\p{Punct}]+/)+");

    System.out.println("Matching items");
    Matcher matcher = itemPattern.matcher(URL_ITEM_1);
    System.out.println(matcher.matches());
    matcher = itemPattern.matcher(URL_ITEM_2);
    System.out.println(matcher.matches());
    matcher = itemPattern.matcher(URL_CATEGORY_1);
    System.out.println(matcher.matches());
    matcher = itemPattern.matcher(URL_CATEGORY_2);
    System.out.println(matcher.matches());

    System.out.println("Matching categories");
    Matcher category = categoryPattern.matcher(URL_ITEM_1);
    System.out.println(category.matches());
    category = categoryPattern.matcher(URL_ITEM_2);
    System.out.println(category.matches());
    category = categoryPattern.matcher(URL_CATEGORY_1);
    System.out.println(category.matches());
    category = categoryPattern.matcher(URL_CATEGORY_2);
    System.out.println(category.matches());
}
}

And the output :

Matching items
true
true
false
false
Matching categories
false
false
true
true

It validates the first two first URLS as being items, the two last as being categories.

I hope it matches your requirement. Feel free to adapt in JS.

于 2013-03-31T09:09:55.900 回答
1

To improve complexity of your algorithm, supposing you are using Jsoup, you must adapt your data structure to your algorithm.

4) What do you mean by position of tag ? the Xpath of the tag ? If yes, precompute this value once for each tag O(n) and store this value in each node. If required you can also store it in a HashMap to retrieve in O(1).

5) Index you tag by class name using MultiMap. You will save lot of computation

6) Index class with no Id, no class name

All these pre computations can be performed in one traversal of the tree so O(n).

Generally if you want to reduce computation, you will have to store more data in the memory. Since DOM page are very small data, this is no problem in your case.

于 2013-03-31T13:09:59.903 回答