
As part of a project, a few others and I are currently working on a URL classifier. What we are trying to implement is actually quite simple: we just look at a URL, find the relevant keywords occurring in it, and classify the page accordingly.

For example, if the URL is http://cnnworld/sports/abcd, we would classify it under the category "Sports".

To do this, we have a database with mappings of the form: keyword -> category.

What we are doing right now is, for each URL, we read through all the entries in the database and use String.find() to check whether each keyword occurs in the URL. Once one is found, we stop.
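Concretely, the current approach looks something like this minimal sketch (the map stands in for the database rows, and the sample keywords are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NaiveScan {
    // Returns the first category whose keyword occurs in the URL, or null.
    // This mirrors the scan-every-row approach described above.
    static String classify(String url, Map<String, String> keywordCategory) {
        for (Map.Entry<String, String> row : keywordCategory.entrySet()) {
            if (url.contains(row.getKey())) {
                return row.getValue(); // stop at the first match
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the keyword -> category rows read back from the database
        Map<String, String> db = new LinkedHashMap<>();
        db.put("finance", "Finance");
        db.put("sports", "Sports");
        System.out.println(classify("http://cnnworld/sports/abcd", db)); // Sports
    }
}
```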

But this approach has a couple of problems, mainly:

(i) Our database is very large, so such repeated querying runs extremely slowly.

(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to ensure this would be to keep querying the database even after a category match is found, but that would only make things slower.

I was thinking of alternatives and was wondering whether the reverse could be done - parse the URL, find the words occurring in it, and then query the database for only those words.

A naive algorithm for this would run in O(n^2) - querying the database for every substring that occurs in the URL.
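To see the cost, here is a rough sketch of that substring enumeration; a length-n URL has n*(n+1)/2 substrings, each of which would mean a database lookup (for the 10-character example below, that is already 55):

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringEnumeration {
    // Collects every substring of the URL; there are n*(n+1)/2 of them,
    // which is why querying the database for each one is O(n^2) queries
    static List<String> allSubstrings(String url) {
        List<String> subs = new ArrayList<>();
        for (int i = 0; i < url.length(); i++) {
            for (int j = i + 1; j <= url.length(); j++) {
                subs.add(url.substring(i, j));
            }
        }
        return subs;
    }

    public static void main(String[] args) {
        System.out.println(allSubstrings("cnn/sports").size()); // 55
    }
}
```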

I was wondering whether there is a better way to do this. Any ideas?? Thanks in advance :)


4 Answers


In our commercial classifier we have a database of 4M keywords :) and we also search the body of the HTML. There are several ways to tackle this:

  1. Use Aho-Corasick. We used a modified version of the algorithm specially suited to web content; for example, it treats tab, space, \r and \n as whitespace and collapses runs of them, so two spaces count as one, and it ignores upper/lower case.
  2. Another option is to put all your keywords inside a tree (e.g. std::map), which makes the search very fast. The downside is memory consumption, and a lot of it, but if it runs on a server you won't feel it.
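Option 2 might look roughly like this in Java (a sketch, not the commercial implementation; TreeMap plays the role of std::map, and the sample keywords and URL are assumptions). Each URL token costs one O(log n) probe (or O(1) with a HashMap) instead of a scan over all keywords:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TokenLookup {
    // Looks up each URL token in an in-memory keyword -> category map,
    // collecting every matching category (so multi-category pages work)
    static Set<String> classify(String url, Map<String, String> keywordCategory) {
        Set<String> categories = new TreeSet<>();
        // Split the URL on anything that is not a letter or digit
        for (String token : url.toLowerCase().split("[^a-z0-9]+")) {
            String category = keywordCategory.get(token);
            if (category != null) {
                categories.add(category);
            }
        }
        return categories;
    }

    public static void main(String[] args) {
        // TreeMap mirrors the std::map suggestion; loaded once from the database
        Map<String, String> db = new TreeMap<>();
        db.put("sports", "Sports");
        db.put("cnn", "News");
        System.out.println(classify("http://cnn.com/sports/abcd", db)); // [News, Sports]
    }
}
```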
Answered 2012-10-14T00:52:57.137

The Aho-Corasick algorithm is best suited for finding embedded strings in a single pass. You can build a tree (an Aho-Corasick trie) of your keywords, where each terminal node holds a number mapped to a specific keyword.

Then you simply traverse the tree with the URL string. Whenever you hit one of those numbers (a flag, in our scenario), it means a mapped category was found. Look the number up in a hash map to get the corresponding category for further use.
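The traversal described above could be sketched as follows. This is a minimal, unoptimized Aho-Corasick in Java, not the answerer's implementation, and the sample keywords are made up; it reports every keyword found in one pass over the URL:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AhoCorasick {
    // Trie node: children, failure link, and ids of keywords ending here
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<Integer> out = new ArrayList<>();
    }

    private final Node root = new Node();
    private final List<String> keywords = new ArrayList<>();

    public void add(String keyword) {
        Node cur = root;
        for (char c : keyword.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.out.add(keywords.size()); // terminal node carries the keyword's id
        keywords.add(keyword);
    }

    // Build failure links with a BFS over the trie
    public void build() {
        root.fail = root;
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node candidate = f.next.get(c);
                child.fail = (candidate != null && candidate != child) ? candidate : root;
                child.out.addAll(child.fail.out); // inherit matches ending at the fail node
                queue.add(child);
            }
        }
    }

    // Single pass over the text; returns every keyword that occurs in it
    public Set<String> search(String text) {
        Set<String> found = new HashSet<>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            for (int id : cur.out) found.add(keywords.get(id));
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick();
        ac.add("sports");
        ac.add("cnn");
        ac.build();
        System.out.println(ac.search("http://cnnworld/sports/abcd"));
    }
}
```

In a real classifier the `out` ids would index into the keyword -> category hash map mentioned above, rather than the keyword list used here for illustration.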

I think this will help you. See this link: a nice animation of Aho-Corasick by ivan.

Answered 2012-08-21T09:07:11.807

I think your suggestion of splitting the URL to find useful bits and then querying for just those items sounds like a decent approach.

I threw together some Java that might help illustrate, code-wise, what I think this would entail. The most valuable parts are probably the regexes, but I hope the general algorithm of it helps too:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

public class CategoryParser 
{
    /** The db field that keywords should be checked against */
    private static final String DB_KEYWORD_FIELD_NAME = "keyword";

    /** The db field that categories should be pulled from */
    private static final String DB_CATEGORY_FIELD_NAME = "category";

    /** The name of the table to query */
    private static final String DB_TABLE_NAME = "KeywordCategoryMap";

    /**
     * This method takes a URL and from that text alone determines what categories that URL belongs in.
     * @param url - String URL to categorize
     * @return categories - A List<String> of categories the URL seemingly belongs in
     */
    public static List<String> getCategoriesFromUrl(String url) {

        // Clean the URL to remove useless bits and encoding artifacts
        String normalizedUrl = normalizeURL(url);

        // Break the url apart and get the good stuff
        String[] keywords = tokenizeURL(normalizedUrl);

        // Construct the query we can query the database with
        String query = constructKeywordCategoryQuery(keywords);

        System.out.println("Generated Query: " + query);

        // At this point, you'd need to fire this query off to your database,
        // and the results you'd get back should each be a valid category
        // for your URL. This code is not provided because it's very implementation specific,
        // and you already know how to deal with databases.


        // Returning null to make this compile, even though you'd obviously want to return the
        // actual List of Strings
        return null;
    }

    /**
     * Removes the protocol, if it exists, from the front and
     * removes any random encoding characters
     * Extend this to do other url cleaning/pre-processing
     * @param url - The String URL to normalize
     * @return normalizedUrl - The String URL that has no junk or surprises
     */
    private static String normalizeURL(String url)
    {
        // Decode URL to remove any %20 type stuff
        String normalizedUrl = url;
        try {
            // I've used a URLDecoder that's part of Java here,
            // but this functionality exists in most modern languages
            // and is universally called url decoding
            normalizedUrl = URLDecoder.decode(url, "UTF-8");
        }
        catch(UnsupportedEncodingException uee)
        {
            System.err.println("Unable to Decode URL. Decoding skipped.");
            uee.printStackTrace();
        }

        // Remove the protocol, http:// ftp:// or similar from the front
        if (normalizedUrl.contains("://"))
        {
            normalizedUrl = normalizedUrl.split(":\\/\\/")[1];
        }

        // Room here to do more pre-processing

        return normalizedUrl;
    }

    /**
     * Takes apart the url into the pieces that make at least some sense
     * This doesn't guarantee that each token is a potentially valid keyword, however
     * because that would require actually iterating over them again, which might be
     * seen as a waste.
     * @param url - Url to be tokenized
     * @return tokens - A String array of all the tokens
     */
    private static String[] tokenizeURL(String url)
    {
        // I assume that we're going to use the whole URL to find tokens in
        // If you want to just look in the GET parameters, or you want to ignore the domain
        // or you want to use the domain as a token itself, that would have to be
        // processed above the next line, and only the remaining parts split
        String[] tokens = url.split("\\b|_");

        // One could alternatively use a more complex regex to remove more invalid matches
        // but this is subject to your (?:in)?ability to actually write the regex you want

        // These next two get rid of tokens that are too short, also.

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 character long
        //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 or 2 characters long
        //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");

        return tokens;
    }

    private static String constructKeywordCategoryQuery(String[] keywords)
    {
        // This will hold our WHERE body, keyword OR keyword2 OR keyword3
        StringBuilder whereItems = new StringBuilder();

        // Potential query, if we find anything valid
        String query = null;

        // Iterate over every found token
        for (String keyword : keywords)
        {
            // Reject invalid keywords
            if (isKeywordValid(keyword))
            {
                // If we need an OR
                if (whereItems.length() > 0)
                {
                    whereItems.append(" OR ");
                }

                // Simply append this item to the query
                // Yields something like "keyword='thisKeyword'"
                whereItems.append(DB_KEYWORD_FIELD_NAME);
                whereItems.append("='");
                whereItems.append(keyword);
                whereItems.append("'");
            }
        }

        // If a valid keyword actually made it into the query
        if (whereItems.length() > 0)
        {
            query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
                    + " WHERE " + whereItems.toString() + ";";
        }

        return query;
    }

    private static boolean isKeywordValid(String keyword)
    {
        // Keywords better be at least 2 characters long
        return keyword.length() > 1
                // And they better be only composed of letters and numbers
                && keyword.matches("\\w+")
                // And they better not be *just* numbers
                // && !keyword.matches("\\d+") // If you want this
                ;
    }

    // How this would be used
    public static void main(String[] args)
    {
        List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
        List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
    }
}

The generated query for the SO link looks like this:

SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification'

There's plenty of room for optimization, but I imagine it's considerably faster than checking the string for every possible keyword.
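One such optimization worth noting: since the tokens come straight from untrusted URLs, it may be safer to build a parameterized IN list than to concatenate the keywords into the SQL string. A sketch, reusing the table and column names from the code above, with the token values bound separately via PreparedStatement:

```java
import java.util.Collections;

public class KeywordQueryBuilder {
    // Builds a parameterized IN query; the actual keyword values are bound
    // later with PreparedStatement#setString, avoiding SQL injection
    static String buildCategoryQuery(int keywordCount) {
        String placeholders = String.join(",", Collections.nCopies(keywordCount, "?"));
        return "SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword IN ("
                + placeholders + ");";
    }

    public static void main(String[] args) {
        System.out.println(buildCategoryQuery(3));
        // -> SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword IN (?,?,?);
        // Then, for each keyword: preparedStatement.setString(i + 1, keywords[i]);
    }
}
```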

Answered 2012-07-06T20:43:52.990

If you have (many) fewer categories than keywords, you could create a regex for each category that matches any of that category's keywords. Then you'd run your URL against each category's regex. This would also address the issue of matching multiple categories.
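A sketch of that idea, assuming a hypothetical category-to-keywords mapping (one compiled alternation per category, matched with find() so a URL can hit several categories):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class CategoryRegexMatcher {
    // Every matching pattern contributes its category, so multi-category URLs work
    static List<String> classify(String url, Map<String, Pattern> categoryPatterns) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : categoryPatterns.entrySet()) {
            if (e.getValue().matcher(url).find()) {
                matches.add(e.getKey());
            }
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) {
        // Assumed category -> keywords lists; in practice built from the database
        Map<String, List<String>> categoryKeywords = new HashMap<>();
        categoryKeywords.put("Sports", Arrays.asList("sports", "football", "tennis"));
        categoryKeywords.put("News", Arrays.asList("cnn", "news", "breaking"));

        // Compile one alternation regex per category, once, up front
        Map<String, Pattern> patterns = new HashMap<>();
        for (Map.Entry<String, List<String>> e : categoryKeywords.entrySet()) {
            patterns.put(e.getKey(),
                    Pattern.compile(String.join("|", e.getValue()), Pattern.CASE_INSENSITIVE));
        }
        System.out.println(classify("http://cnnworld/sports/abcd", patterns)); // [News, Sports]
    }
}
```

Note that, unlike token splitting, find() also matches keywords embedded in larger words ("cnn" inside "cnnworld"), which may or may not be what you want.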

Answered 2012-04-06T17:42:56.763