
I am working on a relatively large text-based web classification problem, and I am planning to use the multinomial Naive Bayes classifier in sklearn (Python) and the scrapy framework for the crawling. However, I am a little concerned that sklearn/Python might be too slow for a problem that could involve classification of millions of websites. I have already trained the classifier on several thousand websites from DMOZ. The research framework is as follows:

1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no greater than one). (The number of tokenized words here seems to vary between a few thousand and up to 150K for a sample run of the crawler.)

2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result (a minimal sketch of this step is shown below).
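For concreteness, here is a minimal sketch of step 2 as I understand it. The training data, labels, and the choice of TfidfVectorizer are stand-in assumptions; only the ~50,000-feature cap and MultinomialNB come from the setup above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in training data: one concatenated blob of scraped
# text per domain, with DMOZ-style category labels.
train_texts = [
    "football match report goal league",
    "election parliament vote policy",
]
train_labels = ["Sports", "News"]

# Cap the vocabulary at ~50,000 features, as described in step 2.
vectorizer = TfidfVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# At crawl time: vectorize the text scraped from a new domain and
# record the predicted category for that domain.
X_new = vectorizer.transform(["match report for the league final"])
print(clf.predict(X_new)[0])
```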

My question is whether a Python-based classifier would be up to the task for such a large-scale application, or whether I should try rewriting the classifier (and maybe the scraper and word tokenizer as well) in a faster environment. If so, what might that environment be? Or perhaps Python is enough if accompanied by some parallelization of the code? Thanks


2 Answers


For instance, use the HashingVectorizer and one of the linear classification modules that supports the partial_fit API (e.g. SGDClassifier, Perceptron, or PassiveAggressiveClassifier) to incrementally learn the model without having to vectorize and load all the data in memory upfront, and you should not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
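A minimal out-of-core sketch of that pattern; the toy `batches` stream, the label set, and the hash-space size are illustrative assumptions, not part of the answer:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy stand-in for a stream of mini-batches; in practice each batch would
# be e.g. 10k documents read from disk or from the crawler's output.
batches = [
    (["football match report", "election coverage tonight"], ["Sports", "News"]),
    (["goal scored in extra time", "parliament passes the bill"], ["Sports", "News"]),
]

# HashingVectorizer is stateless, so no vocabulary-building pass is needed.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()  # Perceptron or PassiveAggressiveClassifier also fit here

all_classes = ["Sports", "News"]  # partial_fit needs the full label set up front

for texts, labels in batches:
    X = vectorizer.transform(texts)          # transform only, never fit
    clf.partial_fit(X, labels, classes=all_classes)
```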

However, you should load a small subsample that fits in memory (e.g. 100k documents) and grid-search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class from the master branch. You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) with the same RandomizedSearchCV, on a same-sized or larger pre-vectorized dataset that fits in memory (e.g. a few million documents).
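A hedged sketch of that tuning step. The subsample, the parameter grids, and the pipeline steps are illustrative only, and the import paths follow current scikit-learn (at the time of the answer, RandomizedSearchCV lived in the master branch rather than a released module):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Toy in-memory subsample; in practice this would be ~100k documents.
subsample_texts = [
    "football match report", "goal in extra time", "league standings table",
    "election result tonight", "parliament passes bill", "mayor press briefing",
]
subsample_labels = ["Sports"] * 3 + ["News"] * 3

pipeline = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", SGDClassifier()),
])

# Illustrative search space: vectorizer settings plus the regularization
# strength (alpha for SGDClassifier, C for PassiveAggressiveClassifier).
param_distributions = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__max_features": [None, 50000],
    "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3],
}

search = RandomizedSearchCV(pipeline, param_distributions, n_iter=8, cv=3)
search.fit(subsample_texts, subsample_labels)
print(search.best_params_)
```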

Linear models can also be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn a linear model on each partition independently, and then average the models to obtain the final model.
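A minimal sketch of that averaging trick, assuming two disjoint partitions with the same label set; the synthetic data is a stand-in for real partitions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-ins for two disjoint partitions of the training data.
rng = np.random.RandomState(0)
X_part1, y_part1 = rng.rand(100, 20), rng.randint(0, 2, 100)
X_part2, y_part2 = rng.rand(100, 20), rng.randint(0, 2, 100)

# Train one linear model per partition, independently (possibly on
# different machines).
clf_a = SGDClassifier().fit(X_part1, y_part1)
clf_b = SGDClassifier().fit(X_part2, y_part2)

# Average the learned parameters to obtain the final model.
final = SGDClassifier()
final.coef_ = (clf_a.coef_ + clf_b.coef_) / 2.0
final.intercept_ = (clf_a.intercept_ + clf_b.intercept_) / 2.0
final.classes_ = clf_a.classes_  # both models must share the same label set

print(final.predict(X_part1[:5]))
```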

Answered 2013-04-14T11:31:21.450

Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck, because most critical parts of those libraries are implemented as C extensions.

However, since you are scraping millions of sites, you are going to be limited by the capabilities of a single machine. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.

One example would be to funnel your scraping through Cloud Queues [2].
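PiCloud's queue API is not reproduced here; as a rough stand-in, this sketch shows the same fan-out pattern on a single machine using Python's standard multiprocessing module, with the scrape-and-classify step stubbed out so the example runs on its own:

```python
from multiprocessing import Pool

def classify_domain(domain):
    # Stand-in for the real work: scrape `domain` with the crawler,
    # vectorize the text, and run the trained classifier. Stubbed out
    # here for illustration.
    text = "placeholder text scraped from " + domain
    return domain, "some-category"

if __name__ == "__main__":
    domains = ["example.com", "example.org", "example.net"]  # placeholder work items
    with Pool(processes=4) as pool:
        for domain, category in pool.imap_unordered(classify_domain, domains):
            print(domain, category)
```

With a distributed queue service, each worker machine would pull domain names from the shared queue instead of a local list, but the producer/worker shape stays the same.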

[1] http://www.picloud.com

[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/

Answered 2013-04-26T08:54:31.747