python - Using sklearn and Python for a large application classification/scraping exercise

Question

I am working on a relatively large text-based web classification problem and I am planning on using the multinomial Naive Bayes classifier in sklearn in python and the scrapy framework for the crawling. However, I am a little concerned that sklearn/python might be too slow for a problem that could involve classifications of millions of websites. I have already trained the classifier on several thousand websites from DMOZ. The research framework is as follows:

1) The crawler lands on a domain name and scrapes the text from 20 links on the site (at a depth of no more than one). In a sample run of the crawler, the number of tokenized words per site varied from a few thousand up to 150K.
2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result.

My question is whether a Python-based classifier would be up to the task for such a large scale application or should I try re-writing the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If yes what might that environment be? Or perhaps Python is enough if accompanied with some parallelization of the code? Thanks

score 5 · Accepted Answer

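Use the HashingVectorizer and one of the linear classification modules that supports the partial_fit API, for instance SGDClassifier, Perceptron or PassiveAggressiveClassifier, to learn the model incrementally without having to vectorize and load all the data in memory upfront. You should then have no issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.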
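As a rough sketch of the incremental approach described in this answer, the snippet below streams mini-batches through a HashingVectorizer into SGDClassifier.partial_fit. The `iter_minibatches` generator, the toy texts and the two class labels are hypothetical stand-ins for the crawler output, and `n_features=2**18` is just one plausible setting, not a recommendation from the answer.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_minibatches():
    """Hypothetical stand-in for batches of (texts, labels) coming from the crawler."""
    yield (["cheap flights and hotel deals", "python machine learning tutorial"],
           ["travel", "tech"])
    yield (["book a beach holiday resort", "open source software library code"],
           ["travel", "tech"])


# The hashing vectorizer is stateless: it needs no fit step and keeps no vocabulary in memory.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(random_state=0)

# partial_fit must be told the full set of classes, since a batch may not contain all of them.
all_classes = ["tech", "travel"]

for texts, labels in iter_minibatches():
    X = vectorizer.transform(texts)  # sparse matrix, one row per document
    clf.partial_fit(X, labels, classes=all_classes)

prediction = clf.predict(vectorizer.transform(["hotel deals for a beach holiday"]))
```

Only one mini-batch is held in memory at a time, so the same loop should carry over to a much larger stream of documents.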

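You should, however, load a small subsample that fits in memory (e.g. 100k documents) and grid search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class from the master branch. You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV, or on a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).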
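A minimal sketch of that parameter search, assuming a tiny in-memory subsample: the texts, labels and candidate parameter values below are made up for illustration. In current scikit-learn releases RandomizedSearchCV is imported from `sklearn.model_selection`.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical subsample; in practice this would be e.g. 100k crawled documents.
texts = [
    "cheap flights and hotel deals", "book a beach holiday resort",
    "city break travel guide", "airline tickets and car rental",
    "python machine learning tutorial", "open source software library code",
    "compiler bug fix release notes", "database query performance tuning",
]
labels = ["travel"] * 4 + ["tech"] * 4

pipeline = Pipeline([
    ("vect", HashingVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Illustrative search space covering both vectorizer and classifier settings.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__n_features": [2**16, 2**18],
    "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3],
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=4, cv=2, random_state=0
)
search.fit(texts, labels)

best_params = search.best_params_
```

The parameters found on the subsample can then be reused when training incrementally on the full dataset.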

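Linear models can also be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn linear models independently on each partition, and then average them to get the final model.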
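The averaging idea could look roughly like this. The two partitions are toy data, and copying the first model and then overwriting its `coef_` and `intercept_` is one possible way to build the averaged model, not necessarily what the answer's author had in mind. Every model must be trained with the same feature space and the same class list so the weight arrays line up.

```python
import copy

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**10)
classes = ["tech", "travel"]

# Hypothetical partitions of the dataset; each could live on a different machine.
partitions = [
    (["cheap flights and hotel deals", "python machine learning tutorial"],
     ["travel", "tech"]),
    (["book a beach holiday resort", "open source software library code"],
     ["travel", "tech"]),
]

# Train one linear model per partition, independently of the others.
models = []
for texts, labels in partitions:
    model = SGDClassifier(random_state=0)
    model.partial_fit(vectorizer.transform(texts), labels, classes=classes)
    models.append(model)

# Build the final model by averaging the learned weights and intercepts.
averaged = copy.deepcopy(models[0])
averaged.coef_ = np.mean([m.coef_ for m in models], axis=0)
averaged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)

prediction = averaged.predict(vectorizer.transform(["beach hotel deals"]))
```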

score 3 · Accepted Answer

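Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck, as most critical parts of those libraries are implemented as C extensions.

However, since you are scraping millions of sites, you will be bounded by the capabilities of a single machine. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.

An example would be to funnel your scraping through Cloud Queues [2].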

[1] http://www.picloud.com

[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/
