0

I am making use of Hadoop streaming to write a python based HTML grabber. I find that running a single threaded python script is slow. I want to modify it to a multithreaded version. Does anyone know what would be a good number to set the number of threads in the mapper to. I am not sure of the specs of each node of the cluster but I assume that it would support atleast two threads.

4

2 回答 2

0

我没有为 html 抓取器使用 hadoop 流,但这里有一篇文章讨论 urllib2 如何使用多线程(不是多处理包,只是简单的多线程)工作。

希望能有所帮助。

于 2013-10-12T15:05:07.937 回答
0

我尝试在 python 中使用线程,全局解释器锁存在问题。移植代码以使用多处理模块,内部 hadoop 分配与集群中的核心一样多的映射器,因此如果您需要加速,多处理不是要走的路。如果执行得当,多线程可能会带来一些加速

于 2013-08-15T00:01:59.833 回答