python - Threading with Hadoop Streaming

Question

I am making use of Hadoop streaming to write a python based HTML grabber. I find that running a single threaded python script is slow. I want to modify it to a multithreaded version. Does anyone know what would be a good number to set the number of threads in the mapper to. I am not sure of the specs of each node of the cluster but I assume that it would support atleast two threads.

score 0 · Accepted Answer

我没有为 html 抓取器使用 hadoop 流，但这里有一篇文章讨论 urllib2 如何使用多线程（不是多处理包，只是简单的多线程）工作。

希望能有所帮助。

score 0 · Accepted Answer

我尝试在 python 中使用线程，全局解释器锁存在问题。移植代码以使用多处理模块，内部 hadoop 分配与集群中的核心一样多的映射器，因此如果您需要加速，多处理不是要走的路。如果执行得当，多线程可能会带来一些加速

python - Threading with Hadoop Streaming

2 回答 2

Related

Reference