python - How to write a wordcount program in Python without using map reduce

Question

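I am new to both Hadoop and Python, so my first question is how to run a Python script in Hadoop. I am also writing a wordcount program in Python, so my second question is whether this script can be executed without using map reduce. With the code I have written, I see output like the following: Darkness 1 Heaven 2 It 3 Light 4 age 5 age 6 all 7 all 8 authority 9 before 10 before 11 being 12 belief 13 best 14 comparison 15 degree 16 despair 17 direct 18 direct 19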

It is only numbering the words in the list, but what I have to achieve is grouping the words, removing the duplicates, and counting the number of times each word occurs.

Below is my code. Can somebody please tell me where I have made the mistake?

********************************************************
   Wordcount.py
********************************************************

import urllib2
import random
from operator import itemgetter

current_word = {}
current_count = 0
story = 'http://sixty-north.com/c/t.txt'
request = urllib2.Request(story)
response = urllib2.urlopen(request)
each_word = []
words = None
count = 1
same_words ={}
word = []
""" looping the entire file """
for line in response:
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)
        random.shuffle(each_word)
        Sort_word = sorted(each_word)
for words in Sort_word:
    same_words = words.lower(),int(count)
    #print same_words
    #print words
    if not words in current_word :
        current_count = current_count +1
        print '%s\t%s' % (words, current_count)
    else:
        current_count = 1
        #if Sort_word == words.lower():
            #current_count += count
current_count = count
current_word = word
        #print '2. %s\t%s' % (words, current_count)

score 0 · Accepted Answer

To run a Python-based MR job, take a look at:

http://hadoop.apache.org/docs/r1.1.2/streaming.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

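You need to design your code in terms of a Mapper and a Reducer so that Hadoop can execute your Python script. Before you start writing code, read up on the Map-Reduce programming paradigm. It is important to understand the MR programming paradigm and the role that {key, value} pairs play in solving a problem.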
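
To make the Mapper/Reducer split concrete, here is a minimal local sketch in Python 3 (the code in this post is Python 2). The names mapper, reducer, and word_count and the sample sentences are made up for illustration. Everything runs in one process here; in an actual Hadoop Streaming job the two steps would typically be separate scripts that read from stdin and write tab-separated key/value lines to stdout, with Hadoop doing the sort in between.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts of each run of identical keys.

    Expects the pairs sorted by key, which is what the shuffle/sort
    phase provides between the map and reduce steps.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort, and reduce in-process, with no Hadoop involved."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the file in the question
    sample = ["it was the best of times", "It was the worst of times"]
    for word, count in sorted(word_count(sample).items()):
        print("%s\t%d" % (word, count))
```

The key is the lower-cased word and the value is always 1, so grouping by key and summing the values gives the number of occurrences of each word.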

# Modified version of your code above that generates the required output
import urllib2

story = 'http://sixty-north.com/c/t.txt'
request = urllib2.Request(story)
response = urllib2.urlopen(request)
each_word = []   # every word in the file, in order
same_words = {}  # lower-cased word -> number of occurrences
""" looping the entire file """
#Collect All the words into a list
for line in response:
    #print "Line = " , line
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)

# For every collected word, update the dict same_words:
# if the lower-cased word is already a key, increment its count by 1;
# otherwise add it as a new key with a count of 1.
for words in each_word:
    key = words.lower()
    if key not in same_words:
        same_words[key] = 1
    else:
        same_words[key] = same_words[key] + 1

# Print each distinct word with its number of occurrences
for each in same_words:
    print "word = ",each, ", count = ",same_words[each]
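
For comparison, the same grouping and counting without map reduce can be sketched more compactly with collections.Counter from the standard library. This is a Python 3 sketch, not the code from the post: count_words is a hypothetical helper name, the sample lines are invented, and the input is assumed to be any iterable of text lines (for example an open file) rather than the URL used above.

```python
from collections import Counter


def count_words(lines):
    """Return a Counter mapping each lower-cased word to its occurrences."""
    counts = Counter()
    for line in lines:
        # Counter.update adds 1 for every word yielded by the generator
        counts.update(word.lower() for word in line.split())
    return counts


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the file in the question
    sample = ["age of wisdom", "Age of foolishness"]
    for word, count in sorted(count_words(sample).items()):
        print("word = %s, count = %d" % (word, count))
```

Counter handles the "new key or existing key" branch internally, so the explicit if/else from the dict version is not needed.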
