[Repost] Writing an Hadoop MapReduce Program in Python

JessYanCoding, published 2019-07-31 11:34 / 1125 views

mapper.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator="\t"):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print("%s%s%d" % (word, separator, 1))

if __name__ == "__main__":
    main()
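
To see what the mapper emits without a Hadoop cluster, the same logic can be run over an in-memory list of lines. This is only a minimal sketch: `map_lines` and the sample input are hypothetical names introduced for illustration and are not part of the original script.

```python
def map_lines(lines, separator="\t"):
    """Yield one 'word<separator>1' record per word, mirroring mapper.py.

    `map_lines` is a hypothetical helper used only for this illustration.
    """
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        for word in line.split():
            yield "%s%s%d" % (word, separator, 1)


# Assumed sample input standing in for STDIN
sample = ["foo foo quux", "labs foo"]
records = list(map_lines(sample))
for record in records:
    print(record)
```

Each word becomes its own tab-delimited record with a count of 1; summing the counts is left entirely to the reducer.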

reducer.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator="\t"):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator="\t"):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for _, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
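
The reducer relies on its input arriving sorted by key, because `groupby` only merges consecutive items that share a key. Hadoop Streaming sorts the mapper output before handing it to the reducer; the sketch below imitates that step locally with `sorted`. The names `map_words` and `reduce_counts` and the sample lines are assumptions made for this illustration, not part of the original scripts.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Hypothetical stand-in for mapper.py: yield (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Hypothetical stand-in for reducer.py: sum counts per run of equal keys."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Assumed sample input
lines = ["foo bar foo", "bar baz"]
mapped = list(map_words(lines))

# Without sorting, repeated words that are not adjacent are never merged
unsorted_result = list(reduce_counts(mapped))

# Sorting by key imitates the shuffle/sort step between map and reduce
sorted_result = list(reduce_counts(sorted(mapped, key=itemgetter(0))))

print(unsorted_result)
print(sorted_result)
```

When testing the real scripts from a shell, the same effect is usually obtained by piping the mapper output through `sort` before the reducer.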

Reposted from: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
