Given the following dataset:
$ head -10 fruits.txt
peach 1
apricot 2
apple 3
haw 1
persimmon 9
orange 2
litchi 9
orange 5
jackfruit 8
crab apple 0
We want to group the rows by fruit name and sum the counts for each name.
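To make the operation concrete before looking at each language, here is a minimal Ruby sketch that runs on only the ten sample rows above. The constant SAMPLE_LINES and the helper fruit_totals are hypothetical names used for illustration; they do not come from any library.

```ruby
# The ten sample rows shown above, one record per line.
SAMPLE_LINES = [
  "peach 1", "apricot 2", "apple 3", "haw 1", "persimmon 9",
  "orange 2", "litchi 9", "orange 5", "jackfruit 8", "crab apple 0"
].freeze

# Hypothetical helper: parse "name count" lines and sum the counts per name.
def fruit_totals(lines)
  lines.each_with_object(Hash.new(0)) do |line, totals|
    # The name may contain spaces ("crab apple"), so the count is taken
    # as the trailing run of digits. A malformed line would raise here.
    name, count = line.match(/\A(.+?)\s+(\d+)\z/).captures
    totals[name] += count.to_i
  end
end

# Sort by total, largest first, and print one "name total" pair per line.
fruit_totals(SAMPLE_LINES)
  .sort_by { |_name, total| -total }
  .each { |name, total| puts "#{name} #{total}" }
```

In this sample only orange appears twice, so its total is 2 + 5 = 7; every other name keeps its single count.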
In Scala this is quite easy, because the standard library has built-in higher-order functions for this kind of calculation (groupMapReduce is available on collections since Scala 2.13). The implementation in the Scala REPL:
scala> import scala.io.Source
import scala.io.Source
scala> val fd = Source.fromFile("tmp/fruits.txt").getLines().toList
scala> val regex = """(.+)\s+(\d+)""".r
val regex: scala.util.matching.Regex = (.+)\s+(\d+)
scala> fd.map{ case regex(x,y) => (x,y.toInt) }.groupMapReduce(_._1)(_._2)(_+_).toList.sortBy(-_._2).foreach(println)
(blueberry,166)
(cumquat,145)
(lemon,145)
(areca nut,139)
(haw,137)
(banana,134)
(greengage,134)
(raspberry,133)
(longan,132)
(nectarine,128)
(flat peach,125)
(tangerine,122)
(blackberry,121)
(litchi,120)
(watermelon,120)
(peach,117)
(bitter orange,117)
(strawberry,116)
(pawpaw papaya,113)
(persimmon,108)
(orange,104)
(tomato,102)
(mango,102)
(plum,99)
(wax apple,98)
(waxberry red bayberry,95)
(grape,92)
(jackfruit,92)
(pineapple,91)
(pear,89)
(loquat,89)
(coconut,86)
(apple,86)
(pomegranate,84)
(shaddock pomelo,84)
(musk melon,84)
(guava,82)
(honey peach,82)
(apricot,81)
(starfruit,80)
(sugar cane,79)
(crab apple,75)
(cherry,73)
Ruby also handles this job well with its Array methods. Ruby has no built-in groupMapReduce method, but I have implemented a similar one, which you can get from this repo. The code below is run in Ruby’s interactive shell (irb).
irb(main):001:0> require './bitfox'
=> true
irb(main):002:0> file = File.readlines('tmp/fruits.txt')
irb(main):011:-> file.map {|s| (x,y) = s.chomp.split("\t"); [x,y.to_i] }.reduceByKey {|x,y| x+y}.sort_by{ |s| -s[1] }
=>
[["blueberry", 166],
["cumquat", 145],
["lemon", 145],
["areca nut", 139],
["haw", 137],
["greengage", 134],
["banana", 134],
["raspberry", 133],
["longan", 132],
["nectarine", 128],
["flat peach", 125],
["tangerine", 122],
["blackberry", 121],
["litchi", 120],
["watermelon", 120],
["peach", 117],
["bitter orange", 117],
["strawberry", 116],
["pawpaw papaya", 113],
["persimmon", 108],
["orange", 104],
["tomato", 102],
["mango", 102],
["plum", 99],
["wax apple", 98],
["waxberry red bayberry", 95],
["jackfruit", 92],
["grape", 92],
["pineapple", 91],
["pear", 89],
["loquat", 89],
["apple", 86],
["coconut", 86],
["pomegranate", 84],
["shaddock pomelo", 84],
["musk melon", 84],
["honey peach", 82],
["guava", 82],
["apricot", 81],
["starfruit", 80],
["sugar cane", 79],
["crab apple", 75],
["cherry", 73]]
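My Ruby implementation above uses a method named reduceByKey rather than Scala’s groupMapReduce. That is to say, Ruby (with my library) and Spark use the method name “reduceByKey”, while Scala uses the name “groupMapReduce”. groupMapReduce takes separate key, map and reduce functions, whereas reduceByKey works on data that is already in (key, value) pairs, but for this job all three give the same totals.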
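The relation between the two names can be sketched in Ruby. The function group_map_reduce below is a hypothetical illustration, not part of Ruby or of my library: it takes a key function, a map function and a reducing block, in the spirit of Scala's groupMapReduce. Calling it with "first element" as the key and "second element" as the value gives what reduceByKey does on pairs, except that this sketch returns a Hash rather than an array of pairs.

```ruby
# Hypothetical Ruby counterpart of Scala's groupMapReduce(key)(map)(reduce).
# key_fn picks the grouping key, map_fn picks the value to keep, and the
# block combines two values that share the same key.
def group_map_reduce(items, key_fn, map_fn)
  items.each_with_object({}) do |item, acc|
    key = key_fn.call(item)
    value = map_fn.call(item)
    # First value for a key is stored as is; later ones are folded in.
    acc[key] = acc.key?(key) ? yield(acc[key], value) : value
  end
end

pairs = [["orange", 2], ["haw", 1], ["orange", 5]]

# Key = fruit name, value = count, reduce = addition.
totals = group_map_reduce(pairs, ->(pair) { pair[0] }, ->(pair) { pair[1] }) { |a, b| a + b }

puts totals["orange"]  # prints 7
puts totals["haw"]     # prints 1
```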
The last one is the Spark implementation. It is quite simple to run this aggregation in spark-shell.
scala> val rdd = sc.textFile("tmp/fruits.txt")
rdd: org.apache.spark.rdd.RDD[String] = tmp/fruits.txt MapPartitionsRDD[1] at textFile at <console>:23
scala> val regex = """(.+)\s+(\d+)""".r
regex: scala.util.matching.Regex = (.+)\s+(\d+)
scala> rdd.map{ case regex(x,y) => (x,y.toInt) }.reduceByKey( _+_ ).sortBy(-_._2).foreach(println)
(blueberry,166)
(cumquat,145)
(lemon,145)
(areca nut,139)
(haw,137)
(banana,134)
(greengage,134)
(raspberry,133)
(longan,132)
(nectarine,128)
(flat peach,125)
(tangerine,122)
(blackberry,121)
(watermelon,120)
(litchi,120)
(bitter orange,117)
(peach,117)
(strawberry,116)
(pawpaw papaya,113)
(persimmon,108)
(orange,104)
(tomato,102)
(mango,102)
(plum,99)
(wax apple,98)
(waxberry red bayberry,95)
(grape,92)
(jackfruit,92)
(pineapple,91)
(pear,89)
(loquat,89)
(apple,86)
(coconut,86)
(musk melon,84)
(shaddock pomelo,84)
(pomegranate,84)
(guava,82)
(honey peach,82)
(apricot,81)
(starfruit,80)
(sugar cane,79)
(crab apple,75)
(cherry,73)
It’s not hard to understand how this groupMapReduce-style method (reduceByKey) is implemented in Ruby. If you check my code on GitHub, you will find that it is very short.
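def reduceByKey
  if block_given?
    # Group the [key, value] pairs by key, then fold each group's values with the block.
    group_by { |pair| pair[0] }.map { |_key, pairs| pairs.reduce { |acc, pair| [acc[0], yield(acc[1], pair[1])] } }
  else
    raise "no block given"
  end
end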
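For readers who want to try the method without the repo, here is a self-contained sketch. It assumes the method is added by reopening Array; the original library may attach it elsewhere (for example to Enumerable), so treat the class wrapper as an assumption. The steps are: group_by collects the pairs that share a first element, and reduce folds each group into one pair by applying the block to the values. A group with a single pair is returned unchanged.

```ruby
# Assumption: the method lives on Array. The original library may define
# it on a different class or module.
class Array
  def reduceByKey
    raise "no block given" unless block_given?

    # group_by yields a hash of key => [[key, v1], [key, v2], ...].
    group_by { |pair| pair[0] }.map do |_key, pairs|
      # Fold the group into a single [key, combined_value] pair.
      pairs.reduce { |acc, pair| [acc[0], yield(acc[1], pair[1])] }
    end
  end
end

rows = [["peach", 1], ["orange", 2], ["litchi", 9], ["orange", 5]]
rows.reduceByKey { |x, y| x + y }.each { |name, total| puts "#{name} #{total}" }
# prints:
# peach 1
# orange 7
# litchi 9
```

The result keeps the order in which each key first appears, because Ruby hashes preserve insertion order; sorting by total is a separate step, as in the irb session above.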
Perl and Python can implement this method as well, but they are not as convenient. For data programming, I find Scala and Spark really the best. If performance is not a concern, Ruby is fun too.