Here I provide a dataset of words appearing on public tech lists.
The file is plain text, one word per line. The total line count and size are:
$ wc -l words.txt
283218160 words.txt
$ du -h words.txt
1.5G words.txt
You can download this file from here (tgz compressed).
I ran a test to count the words in this file by grouping and sorting, using three methods: Spark's RDD API, Spark's DataFrame API, and a standalone Scala program.
This is the syntax of the RDD API in pyspark:
>>> rdd = sc.textFile("/tmp/words.txt")
>>> rdd.map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).sortBy(lambda x:x[1],ascending=False).take(20)
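As a side note, an equivalent formulation can skip the full sort and let Spark keep only the top 20 pairs via takeOrdered; this is just a sketch of the alternative, not the variant I timed:
>>> rdd.map(lambda x: (x, 1)) \
...     .reduceByKey(lambda x, y: x + y) \
...     .takeOrdered(20, key=lambda x: -x[1])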
This is the syntax of the DataFrame API in pyspark:
>>> df = spark.read.text("/tmp/words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()
This is the Scala program:
import scala.io.Source
object BigWords extends App {
  val file = Source.fromFile("/tmp/words.txt").getLines()
  val hash = scala.collection.mutable.Map[String, Int]()
  for (x <- file) {
    if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
  }
  hash.toList.sortBy(-_._2).take(20).foreach(println)
}
I compiled the Scala program and ran it:
$ scalac bigwords.scala
$ time scala BigWords
(the,14218320)
(to,11045040)
(a,5677600)
(and,5205760)
(is,4972080)
(i,4447440)
(in,4228200)
(of,3982280)
(on,3899320)
(for,3760800)
(this,3684640)
(you,3485360)
(at,3238480)
(that,3230920)
(it,2925320)
(with,2181160)
(be,2172400)
(not,2124320)
(from,2097120)
(if,1993560)
real 0m49.031s
user 0m50.390s
sys 0m0.734s
As you can see, it takes 49s to finish the job.
Spark's DataFrame API is a bit slower; it takes 56s to finish the job (timed with my iOS stopwatch app):
+-----+--------+
|value| count|
+-----+--------+
| the|14218320|
| to|11045040|
| a| 5677600|
| and| 5205760|
| is| 4972080|
| i| 4447440|
| in| 4228200|
| of| 3982280|
| on| 3899320|
| for| 3760800|
| this| 3684640|
| you| 3485360|
| at| 3238480|
| that| 3230920|
| it| 2925320|
| with| 2181160|
| be| 2172400|
| not| 2124320|
| from| 2097120|
| if| 1993560|
+-----+--------+
only showing top 20 rows
But the Spark RDD API in pyspark is quite slow in my use case; it takes 7m15s to finish the job:
[('the', 14218320), ('to', 11045040), ('a', 5677600), ('and', 5205760), ('is', 4972080), ('i', 4447440), ('in', 4228200), ('of', 3982280), ('on', 3899320), ('for', 3760800), ('this', 3684640), ('you', 3485360), ('at', 3238480), ('that', 3230920), ('it', 2925320), ('with', 2181160), ('be', 2172400), ('not', 2124320), ('from', 2097120), ('if', 1993560)]
I suspected this might be due to the slow Python interpreter, so I re-ran the RDD API in Spark's Scala shell. The syntax and results are below:
scala> val rdd = sc.textFile("/tmp/words.txt")
scala> rdd.map{(_,1)}.reduceByKey{_ + _}.sortBy{-_._2}.take(20)
res1: Array[(String, Int)] = Array((the,14218320), (to,11045040), (a,5677600), (and,5205760), (is,4972080), (i,4447440), (in,4228200), (of,3982280), (on,3899320), (for,3760800), (this,3684640), (you,3485360), (at,3238480), (that,3230920), (it,2925320), (with,2181160), (be,2172400), (not,2124320), (from,2097120), (if,1993560))
This time it takes 1m31s to finish the job.
Here is a summary of the results:
| program           | time  |
|-------------------|-------|
| Scala program     | 49s   |
| pyspark DataFrame | 56s   |
| Scala RDD         | 1m31s |
| pyspark RDD       | 7m15s |
I am quite surprised that pyspark's RDD API is as slow as this, and I want to dig into the cause.
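If the slowdown comes mainly from the per-record Python lambda calls and the serialization between the JVM and the Python workers, one way to test that would be to pre-aggregate inside each partition with collections.Counter, so that far fewer records cross the Python boundary. A rough, untimed sketch:
>>> from collections import Counter
>>> rdd = sc.textFile("/tmp/words.txt")
>>> rdd.mapPartitions(lambda part: Counter(part).items()) \
...     .reduceByKey(lambda x, y: x + y) \
...     .takeOrdered(20, key=lambda x: -x[1])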
Application environment:
- OS: Ubuntu 18.04.6 LTS, a KVM instance
- Spark version: 3.2.0 (with Scala 2.12.15 and Python 3.6.9)
- Scala version: 2.13.7
- CPU: dual AMD EPYC 7302
- RAM: 4GB dedicated
- Disk: 40GB SSD