I know this benchmark may not mean much, but I would like to give a simple comparison of the run speed of Scala, Ruby and Perl.
The results up front: for this job Perl is the fastest at 1.9s, Ruby is second at 3.0s, and the Scala script is the slowest at 4.0s.
The two input files used by the scripts can be downloaded from here:
words.txt.tgz (11MB)
stopwords.txt.tgz (4KB)
Here is the Scala script:
import scala.io.Source
val li = Source.fromFile("words.txt").getLines()
val set_sw = Source.fromFile("stopwords.txt").getLines().toSet
val hash = scala.collection.mutable.Map[String,Int]()
for (x <- li) {
  if ( ! set_sw.contains(x) ) {
    if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
  }
}
val sorted = hash.toList.sortBy(-_._2)
sorted.take(20).foreach {println}
Here is the Ruby script:
stopwords = {}
File.open("stopwords.txt").each_line do |s|
  s.strip!
  stopwords[s] =1
end
count = {}
File.open("words.txt").each_line do |s|
  s.strip!
  if ! stopwords.has_key?(s)
    if count.has_key?(s)
      count[s] += 1
    else
      count[s] = 1
    end
  end
end
z = count.sort {|a1,a2| a2[1]<=>a1[1]}
z.take(20).each do |s| puts "#{s[0]} -> #{s[1]}" end
Here is the Perl script:
use strict;
my %stopwords;
open HD,"stopwords.txt" or die $!;
while(<HD>) {
    chomp;
    $stopwords{$_} =1;
}
close HD;
my %count;
open HD,"words.txt" or die $!;
while(<HD>) {
    chomp;
    unless ( $stopwords{$_} ) {
        $count{$_} ++;
    }
}
close HD;
my $i=0;
for (sort {$count{$b} <=> $count{$a}} keys %count) {
    if ($i < 20) {
        print "$_ -> $count{$_}\n"
    } else {
        last;
    }
    $i ++;
}
The basic idea of the three scripts is the same. The one difference is that the Scala script keeps the stopwords in a Set, while the Perl and Ruby scripts keep them in a Hash.
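To make the Set versus Hash difference concrete, here is a minimal Ruby sketch that runs the same stopword filter both ways. The sample lines are made up for illustration and stand in for stopwords.txt and words.txt. Both containers are hash-based lookups, so this choice is unlikely to explain the timing gap, though I did not measure that.

```ruby
require 'set'

# Tiny in-memory stand-ins for stopwords.txt and words.txt (made-up sample data).
stopword_lines = ["the\n", "a\n", "of\n"]
word_lines     = ["the\n", "flink\n", "mail\n", "a\n", "flink\n"]

# Variant 1: stopwords as Hash keys, as in the Ruby and Perl scripts.
stop_hash = {}
stopword_lines.each { |s| stop_hash[s.strip] = 1 }

# Variant 2: stopwords in a Set, mirroring the Scala script's toSet.
stop_set = Set.new(stopword_lines.map(&:strip))

# The filter is the same in both cases; only the container differs.
kept_by_hash = word_lines.map(&:strip).reject { |w| stop_hash.has_key?(w) }
kept_by_set  = word_lines.map(&:strip).reject { |w| stop_set.include?(w) }

puts kept_by_hash.inspect
puts kept_by_set.inspect
```

Both lines print the same list of kept words, so the two structures are interchangeable for this job.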
And this is Scala’s run result:
$ time scala scala-set.sc
(send,20987)
(message,17516)
(unsubscribe,15541)
(2021,15221)
(list,13017)
(mailing,12402)
(mail,11647)
(file,11133)
(flink,10114)
(email,9919)
(pm,9248)
(group,8865)
(problem,8853)
(code,8659)
(data,8657)
(2020,8398)
(received,8246)
(google,7921)
(discussion,7920)
(jan,7893)
real 0m4.096s
user 0m6.725s
sys 0m0.187s
This is Ruby’s run result:
$ time ruby ruby-hash.rb
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893
real 0m3.062s
user 0m3.028s
sys 0m0.032s
Finally, Perl’s run result:
$ time perl perl-hash.pl
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893
real 0m1.924s
user 0m1.893s
sys 0m0.029s
I ran each of the three scripts many times, and the timings were similar from run to run.
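Repeated runs like these can be automated. The following is a rough Ruby sketch, not the procedure used for the numbers in this post: `time_command` and `median` are hypothetical helper names, and the command being timed is a trivial stand-in so that the sketch runs anywhere.

```ruby
require 'benchmark'

# Run a shell command `runs` times and return the wall-clock seconds of each run.
# `time_command` is a hypothetical helper name, not part of the original scripts.
def time_command(cmd, runs)
  Array.new(runs) do
    # Discard the command's stdout so only the timing summary is printed.
    Benchmark.realtime { system(cmd, out: File::NULL) }
  end
end

# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Stand-in command; swap in e.g. "perl perl-hash.pl" or "ruby ruby-hash.rb".
times = time_command("ruby -e '1 + 1'", 5)
puts format("min %.3fs  median %.3fs  max %.3fs", times.min, median(times), times.max)
```

Reporting the minimum and median rather than a single run makes it easier to see whether one slow run was just noise. Note that this measures wall-clock time only, like the `real` line of `time`.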
Versions of the languages:
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]
$ perl -v
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-gnu-thread-multi
(with 71 registered patches, see perl -V for more detail)
Copyright 1987-2017, Larry Wall
$ scala -version
Scala code runner version 2.13.7 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.
The OS is Ubuntu 18.04 on a KVM VPS. The hardware is 4 GB RAM, a 40 GB SSD, and two AMD 7302 processors.
I am surprised that Perl is this fast among the three languages. I may not have written the best Ruby or Scala program for performance, but this simple test still suggests that Perl has a clear speed advantage for this kind of common text-processing job.
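As an example of what a more idiomatic Ruby version might look like, here is a sketch that uses `Hash.new(0)` for the counts and `max_by` for the top 20. I have not benchmarked it, so it may or may not be faster than the script above. `top_words` is a hypothetical helper name, and the sample lines are made up so the sketch runs without the data files.

```ruby
# Count words that are not stopwords and return the `n` most frequent
# as [word, count] pairs. `top_words` is a hypothetical helper; both
# arguments can be any enumerable of lines.
def top_words(word_lines, stopword_lines, n = 20)
  stopwords = {}
  stopword_lines.each { |s| stopwords[s.strip] = true }

  counts = Hash.new(0)            # missing keys start at 0, so no has_key? branch
  word_lines.each do |line|
    word = line.strip
    counts[word] += 1 unless stopwords.key?(word)
  end

  counts.max_by(n) { |_, c| c }   # top n by count, highest first
end

# With the real data this would be called as:
#   top_words(File.foreach("words.txt"), File.foreach("stopwords.txt"))
sample = ["mail\n", "the\n", "flink\n", "mail\n", "list\n", "mail\n", "flink\n"]
top_words(sample, ["the\n"], 2).each { |w, c| puts "#{w} -> #{c}" }
```

Words with equal counts may come out in a different order than with `sort`, so the tail of the top 20 is not guaranteed to match the original script line for line.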
[updated 2022-01-29] Below is the updated content:
After I compiled the Scala script, the running time became much shorter. So I think the Scala script above was slow mainly because of startup: the script runner has to compile the source on every run before executing it.
The Scala script was changed to this:
import scala.io.Source
object CountWords {
  def main(args: Array[String]):Unit = {
    val li = Source.fromFile("words.txt").getLines()
    val stopwords = Source.fromFile("stopwords.txt").getLines().toSet
    val hash = scala.collection.mutable.Map[String,Int]()
    for (x <- li) {
      if ( ! stopwords.contains(x) ) {
        if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
      }
    }
    hash.toList
      .sortBy(-_._2)
      .take(20)
      .foreach {println}
  }
}
I compiled it with:
$ scalac CountWords.scala
Here is the running speed compared with Perl:
$ time scala CountWords
(send,21919)
(message,19347)
(unsubscribe,16617)
(2021,15344)
(list,14271)
(mailing,13098)
(file,12537)
(mail,12122)
(jan,12070)
(email,10701)
(flink,10249)
(pm,9940)
(code,9562)
(group,9547)
(problem,9536)
(data,9373)
(2022,8932)
(received,8760)
(return,8566)
(discussion,8441)
real 0m2.107s
user 0m2.979s
sys 0m0.142s
$ time perl perl-hash.pl
send -> 21919
message -> 19347
unsubscribe -> 16617
2021 -> 15344
list -> 14271
mailing -> 13098
file -> 12537
mail -> 12122
jan -> 12070
email -> 10701
flink -> 10249
pm -> 9940
code -> 9562
group -> 9547
problem -> 9536
data -> 9373
2022 -> 8932
received -> 8760
return -> 8566
discussion -> 8441
real 0m2.418s
user 0m2.380s
sys 0m0.036s
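Now Perl runs in 2.4s while compiled Scala runs in 2.1s, so compiled Scala is faster. The word counts in this run differ from those in the first run, so the input data evidently changed in between, which probably also explains why Perl takes 2.4s here rather than 1.9s.
For this simple comparison, the final order of running speed, from fastest to slowest, is:
compiled scala > perl > ruby > scala script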