Why mmap has no significant effect in my code

My last post had a small benchmark of Scala, Perl and Ruby.

Many geeks on the lists pointed out problems and gave suggestions for improvement. I really appreciate them.

Why do I want to benchmark them? Because we have a real production system that uses a similar counting technique. In production the input is a stream: every second, new words arrive at the service. The data comes from a message queue such as Kafka, RabbitMQ or similar. We then run real-time computation on Spark, which reads the stream and runs DSL queries for filtering and counting.

Here is similar syntax in Spark:

rdd=spark.readStream.format("socket")... # similar to words.txt
df=spark.createDataFrame(rdd.map(lambda x:(x,1)),["word","count"])
df2=df.filter(~col("word").isin(stopwords)) # filter out stopwords
df2.select("*").groupBy("word").count().orderBy("count",ascending=False).show(20)
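
To make the filter-and-count step concrete without a Spark cluster, here is a minimal sketch in plain Python. It is not the production code: the batches and the stopword set are made-up sample data, and `update_counts` is a hypothetical helper name. It only shows the idea of updating running counts as each batch of words arrives and then listing the top entries.

```python
from collections import Counter

def update_counts(counts, batch, stopwords):
    """Add one batch of words to the running counts, skipping stopwords."""
    counts.update(w for w in batch if w not in stopwords)
    return counts

# Hypothetical data: two small batches standing in for the stream.
stopwords = {"the", "a"}
batches = [
    ["send", "the", "message", "send"],
    ["a", "list", "send", "message"],
]

counts = Counter()
for batch in batches:
    update_counts(counts, batch, stopwords)

# Same idea as groupBy("word").count().orderBy(...).show(20)
for word, n in counts.most_common(20):
    print(f"{word} -> {n}")
```

The file-based scripts below do the same thing, with the whole of words.txt as a single batch.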

Thanks to Paul, who pointed out that I can use mmap() to speed up reading the files. I followed Paul’s code and tested again, but strangely mmap() made no real difference for me. Here is the comparison.

The common-read version for Perl:

use strict;

my %stopwords;

open HD,"stopwords.txt" or die $!;
while(<HD>) {
    chomp;
    $stopwords{$_} =1;
}
close HD;

my %count;

open HD,"words.txt" or die $!;
while(<HD>) {
    chomp;
    unless ( $stopwords{$_} ) {
        $count{$_} ++;
    }
}
close HD;

my $i=0;
for (sort {$count{$b} <=> $count{$a}} keys %count) {
    if ($i < 20) {
        print "$_ -> $count{$_}\n"
    } else {
       last; 
    }
    $i ++;
}

The mmap version for Perl:

use strict;

my %stopwords;

open my $fh, '<:mmap', 'stopwords.txt' or die $!;
while(<$fh>) {
    chomp;
    $stopwords{$_} =1;
}
close $fh;

my %count;

open $fh, '<:mmap', 'words.txt' or die $!;   # reuse $fh instead of redeclaring it with my
while(<$fh>) {
    chomp;
    unless ( $stopwords{$_} ) {
        $count{$_} ++;
    }
}
close $fh;

my $i=0;
for (sort {$count{$b} <=> $count{$a}} keys %count) {
    if ($i < 20) {
        print "$_ -> $count{$_}\n"
    } else {
       last; 
    }
    $i ++;
}

The common-read version for Ruby:

stopwords = {}
File.open('stopwords.txt').each_line do |s|
  s.chomp!
  stopwords[s] = 1
end

count = Hash.new(0)
File.open('words.txt').each_line do |s|
  s.chomp!
  count[s] += 1 unless stopwords[s]
end

count.sort_by{|_,c| -c}.take(20).each do |s|
  puts "#{s[0]} -> #{s[1]}"
end

The mmap version for Ruby:

require 'mmap'

stopwords = {}
mmap_s = Mmap.new('stopwords.txt')
mmap_s.advise(Mmap::MADV_SEQUENTIAL)
mmap_s.each_line do |s|
  s.chomp!
  stopwords[s] = 1
end

count = Hash.new(0)
mmap_c = Mmap.new('words.txt')
mmap_c.advise(Mmap::MADV_SEQUENTIAL)
mmap_c.each_line do |s|
  s.chomp!
  count[s] += 1 unless stopwords[s]
end

count.sort_by{|_,c| -c}.take(20).each do |s|
  puts "#{s[0]} -> #{s[1]}"
end

The Ruby code was optimized by Frank, thanks.

So, this is the comparison for Perl (the first run is the common version, the second is the mmap version):

$ time perl perl-hash.pl 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m2.018s
user	0m2.003s
sys	0m0.012s

$ time perl perl-mmap.pl 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m1.905s
user	0m1.888s
sys	0m0.016s

And this is the comparison for Ruby (the first run is the common version, the second is the mmap version):

$ time ruby ruby-hash.rb 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m2.690s
user	0m2.660s
sys	0m0.028s

$ time ruby ruby-mmap.rb 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m2.695s
user	0m2.689s
sys	0m0.004s

I have run the above comparison many times, and the results were similar. In my case, using mmap() to read the files gives no visible speed improvement.

The OS is Ubuntu 18.04 x86_64, running on a KVM VPS with 4GB of dedicated RAM, two AMD 7302 cores and a 40GB SSD. Ruby is version 2.5.1 and Perl is version 5.26.1.

Perl has built-in mmap support. Ruby’s mmap library was installed this way:

sudo apt install ruby-mmap2

Why does this happen? I will keep looking into the problem.
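
One possible reading of the numbers, offered as a guess rather than a conclusion: in every run above, `user` time is almost all of `real` time, while `sys` time is a few hundredths of a second. If that holds, the scripts spend their time inside the interpreter (splitting lines, hashing, sorting) and hardly any in the kernel reading the file, so there is little for mmap() to save. The sketch below only redoes that arithmetic on the timings copied from the `time` output; `shares` is a hypothetical helper name.

```python
# Timings copied from the `time` output above: (real, user, sys) in seconds.
timings = {
    "perl-hash.pl": (2.018, 2.003, 0.012),
    "perl-mmap.pl": (1.905, 1.888, 0.016),
    "ruby-hash.rb": (2.690, 2.660, 0.028),
    "ruby-mmap.rb": (2.695, 2.689, 0.004),
}

def shares(real, user, sys_):
    """Return (user share, sys share) of wall-clock time, as percentages."""
    return 100 * user / real, 100 * sys_ / real

for name, (real, user, sys_) in timings.items():
    user_pct, sys_pct = shares(real, user, sys_)
    print(f"{name}: user {user_pct:.1f}%  sys {sys_pct:.1f}%")
```

In all four runs the `sys` share stays below 2% of the wall-clock time, which fits the guess that file reading is not the bottleneck for an 11MB input.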

Benchmark for Scala, Ruby and Perl

I know this benchmark may be meaningless, but I would like to give a simple comparison of the run speed of Scala, Ruby and Perl.

To give the results up front: for this job, Perl is the fastest at about 1.9s, Ruby is second at about 3.1s, and the Scala script is the slowest at about 4.1s.

The two input files used by the scripts can be downloaded from here:

words.txt.tgz (11MB)

stopwords.txt.tgz (4KB)

Here is the Scala script:

import scala.io.Source

val li = Source.fromFile("words.txt").getLines()
val set_sw = Source.fromFile("stopwords.txt").getLines().toSet
val hash = scala.collection.mutable.Map[String,Int]()

for (x <- li) {
    if ( ! set_sw.contains(x) ) {
      if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
    }
}

val sorted = hash.toList.sortBy(-_._2)
sorted.take(20).foreach {println}

Here is the Ruby script:

stopwords = {}
File.open("stopwords.txt").each_line do |s|
  s.strip!
  stopwords[s] =1
end

count = {}
File.open("words.txt").each_line do |s|
  s.strip!
  if ! stopwords.has_key?(s)
    if count.has_key?(s) 
       count[s] += 1
    else
       count[s] = 1
    end
  end
end
      
z = count.sort {|a1,a2| a2[1]<=>a1[1]}
z.take(20).each do |s| puts "#{s[0]} -> #{s[1]}" end

Here is the Perl script:

use strict;

my %stopwords;

open HD,"stopwords.txt" or die $!;
while(<HD>) {
    chomp;
    $stopwords{$_} =1;
}
close HD;

my %count;

open HD,"words.txt" or die $!;
while(<HD>) {
    chomp;
    unless ( $stopwords{$_} ) {
        $count{$_} ++;
    }
}
close HD;

my $i=0;
for (sort {$count{$b} <=> $count{$a}} keys %count) {
    if ($i < 20) {
        print "$_ -> $count{$_}\n"
    } else {
       last; 
    }
    $i ++;
}

The basic idea of the scripts above is the same. The difference is that in Scala I use a Set to hold the stopwords, while in Perl and Ruby I use a Hash.

And this is Scala’s run result:

$ time scala scala-set.sc 
(send,20987)
(message,17516)
(unsubscribe,15541)
(2021,15221)
(list,13017)
(mailing,12402)
(mail,11647)
(file,11133)
(flink,10114)
(email,9919)
(pm,9248)
(group,8865)
(problem,8853)
(code,8659)
(data,8657)
(2020,8398)
(received,8246)
(google,7921)
(discussion,7920)
(jan,7893)

real	0m4.096s
user	0m6.725s
sys	0m0.187s

This is Ruby’s run result:

$ time ruby ruby-hash.rb 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m3.062s
user	0m3.028s
sys	0m0.032s

Finally, Perl’s run result:

$ time perl perl-hash.pl 
send -> 20987
message -> 17516
unsubscribe -> 15541
2021 -> 15221
list -> 13017
mailing -> 12402
mail -> 11647
file -> 11133
flink -> 10114
email -> 9919
pm -> 9248
group -> 8865
problem -> 8853
code -> 8659
data -> 8657
2020 -> 8398
received -> 8246
google -> 7921
discussion -> 7920
jan -> 7893

real	0m1.924s
user	0m1.893s
sys	0m0.029s

I have run the above three scripts many times. Their results are similar.
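
Repeating the runs by hand can be replaced with a small harness. The following is only a sketch of one way to do it, not what was used for the numbers in this post: `time_command` is a hypothetical helper, and the placeholder command just starts the Python interpreter so that the sketch runs anywhere. It measures wall-clock time only, like the `real` line of `time`.

```python
import statistics
import subprocess
import sys
import time

def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so printing does not clutter the result.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations

# Placeholder command; swap in e.g. ["perl", "perl-hash.pl"]
# or ["ruby", "ruby-hash.rb"] to time the scripts from this post.
cmd = [sys.executable, "-c", "pass"]
d = time_command(cmd, runs=3)
print(f"min {min(d):.3f}s  median {statistics.median(d):.3f}s  max {max(d):.3f}s")
```

Reporting the minimum and the median together makes it easier to see whether a single slow run is skewing the picture.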

Versions of the languages:

$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]

$ perl -v
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-gnu-thread-multi
(with 71 registered patches, see perl -V for more detail)

Copyright 1987-2017, Larry Wall

$ scala -version
Scala code runner version 2.13.7 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.

The OS is Ubuntu 18.04 on a KVM VPS. The hardware includes 4GB of RAM, a 40GB SSD and two AMD 7302 processors.

I am surprised to see that Perl is this fast among the three languages. I may not have written the best Ruby or Scala program for performance, but this simple test still suggests that Perl has a big performance advantage on common text-parsing jobs.

[updated 2022-01-29] Below is the updated content:

After I compiled the Scala script, the running time became much shorter. So I think the Scala script above was slow mainly because of startup: the script has to be parsed and compiled every time it runs.

The Scala script changed to this:

import scala.io.Source

object CountWords {
  def main(args: Array[String]):Unit = {

    val li = Source.fromFile("words.txt").getLines()
    val stopwords = Source.fromFile("stopwords.txt").getLines().toSet
    val hash = scala.collection.mutable.Map[String,Int]()

    for (x <- li) {
        if ( ! stopwords.contains(x) ) {
            if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
        }
    }

    hash.toList
     .sortBy(-_._2)
     .take(20)
     .foreach {println}
  }
}

And compiled it with:

$ scalac CountWords.scala

Here is the comparison of running speed with Perl:

$ time scala CountWords
(send,21919)
(message,19347)
(unsubscribe,16617)
(2021,15344)
(list,14271)
(mailing,13098)
(file,12537)
(mail,12122)
(jan,12070)
(email,10701)
(flink,10249)
(pm,9940)
(code,9562)
(group,9547)
(problem,9536)
(data,9373)
(2022,8932)
(received,8760)
(return,8566)
(discussion,8441)

real	0m2.107s
user	0m2.979s
sys	0m0.142s

$ time perl perl-hash.pl 
send -> 21919
message -> 19347
unsubscribe -> 16617
2021 -> 15344
list -> 14271
mailing -> 13098
file -> 12537
mail -> 12122
jan -> 12070
email -> 10701
flink -> 10249
pm -> 9940
code -> 9562
group -> 9547
problem -> 9536
data -> 9373
2022 -> 8932
received -> 8760
return -> 8566
discussion -> 8441

real	0m2.418s
user	0m2.380s
sys	0m0.036s

Now Perl runs in 2.4s while compiled Scala runs in 2.1s, so the latter is faster.

For this simple comparison, the final order of running speed (fastest first) is:

compiled scala > perl > ruby > scala script

Install RabbitMQ 3.9 on Ubuntu 16.04

I have an old machine running Ubuntu 16.04, where I plan to install RabbitMQ 3.9.

RabbitMQ (RMQ) needs Erlang to be installed first. For RMQ 3.9 I wanted a recent release, Erlang 24.

I followed this guide on packagecloud:

https://packagecloud.io/rabbitmq/erlang/install

It didn’t work, since packagecloud no longer has the release source for Ubuntu 16.04.

So I googled and found this guide:

How to Install Erlang on Ubuntu 18.04 & 16.04 LTS

which works fine. The steps I took are:

$ wget https://packages.erlang-solutions.com/erlang-solutions_1.0_all.deb
$ sudo dpkg -i erlang-solutions_1.0_all.deb
$ sudo apt update
$ sudo apt install erlang

After these steps, Erlang 24 was installed successfully on my Ubuntu 16.04.

Next, install the latest RabbitMQ. It is quite easy following packagecloud’s guide:

https://packagecloud.io/rabbitmq/rabbitmq-server/install

What I did is:

$ curl -s https://packagecloud.io/install/repositories/rabbitmq/rabbitmq-server/script.deb.sh | sudo bash
$ sudo apt update
$ sudo apt install rabbitmq-server

Then RabbitMQ 3.9 was installed successfully on the system (Ubuntu 16.04 LTS).

Let’s check the software status:

$ sudo rabbitmqctl status
Status of node rabbit@AffectionateDarkmagenta-VM ...
Runtime

OS PID: 2729
OS: Linux
Uptime (seconds): 1945
Is under maintenance?: false
RabbitMQ version: 3.9.11
Node name: rabbit@AffectionateDarkmagenta-VM
Erlang configuration: Erlang/OTP 24 [erts-12.1.5] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:1]
Erlang processes: 266 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 60
......

As you can see, Erlang/OTP 24 and RabbitMQ 3.9.11 are running there. Job done.

Many factors impact the throughput of RabbitMQ

Many factors impact the throughput you can expect:

Network latency and throughput
Erlang 24 provides 30-40% better performance usually
Disks may be the bottleneck
You need sufficient CPU (at least 2, so that the queue process can use one and there is still at least one more for other processes)
Message size of course (500 bytes is perfectly reasonable size for a message but I believe these tests were using smaller size)
The application itself (you can use https://github.com/rabbitmq/rabbitmq-perf-test which is our go-to testing/benchmarking tool)
And then there is still additional tuning that could be performed (Erlang flags, TCP buffers and many other things)

Also, if you need every last bit of the performance you can get from RabbitMQ then you should probably consider some alternative design choices. Options include:

Using multiple queues
Using sharding or consistent hash exchange
Using a stream instead: https://rabbitmq.com/stream.html
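
To illustrate the "sharding or consistent hash exchange" option, here is a minimal sketch of the general consistent-hashing idea in plain Python. It is not the RabbitMQ plugin's code or API: the `HashRing` class, the queue names and the number of points per queue are all assumptions made for illustration. Each queue owns several points on a ring, and a routing key goes to the queue that owns the first point at or after the key's hash.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a stable integer (sha256 is used only to spread keys)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, queues, points_per_queue=100):
        # Each queue gets several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{q}#{i}"), q) for q in queues for i in range(points_per_queue)
        )
        self._keys = [h for h, _ in self._ring]

    def route(self, routing_key: str) -> str:
        """Return the queue owning the first ring point at or after the key's hash."""
        idx = bisect.bisect_left(self._keys, _hash(routing_key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["q1", "q2", "q3"])  # hypothetical queue names
for key in ["order-1", "order-2", "order-3"]:
    print(key, "->", ring.route(key))
```

The point of this design is that messages with the same routing key always land in the same queue, and adding a queue only moves the keys that the new queue takes over, so the load can be spread over several queue processes.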

Why this DMARC pass by google?

I wrote a message to the mailing list (Google Groups) in plain text format. The content was just one sentence, with “thank you” at the end.

Google Groups forwards the message to everybody on the list. When forwarding, Google rewrites the envelope sender (SRS, Sender Rewriting Scheme), so SPF is checked against Google’s domain rather than mine and cannot contribute to DMARC validation at the recipients. For this test I used another Gmail account as the recipient.

Google also modified the message body, adding a signature at the end. See the screenshot below: the green part is my original content and the red part is the signature added by Google. So the DKIM signature should break at the recipients.

What is strange is that Google labeled this message as “DMARC PASS”; see the screenshot below.

My SPF, DKIM, DMARC settings:

linuxdeveloper.xyz. 299 IN TXT "v=spf1 include:spf.migadu.com ?all"

_dmarc.linuxdeveloper.xyz. 299 IN TXT "v=DMARC1; p=none;"

key1._domainkey.linuxdeveloper.xyz. 299 IN CNAME key1.linuxdeveloper.xyz._domainkey.migadu.com.

key2._domainkey.linuxdeveloper.xyz. 299 IN CNAME key2.linuxdeveloper.xyz._domainkey.migadu.com.

key3._domainkey.linuxdeveloper.xyz. 299 IN CNAME key3.linuxdeveloper.xyz._domainkey.migadu.com.

I can’t understand why Google let this message pass DMARC. Neither DKIM nor SPF should contribute to DMARC validation here.
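
For reference, a minimal sketch of the DMARC rule as described in RFC 7489: a message passes if SPF or DKIM passes and the domain that check authenticated aligns with the From: domain. The sketch makes several assumptions. `org_domain` is a naive stand-in (real checkers use the Public Suffix List), only relaxed alignment is modeled, and `googlegroups.com` as the rewritten envelope domain is assumed for illustration, not taken from the actual headers.

```python
def org_domain(domain: str) -> str:
    """Naive organizational domain: the last two labels (assumption, no PSL lookup)."""
    return ".".join(domain.lower().rstrip(".").split(".")[-2:])

def aligned(a: str, b: str) -> bool:
    """Relaxed alignment: both names share the same organizational domain."""
    return org_domain(a) == org_domain(b)

def dmarc_pass(from_domain, spf_ok, envelope_domain, dkim_ok, dkim_domain):
    """DMARC passes if SPF or DKIM passes AND that identifier aligns with From:."""
    spf_aligned = spf_ok and aligned(envelope_domain, from_domain)
    dkim_aligned = dkim_ok and aligned(dkim_domain, from_domain)
    return spf_aligned or dkim_aligned

# The forwarded list copy: SPF passes, but for the rewritten envelope sender
# (assumed domain), and the DKIM signature is broken by the added footer.
print(dmarc_pass("linuxdeveloper.xyz", True, "googlegroups.com",
                 False, "linuxdeveloper.xyz"))   # False

# The original message as first received by the list (assumed values).
print(dmarc_pass("linuxdeveloper.xyz", True, "linuxdeveloper.xyz",
                 True, "linuxdeveloper.xyz"))    # True
```

By this rule alone, the forwarded copy would be expected to fail, which is exactly the puzzle.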

I asked this question on the Postfix mailing list. Thanks to our friend @Raf, who gave a wonderful explanation. I quote his reply below.

Warning: This is just a theory, but it's the only
reasonable one I could think of.

Google is aware of the fragility of SPF/DKIM/DMARC when
it comes to mailing lists, which is why they use ARC:

  Authenticated Received Chain (ARC) Protocol
  https://tools.ietf.org/html/rfc8617 (Experimental)

ARC is a way for remailers to add an authenticated
chain of custody to an email, where they check
SPF/DKIM/DMARC when they receive the original email,
and then attest that each check passed or failed at
that time, and then they provide a DKIM-like signature
to prove that it was really them that made the
attestation.

If you look in the headers of a googlemail email,
you'll see these headers:

  ARC-Seal
  ARC-Message-Signature
  ARC-Authentication-Results

There can be a set of these three headers for every
ARC-enabled remailer along the path. The googlegroups
email that I receive tends to have two sets, both added
as the mail passes between various google servers.

The ARC-Authentication-Results header contains the
SPF/DKIM/DMARC check results for the original mail, and
this gets copied up through the chain. The other two
headers in each set enable the receiver to check the
authenticity of its contents.

Gmail is probably checking the ARC chain and seeing
that DMARC was valid when googlegroups received the
original email, and that's what gmail is reporting to
you as a DMARC pass.

I'm not sure how much ARC is used. From my tiny
personal sample set, it's almost all Google and
Microsoft. So maybe that's a lot. And who checks it?
It's hard to tell. If gmail checks ARC but doesn't
mention it by name, perhaps other mail providers are
doing that too.

There is a milter for it called OpenARC, written by the
same group that wrote OpenDKIM and OpenDMARC, but it
seems to have been abandoned two years ago when it was
still in beta stage. And it doesn't get a mention in
the postfix setup tutorials that I've come across.
I can find people asking how to set it up, but not
so much in the way of satisfactory answers.

Without something like OpenARC, OpenDMARC is going to
produce lots of false positives because it doesn't know
to defer to ARC checking in the presence of ARC headers.

So ARC is probably needed for running mailing lists. I will look into it soon.
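
As a starting point for looking at those headers, here is a small sketch that pulls the recorded results out of the ARC-Authentication-Results headers of a raw message, using Python's standard `email` module. The sample headers are made up for illustration and are not copied from a real message, and `arc_results` is a hypothetical helper name. The sketch only reads what each hop claims; it does not verify the ARC-Seal or ARC-Message-Signature signatures, which a real validator must do.

```python
import email
import re

# Made-up sample message: two ARC sets, newest first, as headers are prepended.
RAW = """\
ARC-Seal: i=2; a=rsa-sha256; cv=pass; d=example.net; s=arc
ARC-Authentication-Results: i=2; mx.example.net; dkim=fail; spf=pass; dmarc=pass
ARC-Seal: i=1; a=rsa-sha256; cv=none; d=example.net; s=arc
ARC-Authentication-Results: i=1; mx.example.net; dkim=pass; spf=pass; dmarc=pass
From: someone@example.org
Subject: test

body
"""

def arc_results(raw_message: str):
    """Return {instance: {method: result}} from ARC-Authentication-Results headers."""
    msg = email.message_from_string(raw_message)
    out = {}
    for value in msg.get_all("ARC-Authentication-Results", []):
        m = re.match(r"\s*i=(\d+)\s*;", value)
        if not m:
            continue  # skip headers without an instance number
        out[int(m.group(1))] = dict(re.findall(r"\b(spf|dkim|dmarc)=(\w+)", value))
    return out

for instance, results in sorted(arc_results(RAW).items()):
    print(instance, results)
```

Instance 1 is the first ARC-enabled hop, so its entry shows how the original message was judged before the list modified it.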

Tech Blog

Stuff about Big data, DevOps & Cloud
