Perl’s regex engine is still very fast, and its running speed is impressive. Scala’s regex works too, but in my test it was about 3 times slower than Perl. This result comes from my own experience.
Here is the use case: read a text file that is several gigabytes in size, filter the lines with a regex, split each remaining line into words, and filter the words with another regex. Finally, print out the words.
This is the Perl script:
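use strict;

# Output file for the accepted words, and the input text file.
open HDW, ">", "words.txt" or die $!;
open HD, "<", "msg.txt" or die $!;

while (<HD>) {
    # Skip lines that start with a character other than a letter, digit, or whitespace.
    next if /^[^0-9a-zA-Z\s]/;
    chomp;
    my @words = split /\s+/, $_;
    for my $w (@words) {
        $w = lc($w);
        # Keep only purely alphanumeric words shorter than 30 characters.
        if ($w =~ /^[a-z0-9]+$/ and length($w) < 30) {
            print HDW $w, "\n";
        }
    }
}

close HD;
close HDW;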
This is the Scala script:
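import scala.io.Source

// Lines starting with a character other than a letter, digit, or whitespace are skipped.
val patt1 = """^[^0-9a-zA-Z\s].*$"""
// A word is kept only if it is purely alphanumeric.
val patt2 = """^[a-z0-9]+$"""

// Note: String.matches compiles the pattern again on every call.
val lines = Source.fromFile("msg.txt").getLines().filter(!_.matches(patt1))
for (x <- lines) {
  x.split("""\s+""").map(_.toLowerCase).filter(_.matches(patt2)).filter(_.size < 30).foreach(println)
}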
Although the Scala script is compiled to a class, its running time is about 3 times that of Perl.
$ scalac -Xscript SplitWords words-parse.scala
$ time scala SplitWords > scala-words.txt
real 0m36.858s
user 0m25.494s
sys 0m13.449s
$ time perl words-parse.pl
real 0m12.115s
user 0m11.770s
sys 0m0.184s
I also noticed a difference: matching in Scala must cover the whole string, while a Perl match can cover only part of it.
For example, this match in Scala returns false:
scala> val str = "hello word"
val str: String = hello word
scala> str.matches("^hello")
val res0: Boolean = false
But in Perl the same match is true:
$ perl -le '$str ="hello word"; print "true" if $str=~ /^hello/'
true
Regardless of language features, it always pays to use the right tool for the job.
[ Update 1 ]
Thanks to the person on the Scala forum who pointed out that I can compile each regex only once. I then improved the program as below:
import scala.io.Source

// Compile each regex once, up front.
val patt1 = """[^0-9a-zA-Z\s].*""".r
val patt2 = """[a-z0-9]+""".r

val lines = Source.fromFile("msg.txt").getLines()
for {
  line <- lines
  if !patt1.matches(line)
  word <- line.split("""\s+""").map(_.toLowerCase)
  if patt2.matches(word) && word.size < 30
} {
  println(word)
}
After re-running, it takes about 6 seconds less than before: about 30 seconds to finish the job. That is still much slower than Perl.
Please note: this updated program works only in Scala 2.13. My Spark application requires Scala 2.12, where it does not work this way.
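For Scala 2.12, one possible workaround is to skip the Scala wrapper and call java.util.regex.Pattern directly, since it is plain Java and does not depend on the Scala version. The following is only a sketch under that assumption, and I have not timed it. The object name SplitWords212, the helper keepWords, and the sample lines are hypothetical names made up for illustration; the two patterns are the same as in the program above.

```scala
import java.util.regex.Pattern

object SplitWords212 {
  // Compile each pattern once. Matcher.matches() requires a full match,
  // which is the same behavior as the anchored matching used above.
  private val skipLine  = Pattern.compile("""[^0-9a-zA-Z\s].*""")
  private val validWord = Pattern.compile("""[a-z0-9]+""")
  private val spaces    = Pattern.compile("""\s+""")

  // Hypothetical helper: returns the words that the pipeline above would print.
  def keepWords(lines: Iterator[String]): Iterator[String] =
    lines
      .filterNot(line => skipLine.matcher(line).matches())
      .flatMap(line => spaces.split(line).iterator)
      .map(_.toLowerCase)
      .filter(w => validWord.matcher(w).matches() && w.length < 30)

  def main(args: Array[String]): Unit = {
    // Sample input instead of msg.txt, so the sketch runs on its own.
    val sample = Iterator("Hello World 42", "# a comment line", "foo-bar baz")
    keepWords(sample).foreach(println)
  }
}
```

To run it on the real file, the sample iterator could be replaced by Source.fromFile("msg.txt").getLines() as in the program above.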
[ Update 2 ]
Scala’s regex is anchored by default, so it requires a full match. To get a partial match as in Perl, you can use this (in Scala 2.13):
scala> val regex = """^hello""".r.unanchored
val regex: scala.util.matching.UnanchoredRegex = ^hello
scala> regex.matches("hello word")
val res0: Boolean = true
As you can see, when declared as unanchored, the regex accepts a partial match.
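A partial match can also be written without the matches method, which may help where Scala 2.13 is not an option. This is a minimal sketch, assuming Regex.findFirstIn and the underlying java.util.regex.Matcher.find, both of which search for the pattern inside the string instead of requiring the whole string to match. The object and method names are hypothetical, made up for illustration.

```scala
object PartialMatch {
  private val hello = """^hello""".r

  // Partial match through the Scala wrapper: Some(...) if the pattern is found.
  def startsWithHello(s: String): Boolean = hello.findFirstIn(s).isDefined

  // The same check through the underlying java.util.regex.Pattern.
  def startsWithHelloJava(s: String): Boolean = hello.pattern.matcher(s).find()

  def main(args: Array[String]): Unit = {
    println(startsWithHello("hello word"))     // true
    println(startsWithHelloJava("hello word")) // true
    println(startsWithHello("say hello"))      // false, because of the ^ anchor
    println("hello word".matches("^hello"))    // false, String.matches needs a full match
  }
}
```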