RubyFlow The Ruby and Rails community linklog

A simple intro to writing an S-Expression parser in Ruby

A short and easy intro to the world of parsers. S-Expressions are amongst the easiest of syntaxes to parse so they are a great candidate if you have never written a parser before. Check out the article.

Comments

“In order to simplify the problem for ourselves the first thing we are going to do is find all of the string literals, copy them into an array, and then replace them with a special placeholder string”

WTF??? Lexers and parsers aren’t like that! That’s super inneficient… sorry, not an article I would recommend.

I’m well aware that this is not an optimally efficient solution, however it is much simpler than writing a parser using only regular expressions, or tackling the same problem using a PEG or similar. (Using a recursive descent parser or packrat parser written in Ruby is also slower than this method)

Do you have a suggestion as to how to improve the code? I’m always willing to listen to feedback!

So, some quick benchmarking against another well-known Ruby S-Expression parser shows that my gem based on the code in this article is roughly 5 times faster!

A gist containing the simple benchmarking code is here.

I’m going to put together an example packrat parser using Treetop over the next few minutes to see how that fares… I’ll post back in a minute.

Hi Aaron!

Sorry, maybe I sounded a bit harsh with that post. It’s just that I love lexers and parsers, haha.

Here’s a gist with my code:

http://gist.github.com/626637

You can see I traverse the string just once: no gsub, no shifts, no replacements, nothing!

Could you benchmark it?

So the Treetop generated packrat-parser is about 13x slower than my version of the parser. I’ll be posting the code up on GitHub shortly…

Hey Ary! No worries! I’m always interested in hearing feedback, particularly from people that know what they’re doing, as I’m relatively new to writing parsers…

The benchmark results are here: Sexpistol 2.260000 0.000000 2.260000 ( 2.256473) SXP 11.130000 0.120000 11.250000 ( 11.239132) Ary 3.320000 0.000000 3.320000 ( 3.325837)

Sexpistol is my gem based off the code in the article. I did notice though that you were creating new instances of the parser and lexer whenever parse() was called, given that the benchmark calls parse() 5000 times I thought that might be slowing down the code, so I changed the test so that Sexpistol created a new parser instance for each iteration as well:

Sexpistol 2.290000 0.010000 2.300000 ( 2.298358) SXP 11.180000 0.100000 11.280000 ( 11.267062) Ary 3.320000 0.010000 3.330000 ( 3.317399)

I’m reading through the code now to make sure I understand it…

Here’s the benchmark code: http://gist.github.com/626681

So I think this is kind of an unfair comparison. Your LR parser is probably more flexible than my version, but I’m able to get away with what I’ve done because S-Expressions are so simple.

I also get a speed advantage because I’m using Ruby’s built-in functions like split() and regular expressions which are highly optimized. I’ve also had time to do profiling and some minor optimizations on my version whereas you have not…

I really like your version! It’s very straight-forward and I’m having fun understanding it!

How does this compare to the S-Expr parser in Ripper (now part of ruby’s standard library)?

Yes! I also did the benchmark and was surprised my version was slower than yours. I might need to profile the code, maybe the @str[@p].chr is taking a lot of time, I don’t know.

@Eric: From what I can see Ripper is a parser that takes Ruby source and turns it into an S-Expression representation, not a parser for S-Expressions… Are you able to shed any light on that? Is there some hidden method in there for parsing sexps?

@Ary: I had a quick look over it to see what hunches I had. My thought was the range comparisons… I modified the code to use regexes instead where possible, this saved around 0.2s, then I removed the @strp[@p].chr because I’m running Ruby 1.9 and you get strings anyway, this saved 0.4s

Current benchmark:

Sexpistol 2.270000 0.000000 2.270000 ( 2.273966) SXP 11.170000 0.110000 11.280000 ( 11.258694) Ary 2.620000 0.010000 2.630000 ( 2.624496)

The modified code is up at: http://gist.github.com/626825

Gents Ruby’s StringScanner (strscan) seems made for this sort of stuff.

I have never heard of S-Expressions so I may have missed operators or whatever but I love a good benchmark.

The quick Lexer replacement I wrote for Ary’s code is ~2x as fast: Rehearsal --------------------------------------------- Sexpistol 0.700000 0.020000 0.720000 ( 1.479529) SXP 3.710000 0.190000 3.900000 ( 8.240412) Shanna 0.340000 0.030000 0.370000 ( 0.754847) ------------------------------------ total: 4.990000sec

            user     system      total        real Sexpistol   0.520000   0.000000   0.520000 (  1.454859) SXP         4.250000   0.240000   4.490000 (  8.149670) Shanna      0.390000   0.010000   0.400000 (  0.746545) </code>

If you need to support exponents etc the regexp to match numbers will need to be improved but you get the idea :)

Would probably help if I included the code. Sorry about that.

New winner!

That’s awesome Shanna! I peripherally knew about StringScanner, but I had never actually used it…

Sounds like my Sexpistol gem will be getting a make-over in the near future!

Thanks!

Awesome Shanna!! I totally forgot about StringScanner! I used it once a long time ago… :-P

Hey Shanna, what if I my language has keywords, say “and” and “or”, and I want to scan those? If I do scan /and/ then it works but it also matches the word “andy”. So I can do scan /and\W/, but then if I have “and+” I miss the “+” (which I want to return later as a token). Any idea of how to go back before the “+”?

Oh, nevermind, I think \b is what I was looking for…

Just to wrap this up, the final benchmarks with a re-written version of Sexpistol look like this:

Sexpistol 1.230000 0.000000 1.230000 ( 1.233096) SXP 11.130000 0.110000 11.240000 ( 11.260526) Shannah 0.960000 0.010000 0.970000 ( 0.967238)

Sexpistol is slower than Shannah’s code purely because it has to deal with more types of entities (Float, comma quoting etc…).

The final code is available at GitHub

This was fun! Thank you very much for this discussion. I’ve updated my Google Query Language parser with what I’ve learned here, and afterwards did some benchmarks and it’s working much faster now. :-)

http://code.google.com/p/rgviz/source/browse/lib/rgviz/lexer.rb

Post a comment

You can use basic HTML markup (e.g. <a>) or Markdown.

As you are not logged in, you will be
directed via GitHub to signup or sign in