Does Ruby's support for XML suck? I need your input.
Robert Fischer posts about his issues with the lack of commitment for XML on Ruby. He doesn’t like REXML too much either.
I want to write something on Ruby Inside to encourage people to get involved with implementing XML libraries (or improving those we already have). I'm no XML expert, however, so I need your input to put out the call. Feel free to comment here on RubyFlow and let’s get a discussion going.
Comments
The problem with XML parsing is that it’s deceptive. People think of it as just angle brackets, and so they start throwing regular expressions at it, but the reality is that it is a lot more tricky than that. Declarations, entities, namespaces, XSLT, and XML Schema are all major technologies that are mandatory for an actually compliant XML parser. And all of these techniques are used even in such common places as RSS feed parsing and web services.
You cannot be playing in the real world – the world of web services and RSS feeds and interoperability – until you have the XML parsing problem solved.
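The point about regular expressions can be illustrated with a minimal sketch. It assumes REXML is available (it ships with Ruby, as a bundled gem in recent versions); the sample feed and the variable names are made up for illustration.

```ruby
require 'rexml/document'

# A tiny document that uses an entity and a CDATA section.
xml = <<~XML
  <?xml version="1.0"?>
  <feed>
    <title>Tom &amp; Jerry</title>
    <title><![CDATA[1 < 2]]></title>
  </feed>
XML

# Naive regex approach: returns the raw markup between the tags,
# so the entity stays encoded and the CDATA wrapper leaks through.
regex_titles = xml.scan(%r{<title>(.*?)</title>}).flatten

# Parser approach: the entity is decoded and the CDATA is unwrapped.
doc = REXML::Document.new(xml)
parsed_titles = REXML::XPath.match(doc, '/feed/title').map(&:text)

p regex_titles   # ["Tom &amp; Jerry", "<![CDATA[1 < 2]]>"]
p parsed_titles  # ["Tom & Jerry", "1 < 2"]
```

The regex only gets worse from here: attributes on the tag, namespace prefixes, or a title split across lines would each need another special case.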
Worse, Ruby on Rails has a competitor – Groovy on Grails. And Groovy has a very nice, very succinct, very compliant XML parser called “XmlSlurper”. So if Ruby on Rails doesn’t want to lose people to Groovy on Grails, they’re going to have to step it up and at least keep pace.
I’m currently working on a pure Ruby implementation of SAML and as part of that I need support for XML Encryption and XML Signatures. Those in turn require XML canonicalization, which is not something that exists right now in REXML (as far as I can tell). In short, I tend to agree with Robert Fischer about the lack of commitment to XML in Ruby and I for one would like to see it improve (and am doing my part through open source implementations of the items I described above).
There was a conversation here last month on XML Parsing Benchmarks. Personally I’ve only ever used Hpricot for XML parsing and I have never needed to look for another library.
libxml is obviously the most efficient parser, but it is not very Ruby-ish. I would love to see something that could marry the speed of libxml with a pretty API like REXML and Hpricot use.
On the other hand, Rubyists rarely have to use XML except to talk to web-services written in other languages. YAML and JSON provide much more efficient and simple solutions for data interchange.
I would agree that Ruby's support for XML is kind of half-assed. It was easier to resort to an external non-Ruby package for XML validation than to get it to work reliably in Ruby.
Personally, I use Hpricot to parse XML and it works wonderfully. Some of the things mentioned, like attributes, are supported by default since they are also present in HTML. I’m not an XML expert though, and I’ve never had to parse anything shockingly difficult, but for example, parsing Google Reader’s XML output with Hpricot worked perfectly.
I’ve been playing with the Amazon AWS API. It’s all XML, and not the nice kind; it’s the kind that gives you headaches. *wishes they would just give me JSON.
Thanks for all your input so far; this is really useful! As I said before, I don’t really understand the XML space, but it’s certainly important and I agree that more needs to be done in this area with Ruby. Hopefully a discussion on Ruby Inside may provoke some further interest (and work) in this area, even if it’s just more effort into making Hpricot support some of the more arcane features of XML (or, dare I say, forking Hpricot into an XML specific variant).
I’ve been using libxml with Ruby bindings for a largish DocBook project, and it’s been working fine. Hpricot is fun for smaller stuff, but it doesn’t handle XInclude.
On the XML generation side, Builder::XmlMarkup rocks.
yeah, so hpricot is just about amazing. and that’s all one really needs to know in the ruby vs. xml battle. oh, it parses xmls retarded little brother, html, quite nicely as well
Tim Bray himself has blogged about XML parsing in Ruby. Maybe^WCertainly worth a look.
Cheers, —Torsten
Here’s yet another Hpricot trick.
Perhaps there are two pieces in this. One about XML and Ruby, and one about Hpricot tricks and tips! :)
Hpricot is not an XML parser, and probably shouldn’t attempt to be one – its forte is parsing (possibly malformed) HTML well, and it should stick to that simple, noble goal.
For example, Hpricot doesn’t do namespaces very well, can only parse documents/fragments in their entirety (no pull parser for streamed things like XMPP), and can only search by CSS selector – not XPath, as you would expect.
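For comparison, here is a rough sketch of namespace-aware XPath and pull parsing using REXML rather than Hpricot. It assumes REXML is installed; the sample documents, the `atom`/`dc` prefix choices, and the variable names are invented for illustration.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# 1. Namespace-aware XPath: map prefixes of our choosing to namespace URIs.
feed = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <entry><title>First</title><dc:creator>alice</dc:creator></entry>
    <entry><title>Second</title><dc:creator>bob</dc:creator></entry>
  </feed>
XML

doc = REXML::Document.new(feed)
ns = {
  'atom' => 'http://www.w3.org/2005/Atom',
  'dc'   => 'http://purl.org/dc/elements/1.1/'
}
titles   = REXML::XPath.match(doc, '/atom:feed/atom:entry/atom:title', ns).map(&:text)
creators = REXML::XPath.match(doc, '//dc:creator', ns).map(&:text)

# 2. Pull parsing: consume events one at a time instead of building a tree.
stream = '<stream><message to="a">hi</message><message to="b">yo</message></stream>'
parser = REXML::Parsers::PullParser.new(stream)
recipients = []
while parser.has_next?
  event = parser.pull
  # For a start-element event, event[0] is the tag name and event[1] the attribute hash.
  recipients << event[1]['to'] if event.start_element? && event[0] == 'message'
end
```

In a real streaming case such as XMPP the parser would be fed from a socket rather than a string; the string here just keeps the sketch self-contained.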
While libxml is probably the most proven XML library in existence — and plenty fast as well — the libxml-ruby bindings are awful and deserve to be taken out back and shot.
As an alternative, FastXML, a new set of libxml bindings inspired by the Hpricot API, is taking shape, but it is still too limited for most applications (for example, there’s no way to programmatically traverse a document, e.g. for printing).
I think the lack of commitment to XML in the Ruby world is because of XML’s own lack of commitment to brevity and pragmatism. :) I don’t think Rubyists and XMLists share much of a Venn diagram.
I have had some good XML-related contribs to Hpricot lately and I hope to roll them up soon. But I definitely don’t think Hpricot will serve the XML guys who need validating schemas and nice entity support and caninonicomicibalization and the like. It’ll always be only for scripting here-to-there in 60 seconds or less. I guess we’ll see.
@Alexander Hpricot does have XPath as well as CSS selector support.
@MetaSkills: Actually it has a small and largely nonfunctional subset. Try any of these to see what I mean:
/foo/text()
/foo/node()
/foo[@a = 'a' and @b = 'b']
/foo[position() = 1]
/foo::text()
/foo/(a|b)/bar
/foo[attribute:a = 'a']
Hpricot also lacks support for namespaces, last I was aware. It is really a permissive parser for HTML, not XML.
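As a point of reference, a few of the expressions above can be tried against REXML, which aims at XPath 1.0. This is only a sketch on a made-up one-line document, assuming REXML is installed; it covers only the expressions that are valid XPath 1.0 as written, and it says nothing about how Hpricot handles them.

```ruby
require 'rexml/document'

# Hypothetical sample: a root <foo> with two attributes, two text nodes and one child.
doc = REXML::Document.new('<foo a="a" b="b">hello<bar/>world</foo>')

# /foo/text() selects the text node children of the root element.
text_nodes = REXML::XPath.match(doc, '/foo/text()').map(&:value)

# /foo/node() selects every child node: text, element, text.
all_nodes = REXML::XPath.match(doc, '/foo/node()')

# A predicate combining two attribute tests with "and".
by_attrs = REXML::XPath.match(doc, "/foo[@a = 'a' and @b = 'b']")

# A positional predicate.
first_foo = REXML::XPath.match(doc, '/foo[position() = 1]')

p text_nodes
p all_nodes.size
p by_attrs.map(&:name)
p first_foo.map(&:name)
```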
libxml has great functionality. I use it in Python via the lxml bindings. But it took about three or four attempts to make good bindings before one really took off, and it also required someone to champion that library (over the course of years). The naive libxml-python bindings that came with libxml were pretty much unusable. If you want to shoot high in terms of XML support, a good library wrapping libxml would be it.
But if you want to get something that just works, I’d port ElementTree from Python. Its functionality isn’t as complete as that of some other libraries, but it’s a fairly conservative and reliable design, leaves out most of the API cruft of other libraries, and can serve as a good basis for further development.
I offer these suggestions in part because the current XML angst in the Ruby world reminds me a lot of the past angst about XML in the Python world, where we had similar problems for a long time (including an unmaintained and unappealing package in the standard library). Maybe you could shortcut a year or two of the futzing around that we’ve done.
When I last tried (November 2007) to use XPath with Hpricot to extract data out of valid XML content, I hit a brick wall due to incomplete XPath support. REXML ended up doing what I needed in terms of just working. That is, I got what I wanted the way I expected to be able to ask for it, but it was a bit slow due to the size of the documents I was feeding it. This was a naive implementation (aka, a hack) though, and it ended up not mattering, so I didn’t go forward.
I would argue against extending Hpricot, though it would have been nice to have full XPath support in it at the time.