A couple of remarks about XBMC scraper development

I’m a big fan of XBMC. One of its most important features is gathering metadata from various websites (a process called scraping), depending on the type of content (e.g. IMDB for movies, TheTVDB for TV shows etc.). I’ve been using smuto’s Filmweb scraper for scraping movie information for a while now without any major problems. Yet recently Filmweb changed their layout (again…), rendering the scraper useless. I got tired of waiting for an updated version of smuto’s scraper and started developing my own from scratch. It turned out not to be as easy as I had hoped.

To develop something more than a really trivial scraper, a scraper editor is essential for two purposes:

  1. To present the scraper as a collapsible tree of regular expressions. This gets more and more important as your scraper evolves into a complex mess of XML tags, regular expressions, buffers and functions. A visual representation is a lot easier to grasp than a raw XML file.
  2. To convert all the special characters into SGML character references. The heart of every scraper is an XML file used for generating… XML (“yo dawg, I heard you like XML…”). Because of that, there’s a lot of these conversions going on, which can quickly obfuscate the reasoning behind every line. Need an example? Here’s one:
    &lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;year&gt;\3&lt;/year&gt;&lt;url&gt;http://www.filmweb.pl\1&lt;/url&gt;&lt;/entity&gt;
    

    An editor allows you to concentrate on the important stuff and forget about the need to properly encode every single returned XML tag.
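As a rough illustration (my own sketch, not part of any scraper tool), the encoding a good editor automates can be reproduced with Python’s standard library:

```python
from xml.sax.saxutils import escape

# The XML fragment the scraper author actually wants to emit
# (mirroring the Filmweb <entity> line above; \1, \2, \3 are
# regex backreferences, written here as literal backslash sequences):
fragment = ('<entity><title>\\2</title><year>\\3</year>'
            '<url>http://www.filmweb.pl\\1</url></entity>')

# Stored inside the scraper's own XML file, every special character
# has to become an SGML character reference:
print(escape(fragment))
# &lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;...
```

An editor does this translation both ways for you, so you only ever look at the readable form.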

Unfortunately, there are two types of tools for scraper development at the moment: the ones that are discontinued and the ones with low code maturity. There is some work in progress – a tool called Scraper Editor is being developed. Yet any good editor should give the developer the possibility to test the contraption he’s created and this is where Scraper Editor fails for me at the moment, spitting out the following error when trying to test the scraper for any title:

[ERROR] java.lang.ClassCastException: hu.yvs.xbmc.xml.addon.scraper.Function cannot be cast to javax.xml.bind.JAXBElement

I decided to use the old ScraperXML tool (which was apparently discontinued back in November 2010) for the purpose of creating my new scraper. ScraperXML also has some severe drawbacks:

  • it uses a hardcoded MSIE 6.0 HTTP user agent – if the scraped website doesn’t support old IE versions (like Filmweb), you’re done (unless you’re willing to set up a web proxy capable of rewriting HTTP headers),
  • it doesn’t support cookies – if the scraped website relies on a cookie to allow the visitor past an initial ad page (like Filmweb), you’re done (unless you’re willing to set up a web proxy capable of storing cookies between requests – XBMC does this automatically for all libcurl requests),
  • it doesn’t support the fixchars attribute – the attribute used for converting SGML character references from scraped website excerpts into proper characters inside XBMC; even if you add the fixchars attribute manually to the XML file and then edit that file using ScraperXML, all occurrences of this attribute will be removed,
  • regional settings awareness is buggy – if you try to open a valid scraper XML file containing the string <scraper framework="1.1" on a system with regional settings specifying the comma as the decimal mark, all you'll get is an error message stating "Input string was in incorrect format.", without any information regarding where exactly the parser failed; you won't be able to open the scraper at all; I'm guessing this is .NET-related and the author simply didn't check the program on many different machines,
  • it occasionally enters an endless loop when testing complex scrapers using URL functions – I wasn't able to pinpoint the exact reason for this, but in some cases the testing module can simply start calling the same function over and over (which doesn't happen in XBMC using the same XML file).
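The first two limitations can at least be worked around when testing regular expressions outside ScraperXML. Here is a minimal sketch (my own test-harness assumption, not part of either tool) that fetches a page roughly the way XBMC’s libcurl does – with a modern user agent and cookies persisted between requests:

```python
import urllib.request
from http.cookiejar import CookieJar

# Cookie jar shared across requests, so a cookie set by an initial
# ad page (as on Filmweb) is sent back on the next request:
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Present a modern browser instead of ScraperXML's hardcoded MSIE 6.0:
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1')]

def fetch(url):
    """Return the page body as text; cookies from earlier responses
    are attached automatically by the HTTPCookieProcessor."""
    with opener.open(url) as response:
        return response.read().decode('utf-8', errors='replace')
```

With the HTML in hand, you can test your scraper’s regular expressions directly before pasting them into the XML file.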

From my standpoint, developing scrapers is currently harder than it should be. Ideally, basic knowledge of HTTP and regular expressions should be all it takes to develop a scraper. I guess there isn't much demand for scraper editors, as most XBMC end users will simply use whatever is available in the official XBMC add-on repository instead of doing some regular expressions/XML voodoo just to have a movie title displayed in their native language. It's a shame, because once you get past the initial obstacles, scraper development becomes good fun – the engine itself is pretty powerful and not as complex as a glance at the raw XML file might suggest. I did my best to comment my own Filmweb scraper thoroughly, so check it out if you're going to develop your own scraper – it might be helpful for beginners.
