October 8th, 2009 | Web | 4 Comments

Scrape headings into RSS with Yahoo Query Language

I was trying some stuff in Yahoo Pipes and noticed the YQL module, which I hadn’t used before. YQL stands for Yahoo Query Language and turned out to be a very interesting project. It provides unified access to online services and pages through an SQL-like syntax.

Since scraping and processing generic HTML is one of Pipes’ weak spots, I decided to try it through YQL. This example turns the headings from any page into an RSS feed.

Theory

YQL natively supports access to HTML, and the result can be narrowed down with XPath, a language used to navigate through XML documents.

I am not proficient with XPath and had trouble making it work both for plain headings and for headings with a link inside (typical for post titles on blogs and such). With the help of Paul Donnelly on Twitter I ended up with the following query:

select content from html where url='http://www.rarst.net'
and (xpath='//h2' or xpath='//h2/a')

It looks for headings, or headings with a link inside, and returns the content from them. There are some quirks, but it roughly works as I wanted it to.
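To see why two XPath expressions are useful here, the following is a rough local sketch in Python rather than YQL. It assumes well-formed sample markup (made up for illustration) and uses the limited XPath subset in the standard library's ElementTree, so it only approximates how YQL treats a real page. The helper name `own_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real page).
SAMPLE = """<html><body>
<h2>Plain heading</h2>
<h2><a href="/post-1">Linked heading</a></h2>
<p>Not a heading</p>
</body></html>"""

root = ET.fromstring(SAMPLE)


def own_text(elements):
    """Return the direct, non-empty text of each element, ignoring children."""
    return [el.text.strip() for el in elements if el.text and el.text.strip()]


# './/h2' matches both headings, but the second one has no direct text:
# its text belongs to the nested <a>.
plain = own_text(root.findall(".//h2"))

# './/h2/a' picks up the link text that the first expression misses.
linked = own_text(root.findall(".//h2/a"))

print(plain + linked)
```

In this sketch the first expression yields only the plain heading and the second yields only the linked one, which is roughly the gap the combined query covers.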

Making the pipe

I wanted the pipe to be a generic solution, so the site and the tag should come from the user.

[Screenshot: yql_html_scrape, the pipe layout]

Two inputs are mashed into a string, and that string is the query itself. This is many times easier than trying to get stuff out of tag soup the usual way. After that there are only a few boring cleanups to format the results into RSS tags.
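Outside of Pipes, the same two steps could be sketched in Python as below. This is only an illustration under assumptions: the function names `build_query` and `headings_to_rss` are made up, the heading strings are placeholder data, and the RSS layout is a minimal guess rather than what the actual pipe emits. Nothing here contacts YQL; it only builds the query text and wraps already-extracted headings.

```python
import xml.etree.ElementTree as ET


def build_query(site, tag):
    """Mash the two user inputs into the YQL statement shown earlier."""
    # Assumption: the tag is a plain element name such as h2 or h3.
    if not tag.isalnum():
        raise ValueError("tag should be a plain element name such as h2")
    # Keep a stray quote in the URL from breaking out of the string literal.
    safe_site = site.replace("'", "%27")
    return (
        f"select content from html where url='{safe_site}' "
        f"and (xpath='//{tag}' or xpath='//{tag}/a')"
    )


def headings_to_rss(site, headings):
    """Wrap heading strings in a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Headings from {site}"
    ET.SubElement(channel, "link").text = site
    ET.SubElement(channel, "description").text = "Scraped headings"
    for heading in headings:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = heading
    return ET.tostring(rss, encoding="unicode")


print(build_query("http://www.rarst.net", "h2"))
# Placeholder headings stand in for whatever the query would return.
print(headings_to_rss("http://www.rarst.net", ["First post", "Second post"]))
```

Building the feed with ElementTree instead of string concatenation means characters such as `&` or `<` in a heading get escaped automatically.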

Overall

YQL is an awesome addition to Pipes and works on its own as well. It may require serious time with the related documentation, but usage is easy and compact, with vast potential.

Pipe http://pipes.yahoo.com/rarst/scrapetag

YQL Home http://developer.yahoo.com/yql/


4 Responses to “Scrape headings into RSS with Yahoo Query Language”

  1. I had never heard about it. Being a big fan of SQL, I will surely give YQL a shot someday. Thanks buddy.

  2. Rarst says:

    @Ishan

    I am not much of SQL fan. It tries to be flexible but ends up somewhat rigid. Maybe I just lack practice.

    For me, the value of YQL isn’t in picking SQL syntax (it could be anything else), but in the whole concept of unified access to very different online data. And it plays well with Pipes.

    Yahoo toys for developers seem easier to use than those of Google.

  3. Richard.Williams says:

    I have been using biterscripting ( http://www.biterscripting.com ) for parsing RSS. It is excellent at parsing flat files, HTML, RSS and other sources. It’s fairly efficient, and when you stick with one tool, you don’t have to learn the syntax of every new tool. I will give YQL a try. (I am a Java programmer and know DB/SQL fairly well.)

    Richard

  4. Rarst says:

    @Richard

    Thanks for the suggestion! I like plain text files and formats a lot, so it sounds like an interesting tool. Bookmarked to check it out.
