Scraping JavaScript sites with morph.io

Just a quick post to let you know that it’s now possible to scrape JavaScript-heavy sites easily using our scraping platform, morph.io.

This is really useful for Microsoft .NET websites, which often keep complicated state in JavaScript and simulate links with JavaScript-driven form posts.

Also, we recently discovered another, more worrying example. The main website of the NSW Electoral Commission, which oversees state elections in NSW, is “protected” by anti-scraping technology that stops you from downloading the contents of a web page without JavaScript. This is clearly terrible for accessibility and, in our case, for getting access to basic electoral information that is not available by any means other than scraping.

Thankfully…

PhantomJS

PhantomJS is now installed for everyone using the experimental buildpack support.

PhantomJS is essentially a headless browser that you can control from your scraper, either directly in JavaScript or via wrapper libraries available for most major languages.
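To give a rough idea of what that can look like, here is a minimal sketch of a Python scraper that hands a URL to PhantomJS and reads back the rendered HTML. It assumes the `phantomjs` binary is on your PATH, and it drives PhantomJS with a small script rather than a wrapper library. The names `PHANTOM_SCRIPT`, `build_command` and `render_page`, and the example URL, are ours for illustration only. Note that the page content is read as soon as the page reports it has loaded, so a site that keeps filling in content afterwards may need an extra wait.

```python
import os
import subprocess
import tempfile

# A small PhantomJS script: open the URL given as the first argument,
# print the rendered HTML to stdout, then exit.
PHANTOM_SCRIPT = """
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
  if (status !== 'success') {
    phantom.exit(1);
  } else {
    console.log(page.content);
    phantom.exit(0);
  }
});
"""


def build_command(script_path, url, binary="phantomjs"):
    """Return the argument list used to run the PhantomJS script on a URL."""
    return [binary, script_path, url]


def render_page(url, binary="phantomjs", timeout=60):
    """Return the HTML of a page after PhantomJS has loaded it.

    Assumes the phantomjs binary is installed and on the PATH.
    Raises subprocess.CalledProcessError if the page fails to load.
    """
    # Write the script to a temporary file so PhantomJS can run it.
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(PHANTOM_SCRIPT)
        script_path = f.name
    try:
        result = subprocess.run(
            build_command(script_path, url, binary),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
        return result.stdout
    finally:
        os.remove(script_path)


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    html = render_page("http://example.com/")
    print(len(html))
```

From there you can pass the returned HTML to whatever parser your scraper already uses.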

If you want to use PhantomJS but are not yet using the buildpack support, take this as a little extra incentive to move over to it. All you need to do is ask us to enable it for you, letting us know which user or organisation you would like it for.

