This post is part of a series of posts that provide step-by-step instructions on how to write a simple web scraper using Ruby on morph.io. If you find any problems, let us know in the comments so we can improve these tutorials.
In the last post we dealt with the site’s pagination and started scraping a complete dataset. In this final post we work out how to save our data and publish our scraper to morph.io.
Scrapers on morph.io use the handy ScraperWiki library to save data to an SQLite database. This is how all data in morph.io is stored. Each scraper page provides options to download the SQLite database, a CSV file of each table, or access the data via an API.
You might remember seeing the ScraperWiki library listed as a dependency in your Gemfile earlier:
ruby "2.0.0"
gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
gem "mechanize"
To use this library in your scraper, you need to declare that it is required at the top of your scraper.rb, in the same way you have for the Mechanize library:
require 'mechanize'
require 'scraperwiki'
You can save data using the ScraperWiki.save_sqlite() method. This method takes care of the messy business of creating a database and handling duplication for you. There are two arguments you need to pass it: an array of the record’s unique keys, so it knows whether to update an existing record or create a new one, and the data that you want to save.
A member’s full name is unique to them, so you can use that as your unique key (we’ve called the field “title”). The data you want to save is your member object. A good place to save your data is just after your p member statement.
p member
ScraperWiki.save_sqlite([:title], member)
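Because title is the unique key, running your scraper again won’t create duplicate rows: saving a record whose title already exists in the database updates that row instead. Here’s a minimal sketch to illustrate (the member name and values are made up):
require 'scraperwiki'

# The first call creates the row; the second, with the same :title,
# updates it rather than adding a duplicate, because :title is the unique key.
ScraperWiki.save_sqlite([:title], { title: "Example Member", party: "Example Party" })
ScraperWiki.save_sqlite([:title], { title: "Example Member", party: "Another Party" })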
Your scraper.rb should now look like this:
require 'mechanize'
require 'scraperwiki'

agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'

["1", "2", "3"].each do |page_number|
  page = agent.get(url + "?page=" + page_number)
  page.at('.search-filter-results').search('li').each do |li|
    member = {
      title: li.at('.title').inner_text.strip,
      electorate: li.search('dd')[0].inner_text,
      party: li.search('dd')[1].inner_text,
      url: li.at('.title a').attr('href')
    }
    p member
    ScraperWiki.save_sqlite([:title], member)
  end
end
Save and run your file. The command line output should be unchanged, but if you view the files in your project directory you will see a new file, data.sqlite.
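If you want to peek inside that file, the scraperwiki gem also has a ScraperWiki.select helper that queries the local database. Here’s a quick sketch, assuming the default table name of data that save_sqlite uses when you don’t specify one:
require 'scraperwiki'

# Print a few rows from the "data" table in data.sqlite to check what was saved.
ScraperWiki.select("* from data limit 3").each do |row|
  p row
end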
Great job. You’ve now written a scraper to collect data and save it to a database. It’s time to put your new scraper code on morph.io so you can show the world how cool you are—and so it can take care of running the thing, storing your data, and providing you easy access to it.
Running your scraper on morph.io
morph.io runs scraper code that is stored in public GitHub repositories. To run your scraper on morph.io, you’ll first have to push it back up to the GitHub repository you originally cloned it from.
Start off with another git commit to save any outstanding changes.
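If you need a reminder, the commands look something like this (the file name and commit message are just examples):
> git add scraper.rb
> git commit -m "Save member data with ScraperWiki"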
Push your changes up to your remote GitHub repository with:
> git push origin master
Now go view your scraper’s page on GitHub (the URL will be something like github.com/yourusername/the_name_of_this_scraper). Navigate to your scraper.rb file on GitHub and check that it has all your local changes.
You can now go over to your scraper’s page on morph.io and click the “Run scraper” button near the top of the page. The moment of truth is upon us.
As your scraper runs, you will see your console output printing the data for the members you are scraping. A few seconds later, underneath the heading “Data”, you’ll find a table showing a representative ten rows of data and buttons to download your data in a range of formats.
Take a moment to explore the download options and check that the data looks as you expected.
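One of those options is the scraper’s API, which lets other programs fetch your data with an SQL query. Below is a sketch of what that could look like from Ruby; the real URL and API key are shown on your scraper’s morph.io page, so treat the owner, scraper name and key here as placeholders:
require 'open-uri'
require 'json'

# Illustrative only: copy the actual API URL and your key from your scraper's page.
query = URI.encode_www_form_component("select * from data limit 10")
url = "https://api.morph.io/yourusername/the_name_of_this_scraper/data.json" +
      "?key=YOUR_API_KEY&query=" + query
JSON.parse(open(url).read).each do |row|
  p row
end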
That’s all folks
Well done my friend, you’ve just written a web scraper.
With just a few lines of code you’ve collected information from a website and saved it in a structured format you can play with. You’ve published your work for all to see on morph.io and set it to run, store and provide access to your data.
If you want to get really fancy, you can set your scraper to run automatically every day on your scraper’s settings page, so it stays up to date with any changes to the members list.
Before you go mad with power, go and explore some of the scrapers on morph.io. Try searching for topics you find interesting and domains you know. Get ideas for what to scrape next and learn from other people’s scraper code.
Remember to post questions to the help forums if you get blocked by tricky problems.
If you have any feedback on this tutorial we’d love to hear it.
Now go forth with your new powers and scrape all the things!
2 Comments
Mechanize is a great library but I don’t see how morph.io or scraperwiki really adds anything to the conversation. I recommend removing those dependencies. It’s nothing you can’t roll your own of in a few minutes.
Hey PG, thanks for your comment.
You’re right, you can use the techniques in this tutorial without morph.io or the scraperwiki library and save your data in some other way.
For lots of people, though, setting up and maintaining systems to regularly run scrapers is a drag. The idea with morph.io is that it runs your scraper for you and gives you easy access to the data through an API or download. This tutorial is specifically framed around using morph.io.
We use morph.io with our project PlanningAlerts to keep data collection separate from the main app and to allow collaboration and contributions to our scrapers.
The ScraperWiki library is just a simple way to save the data in a way that morph.io will recognise, but you could alternatively use anything that will create and manage records in an SQLite database for you. As you point out, you could save data in other ways too and even roll your own if you’re able, but this library just makes things really easy for people getting started with scraping on morph.io.
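For anyone who does want to roll their own, a rough sketch with the sqlite3 gem might look like the following (the table and column names are just examples; morph.io expects the data in a file called data.sqlite):
require 'sqlite3'

# Create the database and table if they don't exist yet, then upsert a row.
db = SQLite3::Database.new("data.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS data (title TEXT PRIMARY KEY, electorate TEXT, party TEXT, url TEXT)")
db.execute("INSERT OR REPLACE INTO data (title, electorate, party, url) VALUES (?, ?, ?, ?)",
           ["Example Member", "Example Electorate", "Example Party", "https://example.com"])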
Does that explain why we’ve included those dependencies?