Ruby web scraping tutorial on morph.io – Part 2, start writing your scraper

This post is part of a series that provides step-by-step instructions on how to write a simple web scraper using Ruby on morph.io. If you find any problems, let us know in the comments so we can improve these tutorials.


In the last post we set up our scraper. Now we’re going to start writing it.

It can be really helpful to start out writing your scraper in an interactive shell. In the shell you’ll get quick feedback as you explore the page you’re trying to scrape, instead of having to run your scraper file to see what your code does.

The interactive shell for Ruby is called irb. Start an irb session on the command line with:

> bundle exec irb

The bundle exec command executes your irb command in the context of your project’s Gemfile. This means that your specified gems will be available.

The first command you need to run in irb is:

>> require 'mechanize'

This loads in the Mechanize library. Mechanize is a helpful library for making requests to and interacting with web pages.

Now you can create an instance of Mechanize that will be your agent to do things like ‘get’ pages and ‘click’ on links:

>> agent = Mechanize.new

You want to get information for all the members you can. Looking at your target page, it turns out the members are spread across several pages. You’ll have to scrape all 3 pages to get all the members. Rather than worry about this now, let’s start small. Start by just collecting the information you want for the first member on the first page. Reducing the complexity as you start to write your code will make it easier to debug as you go along.

In your irb session, use the Mechanize get method to get the first page with members listed on it.

>> url = "https://morph.io/documentation/examples/australian_members_of_parliament"
>> page = agent.get(url)

This returns the source of your page as a Mechanize Page object. You’ll be pulling the information you want out of this object using the handy Nokogiri XML searching methods that Mechanize loads in for you.

Let’s review some of these methods.

at()

The at() method returns the first element that matches the selector provided. For example, page.at('ul') returns the first <ul> element in the page as a Nokogiri XML Element that you can parse. There are a number of ways to target elements using the at() method. We’re using a CSS-style selector in this example because many people are familiar with this style from writing CSS or jQuery. You can also target elements by class, e.g. page.at('.search-filter-results'), or by id, e.g. page.at('#content').

search()

The search() method works like the at() method, but returns every element that matches the target instead of just the first. Running page.search('li') returns a collection of every <li> element in page. Strictly speaking this collection is a Nokogiri NodeSet rather than an Array, but it behaves much like one.

You can chain these methods together to find specific elements. page.at('.search-filter-results').at('li').search('p') will return all <p> elements found within the first <li> element found within the first element with the class search-filter-results on the page.

You can chain two at() calls to get the first member list item on the page:

>> page.at('.search-filter-results').at('li')

This returns a big blob of code that can be hard to read. You can use the inner_text() method to help work out if you’ve got the element you’re looking for: the first member in the list.

>> page.at('.search-filter-results').at('li').inner_text
=> "\n\nThe Hon Ian Macfarlane MP\n\n\n\n\n\nMember for\nGroom,Queensland\nParty\nLiberal Party of Australia\nConnect\n\nEmail\n\n\n"

Victory!

Now that you’ve found your first member, you want to collect their title, electorate, party, and the url for their individual page. Let’s start with the title.

If you view the page source in your browser and look at the first member list item, you can see that the title of the member, “The Hon Ian Macfarlane MP”, is the text inside the link in the <p> with the class ‘title’.

<li>
  <p class='title'>
    <a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
      The Hon Ian Macfarlane MP
    </a>
  </p>
  <p class='thumbnail'>
    <a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
      <img alt="Photo of The Hon Ian Macfarlane MP" src="http://parlinfo.aph.gov.au/parlInfo/download/handbook/allmps/WN6/upload_ref_binary/WN6.JPG" width="80" />
    </a>
  </p>
  <dl>
    <dt>Member for</dt>
    <dd>Groom, Queensland</dd>
    <dt>Party</dt>
    <dd>Liberal Party of Australia</dd>
    <dt>Connect</dt>
    <dd>
      <a class="social mail" href="mailto:Ian.Macfarlane.MP@aph.gov.au"
      target="_blank">Email</a>
    </dd>
  </dl>
</li>

You can use the inner_text method again here, this time on the element with the class title.

>> page.at('.search-filter-results').at('li').at('.title').inner_text
=> "\nThe Hon Ian Macfarlane MP\n"

There it is: the title of the first member. It’s got messy \n whitespace characters around it though. Never fear, you can clean it up with the Ruby method strip.

>> page.at('.search-filter-results').at('li').at('.title').inner_text.strip
=> "The Hon Ian Macfarlane MP"
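If you're curious about exactly what strip does, here is a short sketch in plain Ruby. The first string is the title as returned in the irb session above; the second string is a made-up example to show that whitespace in the middle of a string is left alone.

```ruby
# The raw text as returned by inner_text in the irb session.
raw_title = "\nThe Hon Ian Macfarlane MP\n"

# strip returns a new string without leading and trailing whitespace,
# including newlines. The original string is not changed.
title = raw_title.strip
p title       # prints "The Hon Ian Macfarlane MP"
p raw_title   # prints "\nThe Hon Ian Macfarlane MP\n"

# Hypothetical example: whitespace inside the string is untouched.
p "  Groom,\n Queensland ".strip   # prints "Groom,\n Queensland"
```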

You’ve successfully scraped the first bit of information you want.

Now that you’ve got some code for your scraper, let’s add it to your scraper.rb file and make your first commit.

You’ll want to come back to your irb session, so leave it running and open your scraper.rb file in your code editor. Replace the commented-out template code with the working code from your irb session.

Your scraper.rb should look like this:

require 'mechanize'

agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'
page = agent.get(url)

page.at('.search-filter-results').at('li').at('.title').inner_text.strip

You actually want to collect members with this scraper, so create a member Hash and assign the text you’ve collected as its title:

require 'mechanize'

agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'
page = agent.get(url)

member = {
  title: page.at('.search-filter-results').at('li').at('.title').inner_text.strip
}

Add a final line to the file to help confirm that everything is working as expected.

p member

Back on the command line, in the folder for your project, you can now run this file with Ruby:

> bundle exec ruby scraper.rb

The scraper runs and the p command prints your member:

> bundle exec ruby scraper.rb
{:title=>"The Hon Ian Macfarlane MP"}
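As a quick aside on that last line, here is a sketch in plain Ruby of what p does with the member Hash and how you could read a single value back out. The title is hard-coded here, as an assumption, so the example runs without fetching the page.

```ruby
# Hard-coded stand-in for the value scraped from the page.
member = { title: 'The Hon Ian Macfarlane MP' }

# p prints the inspect form of the object, which is handy for debugging,
# and returns the object itself.
printed = p(member)

# Use the :title key to read the value, and puts to print it as plain text.
puts member[:title]   # prints The Hon Ian Macfarlane MP
```

The exact text p prints for a Hash depends on your Ruby version. On Ruby 3.4 and later you'll see {title: "The Hon Ian Macfarlane MP"} rather than the older {:title=>"The Hon Ian Macfarlane MP"} form shown above. Both describe the same Hash.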

This is a good time to make your first git commit for this project. Yay!

In our next post we’ll work out how to scrape more bits of information from the page.
