Ruby web scraping tutorial on morph.io – Part 4, dealing with pagination

This post is part of a series of posts that provide step-by-step instructions on how to write a simple web scraper using Ruby on morph.io. If you find any problems, let us know in the comments so we can improve these tutorials.


In the last post we finished collecting the data we want but discovered we needed to collect it over several pages. In this post we learn how to deal with this pagination. There are a number of techniques for dealing with pagination and the one we present here is deliberately simple.

Visit the target page in your browser and navigate between the different pages using the links above the members list. Notice that when you go to page 2 the URL is mostly the same except it has the query string ?page=2 at the end:

https://morph.io/documentation/examples/australian_members_of_parliament?page=2

When scraping websites pay close attention to the page URLs and their query strings. They often include clues to help you scrape.
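If you want to inspect a query string from Ruby rather than by eye, the standard library's URI module can pull it apart. This is only a small sketch: query_params is a helper name made up for this example, not part of Mechanize or morph.io.

```ruby
require 'uri'

# Hypothetical helper: return the query string parameters of a URL
# as a Hash of Strings.
def query_params(url)
  query = URI.parse(url).query
  return {} if query.nil? # the URL has no "?..." part

  URI.decode_www_form(query).to_h
end

url = 'https://morph.io/documentation/examples/australian_members_of_parliament?page=2'
p query_params(url)
```

For the page 2 URL this gives a Hash with a single key, "page", whose value is the String "2".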

It turns out you can navigate between the different member pages by just changing the page number to 1, 2 or 3 in the query string.
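As a rough sketch of that idea, you could build the URL for each page in Ruby before making any requests. The page_url helper below is a name invented for this example; it uses the standard library's URI module, and plain string concatenation would work just as well.

```ruby
require 'uri'

# Hypothetical helper: build the URL for one page of results by setting
# the "page" query parameter on the base URL.
def page_url(base_url, page_number)
  uri = URI.parse(base_url)
  uri.query = URI.encode_www_form(page: page_number)
  uri.to_s
end

base_url = 'https://morph.io/documentation/examples/australian_members_of_parliament'

# Print the URL of each of the three member pages.
(1..3).each do |page_number|
  puts page_url(base_url, page_number)
end
```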

You can use what you’ve discovered as the basis for another each loop. This time you want to make a loop that runs your scraping code for each page.

You know that the three pages with members are pages 1, 2 and 3. Create an Array of these page numbers ["1", "2", "3"] and then loop through these numbers to run your get request and scraping code for each page. The page numbers are written as Strings so they can be joined directly onto the URL with +.

require 'mechanize'

agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'

# Run the request and scraping code once for each page of results
["1", "2", "3"].each do |page_number|
  # Add the page number to the URL as a query string, e.g. ?page=2
  page = agent.get(url + "?page=" + page_number)

  # Each li in the results list holds one member's details
  page.at('.search-filter-results').search('li').each do |li|
    member = {
      title: li.at('.title').inner_text.strip,
      electorate: li.search('dd')[0].inner_text,
      party: li.search('dd')[1].inner_text,
      url: li.at('.title a').attr('href')
    }

    p member
  end
end

Save and run your scraper.rb. You should now see the details of all 150 members printed. Well done! You should do a git commit for this working code.
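Hard-coding the page numbers works here because you already know there are three pages. Another simple technique, not used in this tutorial, is to keep requesting the next page until one comes back with no members. The sketch below only illustrates the shape of that loop: members_on_page and FAKE_PAGES are hypothetical stand-ins for the Mechanize request and parsing code, and the member names are placeholder data so the example runs without touching the network.

```ruby
# Placeholder data standing in for the real site (an assumption for this sketch).
FAKE_PAGES = {
  1 => ['Member A', 'Member B'],
  2 => ['Member C', 'Member D'],
  3 => ['Member E']
}.freeze

# Hypothetical stand-in for "get the page and scrape its members".
# Returns an empty Array for any page past the last one.
def members_on_page(page_number)
  FAKE_PAGES.fetch(page_number, [])
end

# Keep asking for the next page until a page has no members on it.
def collect_all_members
  all_members = []
  page_number = 1

  loop do
    members = members_on_page(page_number)
    break if members.empty? # an empty page means we have gone past the end

    all_members.concat(members)
    page_number += 1
  end

  all_members
end

p collect_all_members
```

Whether this works on a real site depends on how it responds to a page number past the end, so check that in your browser before relying on it.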

This is great, but there’s one more step. You’ve written a scraper that collects the details of members of Parliament and prints them to the command line, but you actually want to save this data. You need to store the information you’ve scraped so you can use it in your projects, and that’s one of the things we’ll cover in our next and final post.

