This post is part of a series of posts that provide step-by-step instructions on how to write a simple web scraper using Ruby on morph.io. If you find any problems, let us know in the comments so we can improve these tutorials.
In the previous post we set up our scraper. Now we’re going to start writing it.
It can be really helpful to start out writing your scraper in an interactive shell. In the shell you’ll get quick feedback as you explore the page you’re trying to scrape, instead of having to run your scraper file to see what your code does.
The interactive shell for Ruby is called irb. Start an irb session on the command line with:
> bundle exec irb
The bundle exec command executes your irb command in the context of your project’s Gemfile. This means that your specified gems will be available.
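If you followed the previous post, your Gemfile is what declares those gems for Bundler. A minimal sketch of what it might contain (your actual file may differ):

```ruby
# Gemfile: declares the gems that bundle exec makes available
source 'https://rubygems.org'

gem 'mechanize'
```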
The first command you need to run in irb is:
>> require 'mechanize'
This loads in the Mechanize library. Mechanize is a helpful library for requesting and interacting with webpages.
Now you can create an instance of Mechanize that will be your agent to do things like ‘get’ pages and ‘click’ on links:
>> agent = Mechanize.new
You want to get information for all the members you can. Looking at your target page it turns out the members are spread across several pages. You’ll have to scrape all 3 pages to get all the members. Rather than worry about this now, let’s start small. Start by just collecting the information you want for the first member on the first page. Reducing the complexity as you start to write your code will make it easier to debug as you go along.
In your irb session, use the Mechanize get method to get the first page with members listed on it.
>> url = "https://morph.io/documentation/examples/australian_members_of_parliament"
>> page = agent.get(url)
This returns the source of your page as a Mechanize Page object. You’ll be pulling the information you want out of this object using the handy Nokogiri XML searching methods that Mechanize loads in for you.
Let’s review some of these methods.
at()
The at() method returns the first element that matches the selector provided. For example, page.at('ul') returns the first <ul> element in the page as a Nokogiri XML Element that you can parse. There are a number of ways to target elements using the at() method. We’re using a CSS-style selector in this example because many people are familiar with this style from writing CSS or jQuery. You can also target elements by class, e.g. page.at('.search-filter-results'); or by id, e.g. page.at('#content').
search()
The search() method works like the at() method, but returns an Array of every element that matches the target instead of just the first. Running page.search('li') returns an Array of every <li> element in page.
You can chain these methods together to find specific elements. page.at('.search-filter-results').at('li').search('p') will return an Array of all <p> elements found within the first <li> element found within the first element with the class search-filter-results on the page.
You can use the at() and search() methods to get the first member list item on the page:
>> page.at('.search-filter-results').at('li')
This returns a big blob of code that can be hard to read. You can use the inner_text() method to help work out if you’ve got the element you’re looking for: the first member in the list.
>> page.at('.search-filter-results').at('li').inner_text
=> "\n\nThe Hon Ian Macfarlane MP\n\n\n\n\n\nMember for\nGroom,Queensland\nParty\nLiberal Party of Australia\nConnect\n\nEmail\n\n\n"
Victory!
Now that you’ve found your first member, you want to collect their title, electorate, party, and the url for their individual page. Let’s start with the title.
If you view the page source in your browser and look at the first member list item, you can see that the title of the member, “The Hon Ian Macfarlane MP”, is the text inside the link in the <p> with the class ‘title’.
<li>
<p class='title'>
<a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
The Hon Ian Macfarlane MP
</a>
</p>
<p class='thumbnail'>
<a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
<img alt="Photo of The Hon Ian Macfarlane MP" src="http://parlinfo.aph.gov.au/parlInfo/download/handbook/allmps/WN6/upload_ref_binary/WN6.JPG" width="80" />
</a>
</p>
<dl>
<dt>Member for</dt>
<dd>Groom, Queensland</dd>
<dt>Party</dt>
<dd>Liberal Party of Australia</dd>
<dt>Connect</dt>
<dd>
<a class="social mail" href="mailto:Ian.Macfarlane.MP@aph.gov.au"
target="_blank">Email</a>
</dd>
</dl>
</li>
You can use the .inner_text method here.
>> page.at('.search-filter-results').at('li').at('.title').inner_text
=> "\nThe Hon Ian Macfarlane MP\n"
There it is: the title of the first member. It’s got messy \n whitespace characters around it though. Never fear, you can clean it up with the Ruby method strip.
>> page.at('.search-filter-results').at('li').at('.title').inner_text.strip
=> "The Hon Ian Macfarlane MP"
You’ve successfully scraped the first bit of information you want.
Now that you’ve got some code for your scraper, let’s add it to your scraper.rb file and make your first commit.
You’ll want to come back to your irb session, so leave it running and open your scraper.rb file in your code editor. Replace the commented-out template code with the working code from your irb session.
Your scraper.rb should look like this:
require 'mechanize'
agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'
page = agent.get(url)
page.at('.search-filter-results').at('li').at('.title').inner_text.strip
You actually want to collect members with this scraper, so create a member object and assign the text you’ve collected as its title:
require 'mechanize'
agent = Mechanize.new
url = 'https://morph.io/documentation/examples/australian_members_of_parliament'
page = agent.get(url)
member = {
  title: page.at('.search-filter-results').at('li').at('.title').inner_text.strip
}
Add a final line to the file to help confirm that everything is working as expected.
p member
You can now, back on the command line in the folder for your project, run this file with Ruby:
> bundle exec ruby scraper.rb
The scraper runs and the p command prints your member:
> bundle exec ruby scraper.rb
{:title=>"The Hon Ian Macfarlane MP"}
This is a good time to make your first git commit for this project. Yay!
In our next post we’ll work out how to scrape more bits of information from the page.
Who comments in PlanningAlerts and how could it work better?
In our last two quarterly planning posts (see Q3 2015 and Q4 2015), we’ve talked about helping people write to their elected local councillors about planning applications through PlanningAlerts. As Matthew wrote in June, “The aim is to strengthen the connection between citizens and local councillors around one of the most important things that local government does which is planning”. We’re also trying to improve the whole commenting flow in PlanningAlerts.
I’ve been working on this new system for a while now, prototyping and iterating on the new comment options and folding improvements back into the general comment form so everybody benefits.
About a month ago I ran a survey with people who had made a comment on PlanningAlerts in the last few months. The survey went out to just over 500 people and we had 36 responses, about the same percentage turnout as our PlanningAlerts survey at the beginning of the year (6% of 20,000). As you can see, the vast majority of PlanningAlerts users don’t currently comment.
We’ve never asked users about the commenting process before, so I was initially trying to find out some quite general things:
The responses include some clear patterns and have raised a bunch of questions to follow up on in short structured interviews. I’m also going to have these people use the new form prototype, to weed out usability problems before we launch this new feature to some areas of PlanningAlerts.
Here are some of the observations from the survey responses:
Older people are more likely to comment in PlanningAlerts
We’ve now run two surveys of PlanningAlerts users asking them roughly how old they are. The first survey was sent to all users; this recent one went just to people who had recently commented on a planning application through the site.
Compared to the first survey to all users, responders to the recent commenters survey were relatively older. There were fewer people in their 30s and 40s and more in their 60s and 70s. Older people may be more likely to respond to these surveys generally, but we can still see from the different results that commenters are relatively older.
Knowing this can help us better empathise with the people using PlanningAlerts and make it more usable. For example, there is currently a lot of very small, grey text on the site that is likely not noticeable or comfortable to read for people with diminished eye sight—almost everybody’s eye sight gets at least a little worse with age. Knowing that this could be an issue for lots of PlanningAlerts users makes improving the readability of text a higher priority.
There’s a good understanding that comments go to planning authorities, but not that they go to neighbours signed up to PlanningAlerts
To “Who do you think receives your comments made on PlanningAlerts?” 86% (32) of responders checked “Local council staff”. Only 35% (13) checked “Neighbours who are signed up to PlanningAlerts”. Only one person thought their comments also went to elected councillors.
There seems to be a good understanding amongst these commenters that their comments are sent to the planning authority for the application. But not that they go to other people in the area signed up to PlanningAlerts. They were also very clear that their comments did not go to elected councillors.
In the interviews I want to follow up on this and find out if people are positive or negative about their comments going to other locals. I personally think it’s an important part of PlanningAlerts that people in an area can learn about local development, local history and how to impact the planning process from their neighbours. It seems like an efficient way to share knowledge, a way to strengthen connections between people and to demonstrate how easy it is to comment. If people are negative about this then what are their concerns?
“I have no idea if the comments will be listened to or what impact they will have if any”
There’s a clear pattern in the responses that people don’t think their comments are being listened to by planning authorities. They also don’t know how they could find out if they are. One person noted this as a reason why they don’t make more comments.
Giving people simple access to their elected local representatives, and a way to have a public exchange with them, will hopefully provide a lever to increase their impact.
“I would only comment on applications that really affect me”
There was a strong pattern of people saying they only comment on applications that will affect them or that are interesting to them:
How do people decide if an application is relevant to them? Are there common criteria?
Why don’t you comment on more applications? “It takes too much time”
A number of people mentioned that commenting was a time consuming process, and that this prevented them from commenting on more applications:
What are people’s basic processes for commenting in PlanningAlerts? What are the most time consuming components of this? Can we save people time?
“I have only commented on applications where I have a knowledge of the property or street amenities.”
A few people mentioned that they feel you should have a certain amount of knowledge of an application or area to comment on it, and that they only comment on applications they are knowledgeable about.
How does someone become knowledgeable about an application? What is the most important and useful information about applications?
Comment in private
A small number of people mentioned that they would like to be able to comment without it being made public.
Suggestions & improvements
There were a few suggestions for changes to PlanningAlerts:
Summing up PlanningAlerts
We also had a few comments that are just nice summaries of what is good about PlanningAlerts. It’s great to see that there are people who understand and can articulate what PlanningAlerts does well:
Next steps
If we want to make using PlanningAlerts an intuitive and enjoyable experience we need to understand the humans at the centre of its design. This is a small step to improve our understanding of the type of people who comment in PlanningAlerts, some of their concerns, and some of the barriers to commenting.
We’ve already drawn on the responses to this survey in updating wording and information surrounding the commenting process to make it better fit people’s mental model and address their concerns.
I’m now lining up interviews with a handful of the people who responded to try and answer some of the questions raised above and get to know them more. They’ll also show us how they use PlanningAlerts and test out the new comment form. This will highlight current usability problems and hopefully suggest ways to make commenting easier for everyone.
Design research is still very new to the OpenAustralia Foundation. Like all our work, we’re always open to advice and contributions to help us improve our projects. If you’re experienced in user research and want to make a contribution to our open source projects to transform democracy, please drop us a line or come down to our monthly pub meet. We’d love to hear your ideas.