Advanced web scraping with Mechanize
In my last post I gave a basic introduction to web scraping with Ruby and Nokogiri. At the end of that post I mentioned that for more “advanced” scraping Mechanize was worth looking into.
This post explains how to do some more advanced web scraping using Mechanize, which builds on top of Nokogiri’s excellent HTML processing support.
Scraping Pitchfork reviews
Mechanize provides an out-of-the-box scraping solution that can handle filling in forms, following links and respecting a site’s robots.txt file. Here I’ll show you how it can be used to scrape the latest reviews from Pitchfork 1.
Reviews are spread across multiple pages, so we can’t simply fetch a single page and parse it with Nokogiri. This is where Mechanize can help with its ability to click on links and follow them to other pages.
Setup
First we’ll need to install Mechanize and its dependencies from Rubygems.
$ gem install mechanize
With Mechanize installed we can now start writing our scraper. Create a file called scraper.rb
and add the following require
statements. These specify the dependencies we need for this script. date
and json
are part of Ruby’s standard library, so there’s no need to install them separately.
require 'mechanize'
require 'time'
require 'json'
Now we can start using Mechanize. First thing we need to do is create a new instance of Mechanize (agent
) and then use it to fetch a remote webpage (page
).
agent = Mechanize.new
page = agent.get("https://pitchfork.com/reviews/albums/")
Find links to reviews
Now we can use the page
object to find links to reviews. Mechanize provides a .links_with
method which, as the name suggests, finds links with the given attributes. Here we look for links which match a regular expression.
This returns an array of links, but we only want links to reviews, not pagination. To remove unwanted links we can call .reject
on the array of links and reject any which look like pagination links.
review_links = page.links_with(href: %r{^/reviews/albums/\w+})
review_links = review_links.reject do |link|
parent_classes = link.node.parent['class'].split
parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end
For the purposes of demonstration—and so we don’t completely hammer Pitchfork’s server’s—we’ll just take the first four review links.
review_links = review_links[0...4]
Process each review
We now have a list of Mechanize links which we want to map to the reviews that they link to. Since they’re in an array we can call .map
on it and return a hash from each iteration.
The Mechanize page
object has a .search
method which delegates to Nokogiri’s .search
method. This means that we can use a CSS selector as an argument to .search
and it will return an array of matching elements.
reviews = review_links.map do |link|
review = link.click
artist = review.search('.artist-links').text
album = review.search('.single-album-tombstone__review-title').text
label, year = review.search('.single-album-tombstone__meta').text.split('•').map(&:strip)
reviewer = review.search('.authors-detail__display-name').text
review_date = Time.parse(review.search('.pub-date')[0]['title'])
score = review.search('.score').text.to_f
{
artist: artist,
album: album,
label: label,
year: year,
reviewer: reviewer,
review_date: review_date,
score: score
}
end
Now we’ve got an array of review hashes we can output the reviews in JSON format.
puts JSON.pretty_generate(reviews)
All together now
Here’s the whole script:
require 'mechanize'
require 'date'
require 'json'
agent = Mechanize.new
page = agent.get("http://pitchfork.com/reviews/albums/")
review_links = page.links_with(href: %r{^/reviews/albums/\w+})
review_links = review_links.reject do |link|
parent_classes = link.node.parent['class'].split
parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end
review_links = review_links[0...4]
reviews = review_links.map do |link|
review = link.click
artist = review.search('.artist-links').text
album = review.search('.single-album-tombstone__review-title').text
label, year = review.search('.single-album-tombstone__meta').text.split('•').map(&:strip)
reviewer = review.search('.authors-detail__display-name').text
review_date = Time.parse(review.search('.pub-date')[0]['title'])
score = review.search('.score').text.to_f
{
artist: artist,
album: album,
label: label,
year: year,
reviewer: reviewer,
review_date: review_date,
score: score
}
end
puts JSON.pretty_generate(reviews)
Put this code in a file called scraper.rb
and run it with the following.
$ ruby scraper.rb
And it should output something like this:
[
{
"artist": "Viet Cong",
"album": "Viet Cong",
"label": "Jagjaguwar",
"year": "2015",
"reviewer": "Ian Cohen",
"review_date": "2015-01-22",
"score": 8.5
},
{
"artist": "Lupe Fiasco",
"album": "Tetsuo & Youth",
"label": "Atlantic / 1st and 15th",
"year": "2015",
"reviewer": "Jayson Greene",
"review_date": "2015-01-22",
"score": 7.2
},
{
"artist": "The Go-Betweens",
"album": "G Stands for Go-Betweens: Volume 1, 1978-1984",
"label": "Domino",
"year": "2015",
"reviewer": "Douglas Wolk",
"review_date": "2015-01-22",
"score": 8.2
},
{
"artist": "The Sidekicks",
"album": "Runners in the Nerved World",
"label": "Epitaph",
"year": "2015",
"reviewer": "Ian Cohen",
"review_date": "2015-01-22",
"score": 7.4
}
]
If you want, you can save this JSON to a file by redirecting standard out to a file.
$ ruby scraper.rb > reviews.json
Conclusion
This only scratches the surface of Mechanize. One thing I haven’t even touched on is it’s ability to fill in and submit forms. If you’re interested in learning more then I recommend you look at the Mechanize guide and Mechanize examples.
A lot of people commented that my previous post should have just used Mechanize from the off. Mechanize is a great tool, but for simple tasks it’s a bit of an overkill.
But the fact that Mechanize handles fetching the remote webpage and respects robots.txt files makes it a useful tool. Even for simple scraping tasks, Mechanize will often be the best tool for the job.
-
You should always scrape responsibly. Check out the Is scraping legal? blog post from ScraperWiki for more discussion on the subject. ↩