Last week wasn’t all that difficult. I mainly spent it reviewing labs and going over regular expressions, which is why I decided not to blog about it. It also served as a way of catching up on all of the work I missed when I got sick and was absent for a few days. It was tough trying to learn on my own because I didn’t know where to look for resources or project ideas. I mean yeah, watching videos and looking at online tutorials are good, but there’s a limit to how much you can do without hands-on practical coding. Flatiron does this perfectly, and it actually made me realize how lucky I am to be here. It has been a fun ride so far. Well, that was until we got into the topic of scraping this week. Scraping is by far the coolest thing I have ever seen in my life. No doubt about it. I think I am one step closer to figuring out how real-life programming works and how you can relate it to different scenarios.
Using scraping, you can pull information off of any web content online. This may sound boring, but when you think about the possibilities of projects you can build with this information, it becomes that much cooler. Imagine you are building an eBay-style site or something. You need a way to look at everyone else’s prices and make yours more attractive to potential buyers. You can totally use scraping here to dig out the prices other people are charging, and use that to your advantage. Another situation, which I actually just thought about, is the idea of doing automated tasks. There are many programs out there that can do tasks for you, so what happens if you mix one of those ideas with scraping? You get awesome, custom, automated work done just for you without even having to lift a finger. This is also why a lot of people like metaprogramming. This is definitely something I would like to learn and go into in more detail.
Anyways, back to explaining how awesome scraping is! Okay, so imagine you want to go to the movies with a date or a few of your friends over the weekend, but you don’t know what cinema to go to or what times the movies you want are playing. You may also want to check if a showing is sold out or if you can make reservations (since some theatres allow you to). All you have to do is get the data from the site programmatically, and then have some form of bot frequently check and refresh it for you. You can then send yourself an email or notification with updates on the movies you are hoping to see. You may be asking yourself: “Okay Damian, that’s pretty cool and all, but how the hell do you do this?” Well, it turns out that someone out there already thought of this idea and made life super easy for us Rubyists. This beauty of a tool is a Ruby gem called Nokogiri. Using this gem, you can get content from any site by simply using some CSS selectors.
To install the gem, simply run:
gem install nokogiri
After you have it installed, you simply require it at the top of your program. Now, let’s scrape a website. For this tutorial, I will use the Fandango page.
First, go to your terminal and create a Ruby file. Here is a screenshot of me setting everything up. I have no idea why I called it scraping_aboutme, but hey, that’s not what really matters right now :p Okay, so now that you have your file and you have required Nokogiri at the top, you should also require open-uri to make HTTP requests.
The top of your program should now look like this:
require "nokogiri"
require "open-uri"
After having done that, it’s time to start scraping! First, we need to get back a Nokogiri document, which we can then pull our data from. Think of this document as a tree of nodes instead of raw HTML, where we can select and target any element by using some simple CSS.
nokogiri_document = Nokogiri::HTML(open("http://www.fandango.com/moviesintheaters"))
nokogiri_document contains a Nokogiri representation of our data. Now, let’s try to get back the title of a movie:
puts nokogiri_document.at_css("li.visual-item .visual-title").text.strip
This will return the first movie it finds, which in this case is The Man From U.N.C.L.E. from the ‘Opening This Week’ section. Now, this isn’t really that useful. It just returns one movie title. What if I want the fifth movie title, or maybe all of the movies listed? All you have to do is change the at_css method to css. This method, instead of returning only the first match, returns all of the possible matches in a collection (a Nokogiri node set, which behaves a lot like an array). Now, if we want to print all of the movie titles, we just need to iterate over it and print them out!
nokogiri_document.css("li.visual-item .visual-title").each do |movie|
  puts movie.text.strip
end
Super easy, right? And there you have it! You successfully scraped the Fandango movie site to get all of the movies listed. Now if you wanted, you could find the CSS selector to get just the ‘Opening This Week’ movies, or maybe the ‘Now Playing’ ones. You could get super creative and parse the movie ticket prices, or whether or not showings are sold out. The possibilities are endless, which is what makes scraping so much fun! The complete code snippet of this tutorial can be found below:
require "nokogiri"
require "open-uri"

def movie_info
  nokogiri_document = Nokogiri::HTML(open("http://www.fandango.com/moviesintheaters"))
  nokogiri_document.css("li.visual-item .visual-title").each do |movie|
    puts movie.text.strip
  end
end

movie_info