Web Scraping. What is it good for?

Colten Appleby
4 min read · May 28, 2021


Scraping a website can be an extremely powerful way to retrieve data for your own project. You may want to know how often Joe Biden is mentioned in a New York Times article, or how many home runs Mike Trout has hit in his career. The possibilities are endless. In this blog post I am going to walk through the Ruby tools you need to grab a webpage, convert it to a readable format, and then parse it to extract the specific piece of information you need.

URI

The first tool is URI, which the open-uri gem extends with URI.open. URI.open will grab the webpage and return a file-like object that Nokogiri can then convert into a parseable format.

Nokogiri

The main attraction…

Nokogiri is a tool that will convert a URI-created object into HTML or XML and will then allow you to parse it into the data that you want. A note on parsing: parsing is the process of converting raw data into something that is readable. This could be the process of converting a long string of text into a single word, or even a large database table into a smaller data structure. Nokogiri, as we will see later, can break down a large HTML file by CSS selectors to grab the specific information that we want!

Installing URI and Nokogiri

Installing these two gems is as easy as pie. Just install both gems from your terminal, and if your project uses a Gemfile, remember to run "bundle install" after adding them.

gem install open-uri
gem install nokogiri

Using both gems in your program

After installing both gems, write the require statements below at the top of your Ruby file.

require "open-uri"
require "nokogiri"

Prime Example

Now it is time to actually scrape a webpage and extract the specific piece of information that we want! In this example we are going to grab the number of jobs that have been posted on Indeed's website for a specific job search in an area: Software Engineering within 25 miles of Midtown Manhattan (10019). Note that this will include remote jobs. The first thing we need is the website URL.

query_url = "https://www.indeed.com/jobs?q=software+engineer&l=10019"

Now we pass the URL to URI.open so the page can be scraped.

web_scrape = URI.open(query_url)

This is what is saved as “web_scrape.” What does this even mean?

#<File:0x00007fc82d9e8de0>

Now we use Nokogiri to convert this File into a document that we can make some sense of. In our case we are going to parse the file as HTML, but Nokogiri can also parse XML.

doc = Nokogiri::HTML(web_scrape)

Printing doc produces the screenshot below. As you can see, this is extremely difficult to read; however, it is much better than the File object returned from web_scrape. Finally, this is something that we can work with!

printing “doc”

Now we go to the browser to find the CSS selector we need to parse this massive block of text into some useful information! Go to the URL that we put into query_url, right-click anywhere on the page, and click Inspect at the bottom of the menu.

Right Click on the webpage and hit Inspect or Inspect Element
Dev Tools

After the dev tools open, click the small cursor icon in the top left. This will let you hover over the webpage, select different elements, and collect information about them.

Above we hovered over the "Page 1 of X,XXX jobs" text and this popped up. The key is the purple and blue text at the top: it says div#searchCountPages. We can now tell Nokogiri to look for this element and pull out this specific piece of data.

content = doc.css("div#searchCountPages").text
printing “content”

Wow! That's exactly what we wanted! With some basic Ruby parsing of the "Page 1 of 6,854 jobs" text we get "6854", which can easily be loaded into a database!

content.split("of ")[1].split(" jobs")[0].split(",").join
# => "6854"
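The chained split works, but it returns the String "6854". If you want an Integer, a regex capture plus to_i gets you there in one step. A sketch, using the same sample text from the scrape above:

```ruby
content = "Page 1 of 6,854 jobs"  # sample of the scraped text above

# Capture the comma-separated number, drop the commas, convert to Integer.
count = content[/of ([\d,]+) jobs/, 1].delete(",").to_i
puts count  # prints 6854
```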


Written by Colten Appleby

Student in immersive software development bootcamp