Easy Web scraping with Import.io

Want to quickly see and store federal IT job listings from USAJobs? Follow along as we simplify the process of pulling usable data from the website, using both illustrations and videos.

There's lots of data on the Web and plenty of ways to "scrape" it -- to mine information from a website and store it in the format you want. Even if you're comfortable coding, though, it can be a pain to figure out how to extract data from a complex HTML page.

Import.io is one of several products and services that aim to simplify the process. What's compelling about Import.io is that it's both free and fairly easy to use -- even for sites that use JavaScript and present results in multiple pages, such as USAJobs, the federal jobs site.

Want to quickly see and store all federal IT job listings from USAJobs? Follow along as we create our own crawler and API.

Install the Import.io app

Government tech job search results look like this when you get them from the Web, making it difficult to sort and compare listings -- especially across multiple pages. Scraping lets you save everything in spreadsheet-like format.

To start, head to Import.io to download, install and launch the app, which is available for Windows, OS X and Linux. You'll also need to sign up for an Import.io account -- free is fine unless you think you'll be making more than 250,000 page calls per day -- or sign in with a Facebook, GitHub, Google or LinkedIn account.

Create a data source

After you launch the application and sign in, you'll see options to create either a data set or a new data source. Your first step is to create a data source (since, naturally, there's no data to store without one). Click on that center New Data Source box.

Connect to the USAJobs site

Enter http://www.USAJobs.gov in the browser address bar up top and click the pink "Let's get cracking" bar at the bottom right. You'll see three choices for data extraction: Extractors, Crawlers and Connectors; hovering over each will explain it. I discovered from trial and error (and Import.io support) that a Connector is needed for sites that use JavaScript, which USAJobs.gov does. So, select the box with the Connector's magnifying glass.

Connect the sites

When asked if the site requires a login, choose "No, it doesn't." You'll then be told to go where you need to on the site to search. You're already on the search page, so click the pink "I'm there" bar in the bottom right. Don't start typing your query until after clicking "I'm there!"

Record your actions

Hit the red record button at the bottom right to start recording your actions and begin typing in the USAJobs.gov Keyword search box. You'll be asked if the website is working. For USAJobs.gov, it won't be (because JavaScript won't yet be enabled), so click the "Not working?" button. When you do, Import.io will enable JavaScript. Now it should be working.

Type 2200 (the code for IT) in the Keyword search box and click the website search button. When the job results appear, click the Import.io black stop-recording button. You'll be asked if the data you want is displayed in the browser. Click Yes or No as appropriate.

Name your input field

In the bottom right of the browser, you'll be asked to "associate one or more inputs below with the values you entered." That's basically just asking you to name your input field.

Click the pink "Make input" button next to 2200 and you'll be given an option to name the field for future use. Accepting the default "keyword" and field type text is fine. There should then be a pink bar you can click to "Take me to the next step."

Specify what you want to extract

You'll now be asked if you've got just one thing to extract on each page or "lots of the same thing." We have "lots of the same thing," since there are 25 job listings per page. Select that middle option.

You'll be asked if the site has a total results count that you'd like to save. Since total results are displayed on the page, choose yes. You'll then be asked if you want to train the Connector to look at the other results pages. We definitely do!
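The total results count is useful because it tells you how many results pages the Connector has to walk through. Here is a minimal Python sketch of that arithmetic, assuming the 25 listings per page mentioned above. The function name and the sample total are made up for illustration and are not part of Import.io.

```python
import math

RESULTS_PER_PAGE = 25  # listings per results page, as noted in the article


def pages_needed(total_results: int, per_page: int = RESULTS_PER_PAGE) -> int:
    """Return how many results pages are needed to show total_results listings."""
    if total_results < 0:
        raise ValueError("total_results must be non-negative")
    return math.ceil(total_results / per_page)


# Hypothetical total, for illustration only
print(pages_needed(130))  # prints 6
```

A partial last page still counts as a page, which is why the sketch rounds up.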

Train your connector

It's now time to "train" your connector. This just means typing in a couple of search terms to get results. Click the pink "Add example" button and type in something like 2200, 2210 (IT Management) or INFOSEC and click the pink Query button.

The app will run for a bit and then (hopefully) show your results, as you can see in the video accompanying this slide.

Tell Import.io what a "row" is

Here's where the project gets interesting: Telling Import.io what info you want and how to structure your data.

You do this by first telling Import.io what a "row" is -- equivalent to a row in Excel or a record in a database. After that, you'll choose which items in each "row" should be in what columns -- that's how you go from paragraph-like blocks of text in search results to a more spreadsheet-like format. This video will show you how.
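If you think in code, the row-and-column idea can be sketched like this: each job listing is one record, and each column is a named field in that record. The column names and values below are placeholders I chose for illustration, not actual USAJobs data or Import.io field names.

```python
import csv
import io

# Hypothetical column names; you pick your own when training columns.
COLUMNS = ["title", "agency", "location", "salary"]

# Each dict is one "row": a single job listing broken into named fields.
rows = [
    {"title": "Placeholder Job A", "agency": "Placeholder Agency 1",
     "location": "Placeholder City", "salary": "Placeholder range"},
    {"title": "Placeholder Job B", "agency": "Placeholder Agency 2",
     "location": "Placeholder City", "salary": "Placeholder range"},
]


def rows_to_csv(records, columns):
    """Write the records as CSV text, one line per row plus a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(rows_to_csv(rows, COLUMNS))
```

The output has one header line and one line per listing, which is the spreadsheet-like shape you are training Import.io to produce.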

To train a row, highlight everything in one row and click "Train row." Keep doing this until all 25 result "rows" are selected. In this case, I needed to "train" three times to get all results on the page; you likely will, too.

Create columns

Now that Import.io knows what a row is, you can decide on your columns. This is up to you -- only data you want needs to be assigned to a column.

Start by highlighting something you want to be in a column -- say the first job title. Click the pink "Add column" button, name your column and select the field type you want. Continue with as many columns as you want, then click "I've got what I need." You'll then be asked to train total results by highlighting the number on the page and clicking "Train total results."

We've included this video to help.

Keep going

You should be asked to go to page 2. Click on page 2, wait for it to load and then click "I'm on page 2." You should see results similar to this, with rows highlighted in alternate colors at the top and data in rows and columns at the bottom left. (I trained a total of four columns for this example.)

If all is okay, click "I've got all 25 rows" and then "I've got what I need."

Train the connector again

You may be asked to train the connector again; this time, try entering 2200 for all IT jobs, wait for results and click through the various confirmations just as you did the first time (without adding any columns). Then choose "I'm done creating tests."

You'll now be given an option to "Upload to import.io" and name your connector. After that, there'll be an option to "Show me the data!"

You're done!

Here's what my data looks like with both the 2210 and 2200 queries: all the federal IT job listings in one place. You can mouse over a query to see the option to remove it.
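If you later work with the results of several queries outside the app, you may want to combine them into one list. The Python sketch below assumes that two queries can return some of the same listings (the app may or may not handle that for you) and that each row has a unique "url" field. Both the field name and the sample rows are hypothetical.

```python
def merge_listings(*result_sets, key="url"):
    """Combine result sets, keeping the first row seen for each key value."""
    seen = set()
    merged = []
    for result_set in result_sets:
        for row in result_set:
            identifier = row[key]
            if identifier not in seen:
                seen.add(identifier)
                merged.append(row)
    return merged


# Placeholder rows standing in for the results of two queries
query_2200 = [
    {"title": "Placeholder Job A", "url": "https://example.com/job/1"},
    {"title": "Placeholder Job B", "url": "https://example.com/job/2"},
]
query_2210 = [
    {"title": "Placeholder Job B", "url": "https://example.com/job/2"},
    {"title": "Placeholder Job C", "url": "https://example.com/job/3"},
]

all_jobs = merge_listings(query_2200, query_2210)
print(len(all_jobs))  # prints 3
```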

At the top right, you'll have options to download in Excel, HTML, JSON or CSV, as well as more complex API integrations with Excel, Google Sheets and several programming languages. (Although for Excel and Google Sheets, unfortunately, you'll only get one page of results through the API, so a manual download is preferable.)

For more on how these work, you can click options at the bottom left of the Import.io screen.
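Once you have a CSV download, sorting and comparing listings takes only a few lines of code. The following is a rough Python sketch using only the standard library. The column names and sample rows are placeholders, since your file will contain whatever columns you trained.

```python
import csv
import io

# Stand-in for a downloaded file so the sketch runs on its own. With a real
# download you would pass open("your_download.csv", newline="", encoding="utf-8").
SAMPLE_CSV = """title,agency,location
Placeholder Job B,Placeholder Agency 2,Placeholder City
Placeholder Job A,Placeholder Agency 1,Placeholder City
"""


def load_listings(file_obj):
    """Read CSV rows into a list of dicts keyed by the header line."""
    return list(csv.DictReader(file_obj))


def sort_listings(rows, column):
    """Return the rows sorted alphabetically (case-insensitive) by one column."""
    return sorted(rows, key=lambda row: row[column].lower())


listings = load_listings(io.StringIO(SAMPLE_CSV))
for row in sort_listings(listings, "title"):
    print(row["title"], "|", row["agency"])
```

The same pattern works for any column you trained; only the name passed to sort_listings changes.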

Good luck!

is online managing editor at Computerworld.

Copyright © 2014 IDG Communications, Inc.