Web scraping with R and rvest (includes video & code)

Using SelectorGadget to extract text from a Web page
Screenshot by Sharon Machlis of National Weather Service website using SelectorGadget

Sometimes the data you want is available on a Web page, but not in a form you can easily download. That's where Web scraping comes in. Most general-purpose computer languages have a library for easily collecting data from an HTML page. R does too -- a new package called rvest by Hadley Wickham, modeled after Python's Beautiful Soup.

Watch how easy it is to import data from a Web page into R. Code from the video is below.

Note: If you don't have rvest installed on your system, you can download and install it with install.packages("rvest"). Get SelectorGadget at SelectorGadget.com.

Note that CSS can change on Web pages -- in fact, the best CSS for the National Weather Service forecast has already changed in the few weeks since I recorded this video. That's another good reason to use SelectorGadget, which makes it easy to find the CSS pattern you want.
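To illustrate how a CSS pattern picks out pieces of a page, here is a minimal sketch that runs against a small HTML fragment instead of the live site. The fragment and its contents are made up for this example; the real forecast page's markup may differ. Only the selector string is taken from the code in this article.

```r
library(rvest)

# Hypothetical HTML fragment written for this example (not the real NWS page)
sample_html <- '<html><body>
<div id="detailed-forecast-body">
  <div><b>Tonight</b></div><div class="forecast-text">Clear, with a low around 30.</div>
  <div><b>Friday</b></div><div class="forecast-text">Sunny, with a high near 48.</div>
</div>
<p class="footer">Not part of the forecast.</p>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(sample_html)

# The comma means "either pattern": <b> tags inside #detailed-forecast-body,
# or any element with class forecast-text
nodes <- html_nodes(page, "#detailed-forecast-body b, .forecast-text")

# Pull the text out of the matched nodes and join it into one string
forecast <- html_text(nodes)
paste(forecast, collapse = " ")
```

The footer paragraph is skipped because it matches neither part of the pattern, which is the same filtering SelectorGadget helps you set up on a real page.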

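library(rvest)
# Newer versions of rvest use read_html() in place of the html() call shown in the video
htmlpage <- read_html("http://forecast.weather.gov/MapClick.php?lat=42.31674913306716&lon=-71.42487878862437&site=all&smap=1#.VRsEpZPF84I")
# Select the nodes that match the CSS pattern found with SelectorGadget
forecasthtml <- html_nodes(htmlpage, "#detailed-forecast-body b , .forecast-text")
# Extract the text from those nodes and combine it into a single string
forecast <- html_text(forecasthtml)
paste(forecast, collapse = " ")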
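Because a page's CSS can change, a selector that worked last month may quietly match nothing today. The sketch below shows one way to guard against that. The helper name `extract_text` and the tiny test page are hypothetical and not part of the original code; only `read_html()`, `html_nodes()` and `html_text()` come from rvest.

```r
library(rvest)

# Hypothetical helper (not from the article): return the text matched by a
# CSS selector, and stop with a clear message if the selector matches nothing
extract_text <- function(page, css) {
  nodes <- html_nodes(page, css)
  if (length(nodes) == 0) {
    stop("No nodes matched selector: ", css, call. = FALSE)
  }
  paste(html_text(nodes), collapse = " ")
}

# Made-up one-paragraph page so the example runs without a network connection
page <- read_html('<html><body><p class="forecast-text">Sunny.</p></body></html>')

# A selector that still matches returns the text
ok <- extract_text(page, ".forecast-text")

# A stale selector raises an error; here the message is captured for inspection
stale <- tryCatch(
  extract_text(page, ".old-class"),
  error = function(e) conditionMessage(e)
)
```

If you hit the error on a real page, rerunning SelectorGadget to find the current pattern is usually the quickest fix.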

To learn more about R, see our free Beginner's Guide to R PDF download. For more R screencasts, see the rest of my R in 5 Lines or Less series.

Copyright © 2015 IDG Communications, Inc.
