It's an oft-repeated lament that getting your data into shape for analysis and visualization typically takes more time than the actual analysis and visualization. Yet while there are lots of players in the analysis/visualization space, I've encountered fewer commercial or open-source products targeted specifically at data wrangling. (Open Refine comes first to mind; while platforms like Dataiku DSS and Microsoft Power BI also offer wrangling options, for many it's not their only focus.)
Enter Trifacta, whose sole purpose is to help get your data in shape for analysis in other tools such as Tableau.
What it does: The software handles transformations such as changing column data types, filtering based on various criteria, splitting columns on a delimiter, joining and aggregating multiple data sources, and reordering columns. (While reordering may not sound like a big deal, it can be considerably less annoying to click and drag than having to type out the name of 20+ columns in a script).
Trifacta generates a line of code for each drag-and-drop or click action you take, so you can then go in and tweak the script instead of having to do everything via the GUI. There are also additional, more robust functions you can perform via Trifacta's own Wrangle scripting language, such as calculating the difference between two date columns, which don't have a GUI menu option.
Each column within the Trifacta transform editor has a color bar over it showing data quality -- green for the proportion of rows in the column that have entries of the proper type (other colors represent missing records or those that don't appear to be the correct type). Clicking on a section of the bar brings up suggestions such as keeping all the valid data or deleting all rows with missing data in a specific column.
There's a histogram atop each column as well that gives you a basic idea of data distribution.
The free version of Trifacta will pull in .txt, .csv, .json, .log, .gz, .xls and .xlsx files up to 100 MB. The paid version offers more power, additional data sources such as Hadoop and Amazon S3, and functionality such as random sampling. The free version exports in CSV, JSON or TDE (Tableau Data Extract) format.
What's cool: Extract, Split and Replace "suggestion cards" offer regular-expression power without having to write your own regexps. If you highlight text in a column, Trifacta presents several suggested functions such as Extract or Split. When I tested this with a column of city, state data using a "Boston, MA" format, highlighting MA in one record offered easy ways to do some common transformations. For example, mousing over options at the bottom of one suggestion card showed choices such as extracting state abbreviations into a new column -- it recognized the ", MA" as a state abbreviation; other possibilities included extracting all capital letters from that column or selecting everything after a white space before the end of the character string.
The data-quality bar and histogram offer a quick and basic overview of a data set, while the column details view within Trifacta displays more statistical insights, such as median, average, standard deviation, lower and upper quartiles and minimum/maximum values.
Drawbacks: If you've got a large file, only a sample of the first 500KB of your file will appear. That's fine for manipulating and transforming the data, since when you choose to "Generate Results," your actions will be applied to the full data set. However, this is not fine if you assume the data quality and statistical summaries that appear with your data apply to the entire data set. This is especially important since this sample isn't a random sample but simply the first X rows of data, which might already be sorted somehow. Be very careful about relying on statistical summaries and data-quality visuals if working with large files in the Trifacta free version. Once you click Generate Results, you can choose to also export a statistical profile that does indeed apply to the entire file.
Any click-or-drag interface is limited; and while you can do a lot more by using Trifacta's own Wrangle language, you'll have to decide if it's worthwhile to invest that time, especially if you already know another scripting langue (although the Wrangle language doesn't look too complicated).
Finally, you need to sign into a Trifacta account to use the desktop software, which may make some people who work with sensitive data uneasy.
Skill level: Beginner.
Runs on: Windows and OS X.
Bottom line: Like any data product with a graphical user interface, it is easier to use than writing your own scripts from scratch; but also not nearly as flexible as if you were using a language like R. I remain biased toward command-line scripting when wrangling data, since that's always going to offer more power and flexibility. That said, though, I'm sure there are plenty of people who'd prefer to transform data via a graphical user interface. If that's you and you haven't yet found a platform of choice, Trifacta may be an option. Just be aware that beyond the basics, you'll likely need to do a bit of scripting; and if you've got a file larger than 500KB, don't trust the statistical summaries in the Transformer editor and wait until you've generated some results.
Looking for other tools? Check out my chart of 30+ free tools for data visualization and analysis.
Datafest! Highlights from Open Data Science Conference BostonNext Post
Esri offers free aid to agencies battling disasters
China's Sunway TaihuLight theoretical peak performance is 124.5 petaflops.
An unassuming option can change the way you think about mobile technology -- but only if you see it for...
A Virginia couple and four other people have been indicted for running an H-1B visa-for-sale scheme the...
Sponsored by Informatica
While the Amazon Echo and Google Home brought the digital assistant to the home, Staples wants to do...
As tech giants battle for cloud dominance, businesses looking to offload heavy-duty infrastructure...
If you want to protect sensitive data -- especially if you’re sending it via email or via an online...
The intrinsic value of AT&T's proposed $85 billion merger with Time Warner is based primarily on growth...