It's an oft-repeated lament that getting your data into shape for analysis and visualization typically takes more time than the actual analysis and visualization. Yet while there are lots of players in the analysis/visualization space, I've encountered fewer commercial or open-source products targeted specifically at data wrangling. (Open Refine comes first to mind; while platforms like Dataiku DSS and Microsoft Power BI also offer wrangling options, for many it's not their only focus.)
Enter Trifacta, whose sole purpose is to help get your data in shape for analysis in other tools such as Tableau.
What it does: The software handles transformations such as changing column data types, filtering based on various criteria, splitting columns on a delimiter, joining and aggregating multiple data sources, and reordering columns. (While reordering may not sound like a big deal, it can be considerably less annoying to click and drag than having to type out the name of 20+ columns in a script).
Trifacta generates a line of code for each drag-and-drop or click action you take, so you can then go in and tweak the script instead of having to do everything via the GUI. There are also additional, more robust functions you can perform via Trifacta's own Wrangle scripting language, such as calculating the difference between two date columns, which don't have a GUI menu option.
Each column within the Trifacta transform editor has a color bar over it showing data quality -- green for the proportion of rows in the column that have entries of the proper type (other colors represent missing records or those that don't appear to be the correct type). Clicking on a section of the bar brings up suggestions such as keeping all the valid data or deleting all rows with missing data in a specific column.
There's a histogram atop each column as well that gives you a basic idea of data distribution.
The free version of Trifacta will pull in .txt, .csv, .json, .log, .gz, .xls and .xlsx files up to 100 MB. The paid version offers more power, additional data sources such as Hadoop and Amazon S3, and functionality such as random sampling. The free version exports in CSV, JSON or TDE (Tableau Data Extract) format.
What's cool: Extract, Split and Replace "suggestion cards" offer regular-expression power without having to write your own regexps. If you highlight text in a column, Trifacta presents several suggested functions such as Extract or Split. When I tested this with a column of city, state data using a "Boston, MA" format, highlighting MA in one record offered easy ways to do some common transformations. For example, mousing over options at the bottom of one suggestion card showed choices such as extracting state abbreviations into a new column -- it recognized the ", MA" as a state abbreviation; other possibilities included extracting all capital letters from that column or selecting everything after a white space before the end of the character string.
The data-quality bar and histogram offer a quick and basic overview of a data set, while the column details view within Trifacta displays more statistical insights, such as median, average, standard deviation, lower and upper quartiles and minimum/maximum values.
Drawbacks: If you've got a large file, only a sample of the first 500KB of your file will appear. That's fine for manipulating and transforming the data, since when you choose to "Generate Results," your actions will be applied to the full data set. However, this is not fine if you assume the data quality and statistical summaries that appear with your data apply to the entire data set. This is especially important since this sample isn't a random sample but simply the first X rows of data, which might already be sorted somehow. Be very careful about relying on statistical summaries and data-quality visuals if working with large files in the Trifacta free version. Once you click Generate Results, you can choose to also export a statistical profile that does indeed apply to the entire file.
Any click-or-drag interface is limited; and while you can do a lot more by using Trifacta's own Wrangle language, you'll have to decide if it's worthwhile to invest that time, especially if you already know another scripting langue (although the Wrangle language doesn't look too complicated).
Finally, you need to sign into a Trifacta account to use the desktop software, which may make some people who work with sensitive data uneasy.
Skill level: Beginner.
Runs on: Windows and OS X.
Bottom line: Like any data product with a graphical user interface, it is easier to use than writing your own scripts from scratch; but also not nearly as flexible as if you were using a language like R. I remain biased toward command-line scripting when wrangling data, since that's always going to offer more power and flexibility. That said, though, I'm sure there are plenty of people who'd prefer to transform data via a graphical user interface. If that's you and you haven't yet found a platform of choice, Trifacta may be an option. Just be aware that beyond the basics, you'll likely need to do a bit of scripting; and if you've got a file larger than 500KB, don't trust the statistical summaries in the Transformer editor and wait until you've generated some results.
Looking for other tools? Check out my chart of 30+ free tools for data visualization and analysis.
China's Sunway TaihuLight theoretical peak performance is 124.5 petaflops.
This sortable chart lets you compare dozens of tools for functionality, skill level and more.
The Windows 10 Anniversary Update is due this summer -- but if you don’t want to wait, you can install...
If a President Obama-backed commission has its way, consumers will one day see cybersecurity ratings on...
Amazon Go could become a paradise for hackers.
A bit of automation can ease the PCI compliance burden.
With spending and hiring flat, CIOs seek efficiency in automation and outsourcing. Should IT workers be...