#StandWithUkraine

Google Refine for messy data cleanup

Google Refine icon When there is even slight potential that something in spreadsheet(-like) data can be generated or filled wrong - it usually will. Fixing up data to be useful usually involves spreadsheet editor, wasted time and a lot of evil thoughts radiated.

Google Refine looks loosely like yet another spreadsheet editor, but is built precisely for fixing and cleaning up data.

What it does

Unlike most Google products Refine is software that needs to be downloaded and run locally. Technically it works like a local web server, which from practical point of view means that its interface is a web page in your browser of choice.

Google Refine interface

Data can be imported and exported in variety of formats and in-between is saved as part of Refine project.

General workflow is essentially picking and using facets to isolate sets broken/undesirable values and editing, removing or otherwise getting rid of them.

Strong features

Facets can be picked from predefined suggestions or constructed from scratch, using functions from provided expression language. They can range from very simplistic (facet by presence of specific word) to very complex (facet by largest Unicode character code, present in value).

For textual values Refine offers clustering feature that runs values through different algorithms and suggests those that are similar to be grouped together.

While spreadsheets editors often rely on dynamic values (columns based on other columns in some way), Refine treats data as static and applies all changes immediately. However all changes go into a log, which allows to roll back to any point as well as export changes as set of instructions an run them differently or even in different project.

Downsides

Refine is not overly user-friendly application. It gives user a lot of control but cares little for some simple things that would make a difference. For example it is somewhat weak with type casting - as result empty (null) values tend to wreak havoc as inappropriate input for operators. Easily fixable by changing them into empty strings (see the meaningful difference? neither do I) but after umpteen times just makes you wish it did that on its own.

Web page as interface is surprisingly robust. Still there are some things that just don’t work usual way (like aimlessly scrolling through) and facets tend to evaporate when project is closed, which is huge pain when you just nailed that perfect setup for the task.

Overall

Google Refine is rare combination of functionality that is both specific and flexible, while free and actively developed. It has plenty of learning curve (start with introduction videos) but it pays off well by turning data cleanup into relatively smooth and almost enjoyable process.

Home http://code.google.com/p/google-refine/

Related Posts