TUTORIALS > OPENREFINE > Facets
Once you are comfortable with setting up facets, it is worth it to take a look at just a few of the more advanced methods offered by Refine to consolidate and clean your data for accurate analysis. This tutorial continues using the data from the previous examples, Census_data.csv.
1. In the examples in the previous tutorial, imposing uniformity on non-standardized record values (taking "M", "m", "married" and transforming them into "M") was simple given the small universe of values. But a lot of data consists of a hundreds, maybe even thousands of different values that must all be standardized and grouped in a coherent fashion. In this case, you'll need to use Refine's tools to help you group and transform your set of records with just a few steps.
2. One quick and important batch transformation you may need to do is removing certain characters or removing whitespace. It is often the case that the data you use, unlike the Census_data.csv file, will arrive in a rougher form and thus full of unnecessary characters. Data that has been mutated from text to datasheet, or from one file type to another, will often result in unnecessary characters such as commas, or whitespaces such as tabs and spaces. If you don't remove them, and your analytical software recognizes that whitespace as a character differentiating what would normally be two identical entries, your counts will be off.
To remove those characters, or even to perform basic formatting such as upper case or lower case, from the column header menu of the variable you select, go to Edit Cells >> Common Transformations. This menu will give you a variety of options, including stripping trailing and leading whitespaces.
1. Data with diverse variables (i.e. those with a wide range of types of entries) will need an even more in-depth text/string analysis to help you gather together and make uniform all of your entries. Luckily, Refine has pre-loaded a whole set of text-analysis algorithms--basically, approaches to allow a computer to understand and group text (aka string) entries. These can be accessed by creating a text facet on one of your columns (by way of the column header menu), selecting Facet >> Text Facet.
2. Try working on the variable "Trade," available in the third column from the right in the Census_data.csv file. Create a new text facet based on that variable. From your lefthand facet window, select Cluster on the Trade variable's Text Facet. You will bring up the following window:
Here, you will find all sorts of interesting ways to try to match up similar text strings that you may want to re-encode as a single value. This includes everything from n-grams (analysis of sequences of characters, words, or other clusters) to phonetic-based groupings (i.e. based on letter clusters that might be similar based on sound). What is nice is that you can instantly see what the results will be if you choose one algorithm over another, since the table above updates automatically to show you what will be grouped together based on the parameters you choose. Try playing around with changing both the method and keying function, and don't forget to look up what method is being used--all of these will be amply documented on the web.
3. Once you are pleased with the proposed clustering that Refine will perform, as well as the resulting New Cell Value, click the boxes next to the value pairs you want to merge. Try performing multiple different methods and key functions until you are satisfied with your resulting universe of values.
Building a Data Dictionary
One last neat little advantage to Refine comes in how it helps you quickly build a data dictionary defining all of your codes and presenting users with a list of all your variables and values. Once you have performed all your cleaning, clustering, and refining, you can easily export your list of values for each variable.
To do this, go to your facet on the lefthand window. On the active link that gives the number of "choices" (i.e. unique values available in your data's entries), click to bring up a window listing all those values and their counts. You will now see a quick list, easily copied and pasted elsewhere, of your universe of values and their counts (separated by a tab):
Once you have this information, making a complete data dictionary can be done in almost no time.