Cleaning Data with Open Refine

Remote participants, please use this link to attend this session: https://connect.clemson.edu/openrefine/

The purpose of this tutorial is to introduce you to powerful, yet user-friendly tools that allow you to clean, analyze, and visualize your tabular data. In this example we are interested in looking at the relationship between population, campaign contributions, and electoral votes results.

Exploring the Data: Elections

The following dataset named Elections_2012_2016.csv has been compiled from three different sources: Federal Election Commission, U.S. Census Bureau, and Wikipedia. Download this table to your computer.
Open the table and spend a few minutes getting familiar with the structure and the meaning of each field. Think about relations, patterns, questions that you would like to solve or visualize with this data.
Do you see any problems with the way the data has been collected?
Remember where you saved your table. We will be using it in our next step with Open Refine.

Transforming your data with OpenRefine

OpenRefine is a free, powerful tool that allows you to clean, discover, organize messy data and link it on the fly to other databases.
The main reasons for using OpenRefine are:
- It's free, open source, and your data is stored locally so you can clean your data without any privacy issues
- It combines the power of scripting with the simplicity of spreadsheets, which makes it more interactive and experimental than other tools
- User-friendly design that allows for easily detecting and fixing disparities in your data

File formats supported by OpenRefine are: Comma-Separated Values (CSV), Excel documents (.xls, .xlsx), Open Document Format spreadsheets (.ODS), JSON and XML. If you need other formats for your data, you can add them by way of OpenRefine extensions.
You can find excellent resources at: openrefine.org  and download the program here.

Importing data

Launch OpenRefine. This should automatically open your web browser to the correct address: 127.0.0.1:3333.

In OpenRefine, on the right-hand side, select Get data from This Computer and click the Browse... button. Locate the States data mentioned above and click Open.

Press the Next button to preview your data.

In the screen that appears, make sure to uncheck Store blank rows (bottom right corner of your screen).

If your data looks correct, press the Create Project button on the top right corner.

Transforming your data with GREL

One big difference between OpenRefine and Excel is the unit of interaction. In spreadsheets, cells are the main units of interaction. However, OpenRefine works with columns and rows instead. Editing usually happens one column at a time or across rows that meet a specific criteria.

The second main difference is that spreadsheets are primarily used for performing calculations. The goal of OpenRefine is to explore your data, look for patterns, and transform it.

Let's get familiar with common operations. For example, change the number of records displayed by selecting 50 records on the Show menu at the top center of your screen.

We have decided that the Rank on the states in 2010 is not important and want to delete the column. For that, click on the drop-down arrow at the left of the column title. Select Edit column > Remove this column.

We want to have the State names as the first column that appears. To move the States or territories column and rename it, click the drop-down arrow next to it, select Edit column > Move column to beginning. Rename the column to just State by selecting Edit column > Rename this column. Under Enter new column name, type: State. Click OK.

On your own: remove the column Census population, April 1, 2010.

Removing Characters from a String

Notice the strange A at the beginning of each State name. We want to remove the character and white space before each State name.

To clean the names, click the inverted triangle to the left of State, select Edit cells > Transform. In the window that appears, under Expression type: value.substring(2).

This function returns the text of the column starting with the character that has position 2 (remember the first one starts with 0) until the end of the text.

Try other combinations such as substring(1) and substring(0,4) to understand how this function works. For more information about working with text functions, refer to the GREL String Functions reference.

How would you subtract the first two characters of each State (Ca)?

GREL is the General Refine Expression Language (GREL) used to write transformation expressions in OpenRefine. However, you can also use Jython (implementation of Python designed to run on the Java platform) and Clojure (dialect of Lisp).

Type again value.substring(2) to get the full names on the State column. Click OK.
Tip: An easier way to do this is to import the data into OpenRefine typing UTF-8 in the Character Encoding box. To learn more about character encoding, visit: https://www.w3.org/International/questions/qa-what-is-encoding

Removing White Trailing Spaces

One big problem with geocoding names in the presence of white trailing spaces at the beginning or the end of the name. In your table, hover over the California cell and press edit. Notice there is a space after California. These are very hard to find, so instead of trying to find them manually, let's clean the entire column with a simple function. Click Cancel. 

For expression type:
value.trim(). Remember to make a cell transformation, go to Edit cells > Transform. 
You cannot notice the difference in this window. Click OK.


To find out how many records have been affected by your last operation, go to the Undo/Redo tab on the top left corner. Here you have a history of all the steps you have made so far. 

Notice your last step affected 5 state names. 

You can undo any step and redo by simply clicking on the previous step. 

If you press Extract, you can get the code for any of these steps in JSON format. This can be very useful if you want to use the same procedure with other datasets that have exactly the same format (perhaps for different election years).

Converting data from string to numeric values

Look at the Total seats in House column. The values are considered text. To change this, go to the column and select Edit cells > Common transformations > To number. Notice how the values are now green (way to represent numeric values in OpenRefine).

Exploring and Transforming your data using Facets: Democrats Vs. Republicans

Tale a look at the US Elections Results2012 column. This column contains information about how each state voted in the past 2012 elections. There should be only two values: D and R. Notice the variation. To know exactly how many different values we have, go to this column and select Facet > Text Facet.

You should see 13 different choices for the results. Notice the similarities. Next to each choice you have the number of records in your table with that choice.
To manually change the choices, click on D. and edit. In the box that appears, remove the dot (D) Click Apply. Notice how you have only 12 choices now.

We are going to cluster the data to combine the choices in a faster, coherent way.

In the same column, select Edit cells > Cluster and edit...

This tool allows you to find "groups" of cell values in your columns that are similar and cluster them using different algorithms.

Examine the clusters. Did it do a good job guessing similar values?

Click on Select All. Notice the Merge function checked for each cluster. In the New Cell Value, type R, D and R. Click Merge Selected & Close.

If you look at the Text Facet, you still have 4 choices instead of 2. Change manually CON to R and lib to D by selecting the choice on the Text Facet window and typing the new text for each.

Clean the US Elections Results 2016 column.
Instead of using the same steps, click on Undo/Redo tab and select Extract. Click Unselect All and select the last three operations. Copy the code. Click Apply, change the column name to the correct one and click Perform Operations.

Transforming currency data

Move the Contributions all candidates 2012 column to the left of the State Flag column by selecting: Edit column > Move column left twice. Notice the variety of values.

Think for a second how can we convert this column into numbers that are consistent in value.

First, let's get rid of $ in our records. One of the most common expressions is the replace function. 

Transform the column (Edit cells > Transform) by typing: value.replace("$",""). How many cells got transformed?

Now let's work with the cells that have the word million on them. We obviously want to get rid of the word, but also give them a numeric value, since 91(million) and 791,000 are not equivalent. We will achieve this with two steps:
Click on the Contributions column again, and create a custom test facet (Facet > Custom text facet) to locate all the rows with the word million in them:
value.contains("million")
Click OK. In the facet window, the true choice gives you the number of cells with that word. How many do you have?

Click on true to work only with those records. Now we are going to remove the word million, convert the value to number, and multiply it by 1000000 in just one line. 

In the Contributions column, select Edit cells > Transform and type the following expression:
toNumber(value.replace(" million",""))*1000000

How many records have you transformed? Remove the text facet to see the entire table. 

Notice the values in green that mean those cells are numbers. Select Edit cells > Common transforms > To number. This will transform two of the cells into numbers. However, some of them have not converted (still in black, like in West Virginia). What is the common format they have?

To get rid of the comma, which expression would you use that we have seen earlier? The replace expression. Go to Edit cells > Transform and type: value.replace(",","")
Transform your column to number again. Notice your results. Anything unusual in your values?

Extra: try to convert to number and replace the comma in one line using the transform function.

To make sure you don't have data inconsistencies, create a numeric facet (Facet > Numeric Facet). Notice the big variation in values. Is this expected? 

To make it more clear, close the numeric facet and create a log facet (Facet > Customized facets > Numeric log facet).

This clearly shows two values extremely low compared to the rest. 

Move the handles on the bar for the facet so you can see only two records. 

Obviously this is a data entry error: Texas and Florida did not raise $53.6 and $36.5. The numbers are millions. 

You can change these values manually. 

Make sure you convert your results to number.

On your own:

Follow the same procedure to transform the  Contributions all candidates in 2016 to proper to numbers.

Working with String Data - Review

Let's now focus on the States Electoral votes column. Move this column to the left of the State Flag column. Notice the data error in all entries. Remove the ! character using the replace function:
value.replace("!","")

Click OK. In the same column, click again on the down arrow and select Edit cells > Common Transforms > To number. Click OK. Try again as one line.

Based on the functions that we learned earlier, what would be another way to get the same values (removing the ! character and converting it to number)?

Solution: value.substring(5) - convert to number. Extra points: toNumber(value.substring(5))

On your own:

Transform Population estimates and Percent of total US pop to numbers. 

Solutions:
toNumber(value.replace(",",""))
toNumber(value.replace("%",""))
value.replace(/[^0-9.]/,"").toNumber()*0.01

Scatterplot Facet

Go to the Population estimates July 1, 2015 column and select: Facet > Scatterplot Facet. Notice the new window that appears plotting this column against other numeric values in your data.

Click on the graphic that shows population estimate vs. contributions all candidates. Notice how it gets inserted in your top left corner. Select the States with the highest values by drawing a rectangle around them and see your data get filtered by your selection.

Exporting your data and history

Once your data is ready for your next step in other programs such as Google Fusion Tables or Tableau, click the Export button on the top right corner and select Comma Separated Value. Save your File to your computer: States_pop_2015_elections2012-csv.csv

We will use this dataset for the visualization tutorial with Fusion Tables that you can find here. 

Tips

  • Look for patterns in your data by identifying the common problem in the cells or rows that need change.
  • Remember to use filters and/or facets to isolate the rows you want to change.
  • Concatenate syntax to produce a single command to make your edits.
  • Extract your operation history so you can:
    • Use the same transformation to similar documents
    • Others can perform the same series of procedures to similar data

OpenRefine Resources

OpenRefine Wiki: GREL Reference: GREL String Functions
OpenRefine Recipes
Tutorial by David Huynh
Tutorial by Keith Jenkins at Cornell University
    - clustering
SelectionFile type iconFile nameDescriptionSizeRevisionTimeUser

Download
 11kv. 1 Feb 8, 2017, 10:34 AMPatricia Dale

Download
 11kv. 1 Sep 6, 2016, 3:27 PMPatricia Dale