Remote participants, please use this link to attend this session:Â https://connect.clemson.edu/openrefine/ The purpose of this tutorial is to introduce you to powerful, yet user-friendly tools that allow you to clean, analyze, and visualize your tabular data. In this example we are interested in looking at the relationship between population, campaign contributions, and electoral votes results. The following dataset named Elections_2012_2016.csv has been compiled from three different sources: Federal Election Commission, U.S. Census Bureau, and Wikipedia. Download this table to your computer.Exploring the Data: ElectionsDo you see any problems with the way the data has been collected? Remember where you saved your table. We will be using it in our next step with Open Refine.Transforming your data with OpenRefineOpenRefine is a free, powerful tool that allows you to clean, discover, organize messy data and link it on the fly to other databases.The main reasons for using OpenRefine are: - It's free, open source, and your data is stored locally so you can clean your data without any privacy issues - It combines the power of scripting with the simplicity of spreadsheets, which makes it more interactive and experimental than other tools - User-friendly design that allows for easily detecting and fixing disparities in your data File formats supported by OpenRefine are: Comma-Separated Values (CSV), Excel documents (.xls, .xlsx), Open Document Format spreadsheets (.ODS), JSON and XML. If you need other formats for your data, you can add them by way of OpenRefine extensions. Launch OpenRefine. This should automatically open your web browser to the correct address: 127.0.0.1:3333. In OpenRefine, on the right-hand side, select Get data from This Computer and click the Browse... button. Locate the States data mentioned above and click Open. Press the Next button to preview your data. In the screen that appears, make sure to uncheck Store blank rows (bottom right corner of your screen). If your data looks correct, press the Create Project button on the top right corner. Transforming your data with GRELOne big difference between OpenRefine and Excel is the unit of interaction. In spreadsheets, cells are the main units of interaction. However, OpenRefine works with columns and rows instead. Editing usually happens one column at a time or across rows that meet a specific criteria. The second main difference is that spreadsheets are primarily used for performing calculations. The goal of OpenRefine is to explore your data, look for patterns, and transform it.
Let's get familiar with common operations. For example, change the number of records displayed by selecting 50 records on the Show menu at the top center of your screen. We have decided that the Rank on the states in 2010 is not important and want to delete the column. For that, click on the drop-down arrow at the left of the column title. Select Edit column > Remove this column. We want to have the State names as the first column that appears. To move the States or territories column and rename it, click the drop-down arrow next to it, select Edit column > Move column to beginning. Rename the column to just State by selecting Edit column > Rename this column. Under Enter new column name, type: State. Click OK. On your own: remove the column Census population, April 1, 2010. Removing Characters from a String Notice the strange A at the beginning of each State name. We want to remove the character and white space before each State name. To clean the names, click the inverted triangle to the left of State, select Edit cells > Transform. In the window that appears, under Expression type: value.substring(2). This function returns the text of the column starting with the character that has position 2 (remember the first one starts with 0) until the end of the text.Try other combinations such as substring(1) and substring(0,4) to understand how this function works. For more information about working with text functions, refer to the GREL String Functions reference. How would you subtract the first two characters of each State (Ca)? GREL is the General Refine Expression Language (GREL) used to write transformation expressions in OpenRefine. However, you can also use Jython (implementation of Python designed to run on the Java platform) and Clojure (dialect of Lisp). Type again value.substring(2) to get the full names on the State column. Click OK. Tip: An easier way to do this is to import the data into OpenRefine typing UTF-8 in the Character Encoding box. To learn more about character encoding, visit: https://www.w3.org/International/questions/qa-what-is-encoding Removing White Trailing SpacesOne big problem with geocoding names in the presence of white trailing spaces at the beginning or the end of the name. In your table, hover over the California cell and press edit. Notice there is a space after California. These are very hard to find, so instead of trying to find them manually, let's clean the entire column with a simple function. Click Cancel. For expression type: value.trim(). Remember to make a cell transformation, go to Edit cells > Transform. To find out how many records have been affected by your last operation, go to the Undo/Redo tab on the top left corner. Here you have a history of all the steps you have made so far. Notice your last step affected 5 state names. You can undo any step and redo by simply clicking on the previous step. If you press Extract, you can get the code for any of these steps in JSON format. This can be very useful if you want to use the same procedure with other datasets that have exactly the same format (perhaps for different election years). Look at the Total seats in House column. The values are considered text. To change this, go to the column and select Edit cells > Common transformations > To number. Notice how the values are now green (way to represent numeric values in OpenRefine). Tale a look at the US Elections Results2012 column. This column contains information about how each state voted in the past 2012 elections. There should be only two values: D and R. Notice the variation. To know exactly how many different values we have, go to this column and select Facet > Text Facet. You should see 13 different choices for the results. Notice the similarities. Next to each choice you have the number of records in your table with that choice. To manually change the choices, click on D. and edit. In the box that appears, remove the dot (D) Click Apply. Notice how you have only 12 choices now. We are going to cluster the data to combine the choices in a faster, coherent way. In the same column, select Edit cells > Cluster and edit... This tool allows you to find "groups" of cell values in your columns that are similar and cluster them using different algorithms. Examine the clusters. Did it do a good job guessing similar values? Click on Select All. Notice the Merge function checked for each cluster. In the New Cell Value, type R, D and R. Click Merge Selected & Close. If you look at the Text Facet, you still have 4 choices instead of 2. Change manually CON to R and lib to D by selecting the choice on the Text Facet window and typing the new text for each. Clean the US Elections Results 2016 column. Instead of using the same steps, click on Undo/Redo tab and select Extract. Click Unselect All and select the last three operations. Copy the code. Click Apply, change the column name to the correct one and click Perform Operations. Move the Contributions all candidates 2012 column to the left of the State Flag column by selecting: Edit column > Move column left twice. Notice the variety of values. Think for a second how can we convert this column into numbers that are consistent in value. First, let's get rid of $ in our records. One of the most common expressions is the replace function. Transform the column (Edit cells > Transform) by typing: value.replace("$",""). How many cells got transformed? Now let's work with the cells that have the word million on them. We obviously want to get rid of the word, but also give them a numeric value, since 91(million) and 791,000 are not equivalent. We will achieve this with two steps: Click on the Contributions column again, and create a custom test facet (Facet > Custom text facet) to locate all the rows with the word million in them: value.contains("million") Click OK. In the facet window, the true choice gives you the number of cells with that word. How many do you have? Click on true to work only with those records. Now we are going to remove the word million, convert the value to number, and multiply it by 1000000 in just one line. In the Contributions column, select Edit cells > Transform and type the following expression: toNumber(value.replace(" million",""))*1000000 How many records have you transformed? Remove the text facet to see the entire table. Notice the values in green that mean those cells are numbers. Select Edit cells > Common transforms > To number. This will transform two of the cells into numbers. However, some of them have not converted (still in black, like in West Virginia). What is the common format they have? To get rid of the comma, which expression would you use that we have seen earlier? The replace expression. Go to Edit cells > Transform and type: value.replace(",","") Transform your column to number again. Notice your results. Anything unusual in your values? Extra: try to convert to number and replace the comma in one line using the transform function. To make sure you don't have data inconsistencies, create a numeric facet (Facet > Numeric Facet). Notice the big variation in values. Is this expected? To make it more clear, close the numeric facet and create a log facet (Facet > Customized facets > Numeric log facet). This clearly shows two values extremely low compared to the rest. Move the handles on the bar for the facet so you can see only two records. Obviously this is a data entry error: Texas and Florida did not raise $53.6 and $36.5. The numbers are millions. You can change these values manually. Make sure you convert your results to number. On your own:Follow the same procedure to transform the Contributions all candidates in 2016 to proper to numbers. Let's now focus on the States Electoral votes column. Move this column to the left of the State Flag column. Notice the data error in all entries. Remove the ! character using the replace function: value.replace("!","") Click OK. In the same column, click again on the down arrow and select Edit cells > Common Transforms > To number. Click OK. Try again as one line. Based on the functions that we learned earlier, what would be another way to get the same values (removing the ! character and converting it to number)? Solution: value.substring(5) - convert to number. Extra points: toNumber(value.substring(5)) Transform Population estimates and Percent of total US pop to numbers. Solutions: toNumber(value.replace(",","")) toNumber(value.replace("%","")) value.replace(/[^0-9.]/,"").toNumber()*0.01 Scatterplot FacetGo to the Population estimates July 1, 2015 column and select: Facet > Scatterplot Facet. Notice the new window that appears plotting this column against other numeric values in your data. Click on the graphic that shows population estimate vs. contributions all candidates. Notice how it gets inserted in your top left corner. Select the States with the highest values by drawing a rectangle around them and see your data get filtered by your selection. Exporting your data and historyOnce your data is ready for your next step in other programs such as Google Fusion Tables or Tableau, click the Export button on the top right corner and select Comma Separated Value. Save your File to your computer: States_pop_2015_elections2012-csv.csv We will use this dataset for the visualization tutorial with Fusion Tables that you can find here.Â
OpenRefine ResourcesOpenRefine Recipes Tutorial by David Huynh Merging Datasets with Common Columns by Tony Hirst Tutorial by Keith Jenkins at Cornell University |
Cleaning Data with Open Refine
Subpages (1):Pivot Tables & OpenRefine