Tidy Data Case Studies Using R
In the previous article in this tidy data series, I introduced the concept of tidy data and highlighted the principles of tidy data. If you have not read the last article, you can read it here.
Before we dive right into these case studies, let me refresh your knowledge about the guiding principles of tidy data. The important principles of tidy data are:
- Each column should hold a single variable
- Each row should hold a single observation.
- Each cell should hold a single value (if you follow the first two rules, then each cell should have a single value)
With this refresher, we will get our hands dirty with illustrations of the principles mentioned earlier in this article.
Since it can be hard to read in the data and get it to conform to the format, you ultimately need to answer a data-driven question. Due to this, conceptual frameworks and software packages to make these steps easier have been developed.
Hadley Wickham (Chief Scientist at RStudio), who is the godfather of the tidyverse (put a link to the official documentation here) collection of R packages, created the tidyr package to help you go from messy to tidy data. The other packages in the tidyverse family are designed around the idea that your data is in a tidy format. They work so well because tidy datasets are all alike, while every messy dataset is just messy in its way.
Let us now see how different functions in the tidyr package can help turn messy data to conform to the tidy data principles mentioned above.
Let us get started with some preambles.
First, to set out, we will install the tidyverse package since it contains the functions we will use for our work here.
Before we go on, note that I will not exhaustively show you how to use each function (with all its arguments) since this is an introduction to the tidyr package. However, this article will quickly build a solid foundation to pick up these functions for more use cases.
You can find all the data sets and script here on GitHub for reproducibility.
If you are ready and excited. Let us enjoy this ride together.
Principle One: Each Variable is stored in a column
To obtain data that conforms to this rule, we will use the pivot_longer(), and pivot_wider() functions.
a. Pivot longer
pivot_longer() — “lengthens” data, increasing the number of rows and decreasing the number of columns. Its’ opposite is pivot_wider().
It is a good practice to read the documentation of a new function you have not used before. Here, we will read the documentation of the pivot_longer().
Doing this will open up documentation to read about the pivot_longer() function, including its arguments and some examples of its uses.
The next thing is to create a tibble. A tibble has similar properties to a data frame object, but it is easier to work with tibbles, especially with the tidyverse collection of R packages.
We will call the tibble object table1. The tibble stores information on the number of disease cases in 1999 and 2000 for three countries.
This data violates two tidy principles; 1) the column names are not correctly given 2) the years (1999 and 2000) should be in one column.
We will now use pivot_longer() to reshape the data and solve these violations.
Here, the %>% is called the pipe operator, and it is used to chain functions together in sequential and hierarchical order.
From the above, we have been able to pivot on the years 1999 and 2000 to get the output below:
Now, this data is in better shape, and it is easier to see the information that it contains. Another way to achieve the above is to pivot on all other columns except the country column. Doing this is extremely useful when you have a large dataset and do not have the luxury of typing all the columns to pivot on. We will achieve this using:
Much better, right? The -country above means that pivot on all other columns except the country column.
Did you notice that the new columns created do not have the correct column names? The proper column names should be year and n_cases (for the number of cases). Let us see how to achieve this:
Bravo! Quite intuitive, right? I just called two arguments inside of the pivot_longer() function; names_to and values_to.
I believe this suffices for now. Although, there are a whole host of things that you can still do. You can check the data tidying with tidyr cheatsheet here.
Before I stop the demonstration about the pivot_longer() function, it is important to note that another function inside the tidyr package does a similar job as the pivot_longer(). That is is the gather() function. Let us briefly see a use case here:
Similar results, yeah? However, they use different arguments to rename the columns. Beautiful!
Let us go on to the pivot_wider() function that works in the opposite direction to the pivot_longer().
b. Pivot wider
pivot_wider() — “widens” data, increasing the number of columns and decreasing the number of rows.
Again, you can read the documentation of the pivot_wider() function inside of your RStudio using ?pivot_wider
To demonstrate the use of pivot_wider(), we will read in a data set that contains information about the different planets we have.
A quick note here; before you can load the readr package using library(readr), you must have the package installed. Here, it is assumed that I have installed the package.
Note: In R, you can install any package by using install.packages(“package_name). To install the readr package for example, you can do install.packages(“readr”).
Another thing to note is that the data you have downloaded must be in your current working directory to load the data successfully. I suggest you copy this planet-data.csv file and other files from your downloads to your current directory.
I have used the read_csv() function against the base R read.csv() because I want to import this data as a tibble and not a data frame. You can execute this vignette(“tibble”) to read a full vignette about tibbles and how they differ from data.frame.
With this, let us see the result of the first six rows of the planet_df data.
You will notice that there is much information here, but we cannot see that information at a glance. Without a doubt, this is messy data. Now, let us deal with this using the pivot_wider() function.
Great, isn’t it? Do you notice a similarity between the arguments of the pivot_longer() and this pivot_wider()? I believe that was quite understandable.
Again, even though we have changed the long dataset that was not clear at first glance to a wide (rectangular) format, this data is not entirely tidy because we still have some missing values. Let us solve this problem by doing:
Could you identify the difference? I added the planet column inside the pivot_wider() as the column to be used to pivot. I believe that made sense to you.
Just like the pivot_longer(), the pivot_wider() also has a function that can perform the same role called the spread() function. You can check out how that works.
However, the pivot_longer() and pivot_wider() functions are better and preferred alternatives to the gather() and spread() functions respectively.
Before I end the discussion here, I must talk about how to unstack data from long to wide format. I have had to stay over 4 hours trying to solve this kind of problem in my R career. However, learning about this can help you achieve the proper data format in less than 2 minutes.
Let us begin to learn about unstacking data.
We will load an inbuilt R dataset called PlantGrowth and explore it. The data set results from an experiment to compare yields (as measured by the dried weight of plants) obtained under control and two different treatment conditions.
We will use data (“dataset_name”) to load an inbuilt R data set into the R workspace. In this case, we will use data(“PlantGrowth”) to load the PlantGrowth data into R.
Here, the head and tail functions will return the first and last five rows, respectively. It is so here because we specified 5. Usually, the head() and tail() functions, by default, return the first six and last six rows, respectively.
So, you can see that we have the weight of the plants and the treatment groups. The goal is to reshape this data such that each treatment group becomes the column with their respective weight values.
To achieve this, we will use the unstack() function — Stacking vectors concatenates multiple vectors into a single vector along with a factor indicating where each observation originated. Unstacking reverses this operation.
Amazing! Just with a simple call of the unstack() function, the PlantGrowth data looks to be in better shape. Now, we can see the weight values for the different treatment groups.
To solidify this understanding, let us consider another example with the DIETS dataset.
The diet column has two categories, LOWFAT and REGULAR diets, with each person’s weight loss. When the survey was performed, 100 participants were put into one of these two categories. Now, we want to make the categories of diet become columns with their respective weight loss values.
You would notice an additional argument in the unstack function — WTLOSS ~ DIET. This is called the form argument, a two-sided formula whose left side evaluates to the vector to be unstacked and whose right side evaluates to the indicator of the groups to create.
Here WTLOSS is the vector to be unstacked, and DIET is a categorical variable that will become columns. We can see that the numeric variable comes first in the formula, then the categorical variable comes next. I have put this here in situations where the unstack function does not work correctly. When you encounter such problems, specify this form argument.
From the first five rows, we can see that LOWFAT and REGULAR are the columns with their respective WTLOSS values.
Wow! Amazing! I believe that was quite interesting to walk through. Let us go on to the following tidy data principle.
Principle Two: Each Observation is stored in a row
separate_rows()
separate_rows() — If a variable contains observations with multiple delimited values, this separates the values and places each one in its row.
Similarly, you can read the documentation of the separate_rows() function.
Now, let us read in a data set that contains information about the different movies on Netflix with their title and cast names.
The cast column defiles this second principle of tidy data because each observation stores more than one value. In the next series of illustrations, we will examine this function using different examples.
Brace up for some beautiful ride!
Now, let us see how to deal with the messiness in the cast column, such that each cast or actor will be an observation.
What the above is saying is that we want to separate the cast column into rows. However, you will notice the sep argument, which tells us what to separate the column on. Notice that the separator is “, ”. The actors in the cast column were separated in the original data with a comma and a space.
Now, let us see the result.
Beautiful! The result shows tidy data. With this, we can decide to find the top six actors with the most appearances. Here, we will use some dplyr verbs or functions.
Note: The dplyr package is under the umbrella of a family of tidyverse packages. The dplyr package is very powerful, and in fact, it is the most useful package in R for data manipulation. It is my all-time best package for data manipulation. You can check out my project-based course on the Coursera titled Data Manipulation with dplyr in R here. I covered the use of the dplyr package in that course.
So, to find the top six actors with the most appearances, we will chain a couple of functions together using the pipe operator. Let us see how to achieve that result.
Remember that I mentioned that the pipe function works sequentially. Therefore, the above code separates the cast columns into a new row of observations. Then, it renames the cast column to actor. The rename() verb or function helps to rename a column name (the new name comes first in that order). Finally, it counts the number of cases (now actors) and returns the result in descending order since we are interested in the top six actors. The head() function returns the first six rows of a data set.
The output of the above is given below:
Amazing! Takahiro Sakurai appeared 18 times in this data set, Yuki Kaji 16 times, and other actors as shown. This example here emphasizes the need for data to be in the correct format. Having data in the proper format will aid seamless analysis.
Principle Three: Each cell stores a single value
separate()
separate() — turns a single character column into multiple columns.
Let us do some demonstrations here.
We will use the movies_durations data set here. It contains information about the different movies on Netflix with their title, movie type, and duration.
From the result above, you will notice that the duration column has both the time value and the unit of time. It is not tidy data.
We will use the separate() function to separate the duration into two columns — value and unit.
Recall the sep argument from separate_rows(). If you look at the duration column from above, you will notice that the separator between the time value and unit is a space, which is why sep = ‘ ’. The convert = TRUE argument will change the resulting columns to their correct data type upon separation. For example, the value column will be an integer upon separation. If the convert = TRUE argument is not specified, the value column will have the character data type (which is not the correct data type).
We have a beautiful result below:
Nice work there! Now the value and unit are in their respective columns with the right data type. Since the data is in a tidy format, we can decide to find the average duration for each movie type and time unit as shown below:
Again, tidy data enhances insightful analysis. So, we can we that the mean time duration for movies in this dataset is 98.6 minutes.
Before we leave this section, it is important to quickly mention the unite() function, which does the opposite of the separate() function.
Let us merge the title and type columns using the sep = “ - ” and call the newly merged column title_type.
Nice!!! As expected, the title and type columns are now in one column and are separated by a hyphen (-).
separate_rows() & separate()
Let us see a use case of the separate_rows() and separate() functions together to tidy the drinks_df data. The data contains three different drinks and the ingredients used to make the drinks. The data is an excel file; so we will use the read_excel() to read in the file.
The ingredients column is quite messy. We can first decide to separate the rows so that every ingredient will be on a separate row. Notice that the separator here is a semicolon and a space. The code below will help us achieve this:
One step is done. We have successfully put every ingredient of a drink on a new row. However, you will notice that the ingredients column has three values in each cell. We want to create three different columns from this ingredients column.
To get this done, we will pipe the separate_rows() and separate() together. We will separate the ingredients column into three columns ingredient, quantity, and unit.
Now, we have a tidy data set where we have the drink, the ingredient, quantity, and unit. You will notice a warning message. Since oranges do not have a unit, it fills that cell with NA. You do not have to worry about this.
With this, you have learned some excellent ideas here for your future data analysis work. In addition, because I did not want this article to be unneccessarily long and boring, I decided not to include how to deal with missing data, how to clean column names, how to merge data using tidyverse packages and tidyverse adjacent packages.
Summary
Wow! It has been a long stretch of reading and hands-on work with quality explanations. In this article, we have looked at the idea of tidy data and have demonstrated how to achieve different tidy data formats. Using a consistent tidy data format allows you to build tools that work well within this framework (data science cycle) as a data-inclined person. Ultimately, this simplifies the data wrangling, visualization, and analysis processes.
As a data enthusiast, you have a leg ahead when you start your data science project with data already in a correct (tidy) format or when you spend time at the beginning of a project to get your data into a tidy format. The remaining steps project will be like a piece of cake. I hope you enjoyed this. Please drop your comments in the comment section below and share this article with others. Don’t forget to give this the maximum number of claps.
As a project-based course instructor with Coursera Guided Project Network, I have taught a couple of courses on using R. You can check them out here to take any of these courses. Thank You! See you at another time! See you at another time!
References
- Chapter 1 Introduction to the Tidyverse | Tidyverse Skills …. https://jhudatascience.org/tidyversecourse/intro.html
- Tidy Data with tidyr — R for Data Science [Book]. https://www.oreilly.com/library/view/r-for-data/9781491910382/ch09.html