A Gentle Introduction to Tidy Data in R
Recently, I have been reading and taking some courses on the concept of tidy data. My discoveries motivated me to write this article about tidy data because data must be in the proper format (tidy) before analysis can be done.
In this series of articles, I will share some thoughts about tidy data, and we will walk through tidying different datasets so that they conform to the tidy data principles.
Let us get right to it.
The data science life cycle usually begins with a question that data can answer, and it ends with the answer to that question. As a data scientist or data-inclined person, you must take a series of steps after formulating the question and before arriving at an answer or a solution. Some of these steps are:
- Determine what data will be useful
- Import the data
- Tidy the data into a format that is easy to work with
- Explore the data
- Generate insightful visualizations
- Carry out the analysis, and
- Communicate your findings.
Iterating through this process defines the data science life cycle at a high level. The tidying step takes a large share of the effort: it is often said that 50–80% of a data scientist’s time is spent wrangling data and getting it into the right format.
In this first article, I will talk about tidy data with an emphasis on tidy data principles. Then, in the following article, I will show you how to use R packages such as tidyr to apply these tidy data principles to some datasets.
What is Tidy Data?
While reading through several materials, I found this definition by Hadley Wickham (Chief Scientist at RStudio) fascinating.
“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.” ~ Hadley Wickham, “Tidy Data” (2014)
I love this definition so much because it captures the primary building blocks that form the foundation for the tidy data principles that we will look at shortly. So, when it comes to thinking about tidy data, one thing that should guide your thoughts is that tidy data are rectangular data.
One way to think about tidy data is that it has to look like a rectangle, with each variable/feature in a separate column, each entry/observation in a different row, and a value in every cell.
Therefore, if you are working with a dataset and attempting to tidy it, the data has to look like a rectangle at the end of the process. If not, then more work is likely needed before the data is genuinely tidy.
On a final note here, when data is already in a tidy format, or you have spent time at the start of a data science project to get data into a tidy format, the remaining steps of your project will be more straightforward.
Now, let’s consider the guiding principles of tidy data.
Principles of Tidy Data
To open the discussion of the principles of tidy data, I will start with these two quotes:
“Happy families are all alike, but every unhappy family is unhappy in its own way.” ~ Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” ~ Hadley Wickham
The above quotes imply that tidy datasets, by design, are easier to manipulate, model, and visualize because the tidy data principles that we will discuss here impose a general framework and a consistent set of rules on data.
As I have mentioned, the tidy data format has a rectangular shape, which means it has columns, rows, and cells, just like a spreadsheet. However, for rectangular data to be tidy, there are three major principles to adhere to:
- Each column should hold a single variable.
- Each row should hold a single observation.
- Each cell should hold a single value (if you follow the first two rules, then each cell should have a single value).
In addition, other tidy data principles are worthy of mention here:
- There should be one spreadsheet for each type of data.
- No cell should contain a missing value.
- If you have multiple spreadsheets, each one should include a column with the same column label on which they can be joined or merged.
Now, let’s break down each one to ensure that we are on the same page.
Principle 1: Each variable is stored in a single column
In this example, these variables are the CustomerID, CustomerName, Segment, Age, Country, City, State, PostalCode, and Region of different customers.
Principle 2: Each observation is stored in a different row
In this example, each customer is an observation.
Principle 3: Each cell stores a single value
In this example, each cell stores a single value because we adhered to the first two tidy data principles.
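The three principles above can be sketched in a few lines of base R. This is only an illustration: the column names come from the customer example, but the three customers and all of their values are invented, and the `messy` table with a combined `CityState` column is a hypothetical counter-example.

```r
# Hypothetical customer records; column names follow the example above,
# the values are invented for illustration.
customers <- data.frame(
  CustomerID   = c("C001", "C002", "C003"),
  CustomerName = c("Ada Obi", "Ben Cole", "Cara Diaz"),
  Segment      = c("Consumer", "Corporate", "Consumer"),
  Age          = c(34, 45, 29),
  Country      = c("United States", "United States", "United States"),
  City         = c("Henderson", "Los Angeles", "Seattle"),
  State        = c("Kentucky", "California", "Washington"),
  PostalCode   = c("42420", "90036", "98103"),
  Region       = c("South", "West", "West"),
  stringsAsFactors = FALSE
)

# Principle 1: each column holds one variable.
names(customers)

# Principle 2: each row is one customer, so no CustomerID repeats.
anyDuplicated(customers$CustomerID) == 0

# Principle 3: a cell holding two values breaks the rule.
messy <- data.frame(
  CustomerID = c("C001", "C002"),
  CityState  = c("Henderson, Kentucky", "Los Angeles, California"),
  stringsAsFactors = FALSE
)

# Split the combined cell so that each cell holds a single value.
parts <- strsplit(messy$CityState, ", ", fixed = TRUE)
tidy_cities <- data.frame(
  CustomerID = messy$CustomerID,
  City       = sapply(parts, `[`, 1),
  State      = sapply(parts, `[`, 2),
  stringsAsFactors = FALSE
)
print(tidy_cities)
```

Splitting `CityState` into `City` and `State` is what turns the messy table into one that satisfies all three principles at once.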
To end this article, let’s also see examples of other tidy data principles that can be helpful.
Principle 4: There should be one spreadsheet for each type of data
In this example, the first spreadsheet (Sales Data) stores the information about sales and purchases of each customer. The second spreadsheet (Customers Data) stores the demographics of each customer.
Principle 5: If you have multiple spreadsheets, each one should include a column with the same column label that allows them to be joined or merged
In this example, the first spreadsheet (Sales Data) and the second spreadsheet (Customers Data) have a common column (CustomerID) on which the two datasets can be joined or merged.
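Principles 4 and 5 can be sketched together with base R's `merge()` function. The two small tables below are hypothetical stand-ins for the Sales Data and Customers Data spreadsheets; only the shared `CustomerID` column is taken from the example, and every other column and value is an assumption made for illustration.

```r
# Hypothetical Sales Data: one row per sale.
sales <- data.frame(
  OrderID    = c("O-1", "O-2", "O-3"),
  CustomerID = c("C001", "C002", "C001"),
  Sales      = c(250.0, 99.5, 40.0),
  stringsAsFactors = FALSE
)

# Hypothetical Customers Data: one row per customer.
customers <- data.frame(
  CustomerID   = c("C001", "C002"),
  CustomerName = c("Ada Obi", "Ben Cole"),
  Segment      = c("Consumer", "Corporate"),
  stringsAsFactors = FALSE
)

# Join the two tables on the shared CustomerID column.
sales_with_customers <- merge(sales, customers, by = "CustomerID")
print(sales_with_customers)
```

Because the customer details live in their own table, a customer's name is stored once and is attached to each sale only when the join is needed.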
Principle 6: No cell should contain a missing value
In this example, there are missing values in the ArtistId column, so this dataset (Album Data) is not in tidy format. We will learn how to deal with missing values in the next article, where we will do some hands-on work using R.
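As a small preview, here is one way such missing values could be spotted in base R. The sketch only detects them; it does not decide how to handle them. `ArtistId` is the column named in the example, while `AlbumId`, `Title`, and all of the values are assumptions made for illustration.

```r
# Hypothetical Album Data; ArtistId is the column named above,
# AlbumId, Title and all values are invented for illustration.
albums <- data.frame(
  AlbumId  = c(1, 2, 3, 4),
  Title    = c("Album A", "Album B", "Album C", "Album D"),
  ArtistId = c(10, NA, 12, NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
colSums(is.na(albums))

# Show the rows that contain at least one missing value.
albums[!complete.cases(albums), ]
```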
We have just seen a gentle introduction to the concept of tidy data. This understanding is very useful for you as a data enthusiast or data expert.
Find more ideas on tidy data in the book Tidyverse Skills for Data Science in R by Carrie Wright, Shannon Ellis, Stephanie Hicks and Roger D. Peng.
If you would like to see how these tidy data principles are implemented using some packages in R, read my next article.
Thank you for taking the time to read this article. I hope you learned some useful ideas here.
As a project-based course instructor with Coursera Guided Project Network, I have taught a couple of courses on using R. Thank you, and see you soon!
References
- Chapter 1 Introduction to the Tidyverse | Tidyverse Skills …. https://jhudatascience.org/tidyversecourse/intro.html
- Tidy Data with tidyr — R for Data Science [Book]. https://www.oreilly.com/library/view/r-for-data/9781491910382/ch09.html