What is ragged data (and why do we care)?
By Andy Kerfonta
Ragged data (also known as a ragged matrix, irregular data, non-rectangular data, or an irregular matrix) is simply a dataset in which rows can have different numbers of columns. Why is this useful? Let’s look at a quick scenario using some baseball data.
We’ll start with two files: Master.csv and Salaries.csv. Master.csv contains a list of all players with their personal information, one row per player. Salaries.csv contains a list of all recorded salaries, one row per salary, so each player may have many rows or none at all.
What happens if we want to merge (join) these two tables? With traditional rectangular data, you end up with either duplicated players in a vertical table (one row per salary) or a lot of nulls in a more horizontal table (every player gets the same number of salary fields, some of them unused). Here is what happens when Transdata is used to create ragged data:
What we end up with is something much more logical: each player gets a single record with all available salary data. We can then reference particular salary values or do operations over them all (like average or sum by record). It is even possible to do something simple yet useful such as finding the average of every player’s first salary.
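To make the idea concrete, here is a minimal sketch in plain Python. It is not Transdata syntax. The player IDs, names, and salary figures are invented for illustration, and the field names `playerID`, `name`, and `salary` are assumptions about how the two files might be laid out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-ins for rows read from Master.csv and Salaries.csv.
players = [
    {"playerID": "p01", "name": "Player One"},
    {"playerID": "p02", "name": "Player Two"},
    {"playerID": "p03", "name": "Player Three"},  # no recorded salaries
]
salaries = [
    {"playerID": "p01", "salary": 100000},
    {"playerID": "p01", "salary": 150000},
    {"playerID": "p01", "salary": 200000},
    {"playerID": "p02", "salary": 300000},
]

# Group salaries by player, preserving input order.
by_player = defaultdict(list)
for row in salaries:
    by_player[row["playerID"]].append(row["salary"])

# One record per player; the salary list can have a different length on each record.
ragged = [
    {**p, "salaries": by_player.get(p["playerID"], [])}
    for p in players
]

# Operation within each record: average salary, or None when there are none.
per_player_avg = {
    r["playerID"]: (mean(r["salaries"]) if r["salaries"] else None)
    for r in ragged
}

# Operation across records: the average of every player's first salary.
first_salaries = [r["salaries"][0] for r in ragged if r["salaries"]]
avg_first_salary = mean(first_salaries)
```

Each player appears exactly once, nothing is padded with nulls, and a player with no salaries simply carries an empty list.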
So why is ragged data important? The most obvious answer is that it makes merging data a lot simpler and more intuitive. SQL JOINs are easily derailed by messy or inconsistent input data. Because ragged records have no fixed shape, fields cannot be matched by position and must be matched by name instead. This means input data can be ordered differently, or even be missing fields altogether, without breaking merges. Data integrity is maintained.
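Name-based matching can be sketched with Python's standard `csv` module. This is only an illustration of the principle, not how Transdata implements it. The two inputs, the `team` and `city` fields, and the helper `merge_by_key` are all hypothetical.

```python
import csv
import io

# Two hypothetical inputs. The second lists its columns in a different
# order and has no "name" or "city" field at all.
source_a = "playerID,name,city\np01,Player One,Boston\np02,Player Two,Denver\n"
source_b = "team,playerID\nRed,p02\nBlue,p03\n"

def merge_by_key(key, *sources):
    """Merge CSV texts into one record per key, matching fields by name."""
    merged = {}
    for text in sources:
        # DictReader maps each value to its header name, so column
        # position never matters.
        for row in csv.DictReader(io.StringIO(text)):
            merged.setdefault(row[key], {}).update(row)
    return merged

records = merge_by_key("playerID", source_a, source_b)

# Records keep only the fields that were actually present for them.
has_team = {pid: "team" in rec for pid, rec in records.items()}
```

Here `records["p02"]` carries fields from both inputs, while `records["p01"]` has no `team` and `records["p03"]` has no `name` or `city`. No record is padded to fit a common shape.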
Ragged data can also reduce or eliminate data transpositions (swapping of columns and rows). In SAS, “…it is more efficient to store your data in a vertical format and processing the data is easier in a horizontal format.” The same limitation holds true for most rectangular data software. Merging and operating on ragged data directly can significantly reduce the number of steps required for even simple transformations, and with it the time spent.