Day 29: Join your tables to make a sorted boxplot
Joining tables is an art, and the benefits and consequences of a join are not always clear. Here, I’ll describe a simple case that works well even for a beginner learning to use tables.
Joining for enhanced boxplots
Here, I’ll use the generic join to spruce up a boxplot and accentuate a pattern in the data. First, load up some data from MATLAB’s standard library.
load census1994
adultdata_subset = adultdata(adultdata.capital_gain > 0, :);
I created a subset of the data that contains only the entries where individuals reported a capital gain greater than zero. Next, draw a boxplot grouped by race, limit the data range to $0–$30,000, and clip any outliers to that limit. The camroll call then turns the plot on its side.
figure('color','w');
boxplot( adultdata_subset.capital_gain, adultdata_subset.race, 'datalim', [0,30000], 'extrememode', 'clip' )
ylabel('Capital gain ($)')
camroll(-90)
There’s clearly a disparity in the capital gain amount based on race, but we can make the pattern stand out much more by sorting the boxes.
Ordering a boxplot
To improve the presentation of this data, let’s order the boxes by median, with the lowest median near the top and the medians increasing as you go down.
This requires five easy steps:
- Create a separate table newtable using varfun that will contain the median value for each race.
- Sort newtable by the median value, and store the sorting order in a separate column.
- Create a ranking of the sorting order by using the sort-twice trick.
- Innerjoin the original data table with newtable.
- Plot and add the sorted column of ‘race’ as x-tick labels.
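The sort-twice trick in step 3 is easiest to see on a small vector. This is a minimal sketch with made-up numbers; the names medians, order, and ranks are only for illustration.

```matlab
% Hypothetical median values, one per group (not taken from the census data)
medians = [30; 10; 20; 40];

% First sort: order(k) is the index of the k-th smallest value
[~, order] = sort(medians);     % order = [2; 3; 1; 4]

% Second sort: ranks(i) is the position of element i in the sorted list
[~, ranks] = sort(order);       % ranks = [3; 1; 2; 4]
```

The first sort tells you which element comes k-th; the second inverts that permutation, so each element gets its own rank.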
Here’s the full process:
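A minimal sketch of the five steps, assuming the Statistics and Machine Learning Toolbox is available (it provides census1994 and boxplot). The names ranking_table and rank follow the description in this post; sort_order, rank_of_row, joined, and sorted_table are placeholder names chosen for this sketch.

```matlab
load census1994
adultdata_subset = adultdata(adultdata.capital_gain > 0, :);

% Step 1: one row per race with the median capital gain.
% varfun names the result column median_capital_gain.
ranking_table = varfun(@median, adultdata_subset, ...
    'InputVariables', 'capital_gain', 'GroupingVariables', 'race');

% Step 2: sorting order of the medians (ascending)
[~, sort_order] = sort(ranking_table.median_capital_gain);

% Step 3: sort twice to turn the sorting order into a rank per row
[~, rank_of_row] = sort(sort_order);
ranking_table.rank = rank_of_row;

% Step 4: attach the rank to every row of the original data
joined = innerjoin(adultdata_subset, ranking_table(:, {'race', 'rank'}), ...
    'Keys', 'race');
joined = sortrows(joined, 'rank');   % keep the groups in rank order

% Step 5: plot grouped by rank, then label the boxes with the race names
figure('color', 'w');
boxplot(joined.capital_gain, joined.rank, ...
    'datalim', [0, 30000], 'extrememode', 'clip')
sorted_table = sortrows(ranking_table, 'rank');
set(gca, 'XTickLabel', cellstr(sorted_table.race))
ylabel('Capital gain ($)')
camroll(-90)
```

Only the race and rank columns are passed to innerjoin, so the joined table gains a single new variable.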
The join copies the values from ranking_table (the newtable of the steps above) into every row of adultdata_subset whose “race” entry matches.
By appending the “rank” column, we are able to, in essence, create a new grouping for our data and MATLAB’s boxplot function leverages that to create an ordered display.
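To see how a join distributes values, here is a small sketch on toy tables. The tables people and lookup and their contents are invented for illustration.

```matlab
% Toy data table: several rows can share the same group
people = table({'a'; 'b'; 'a'; 'c'}, [10; 20; 30; 40], ...
    'VariableNames', {'group', 'value'});

% Toy lookup table: exactly one row per group
lookup = table({'a'; 'b'; 'c'}, [2; 1; 3], ...
    'VariableNames', {'group', 'rank'});

% Each row of people receives the rank of its matching group
joined_toy = join(people, lookup, 'Keys', 'group');
disp(joined_toy)
```

Both rows with group 'a' receive the same rank, which is exactly what turns the rank column into a usable grouping variable.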
When things get complicated
Real tables often have different columns, unmatched rows, and other idiosyncrasies. Getting good at joining them can take weeks to months of practice, and the innerjoin and outerjoin functions offer dozens of additional ways to handle such cases.
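As a small taste of the difference, this sketch joins two toy tables whose keys only partly overlap. The tables left_tbl and right_tbl are hypothetical.

```matlab
% Keys 'b' and 'c' appear in both tables; 'a' and 'd' appear in only one
left_tbl  = table({'a'; 'b'; 'c'}, [1; 2; 3], 'VariableNames', {'key', 'x'});
right_tbl = table({'b'; 'c'; 'd'}, [20; 30; 40], 'VariableNames', {'key', 'y'});

% innerjoin keeps only the rows whose key is present in both tables
inner_tbl = innerjoin(left_tbl, right_tbl, 'Keys', 'key');

% outerjoin keeps every row and fills the missing numeric values with NaN
outer_tbl = outerjoin(left_tbl, right_tbl, 'Keys', 'key', 'MergeKeys', true);

disp(inner_tbl)
disp(outer_tbl)
```

Comparing the row counts of the two results is a quick way to check how many rows went unmatched.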