Day 24: Swap out MATLAB functions with your own NaN-friendly ones
One of the problems with some of MATLAB’s built-in statistics functions is that they return NaN as soon as any input value is NaN. In this post I’ll show you how to build on MATLAB’s extremely useful NaN-friendly versions of sum, mean, and std: nansum, nanmean, and nanstd. This skill will allow you to replace NaN-unfriendly functions like corrcoef and zscore with your own custom functions that keep working when some values are missing.
Make your own corrcoef robust to NaN values
For me, the most typical use of the correlation coefficient function is to find the correlation between two data sets: x and y. Here are my two gripes about MATLAB’s built-in corrcoef function:
- I want robustness to NaN values. If you try to calculate the correlation coefficient and there’s a NaN value in either your x or y, forget about it.
- I want one value. The correlation coefficient. I don’t want a matrix that tells me that x is perfectly correlated with itself, or that y is correlated with itself.
Let’s set up a simple data set and show off the problem.
rng(5);
x = [1:10]+rand(1,10);
y = [1:10]+rand(1,10);
x(end) = nan;
corrcoef( x,y )
Yuck. A bunch of NaNs.
Here’s the solution using a one-line anonymous function that takes both your x and y as inputs. Enter it once at the command line or include it in your script.
mycorrcoef = @(x,y) (1./(numel(x)-1)) .* nansum( (x-nanmean(x))/nanstd(x) .* (y-nanmean(y))/nanstd(y) );
It might look unwieldy, but it gets the job done. Note that this is an approximation to the “real correlation coefficient”: each mean and standard deviation is computed over that vector’s own non-NaN values, and numel(x) still counts the NaN entries. If you have many NaN values, I would recommend interpolating or figuring out what’s going on rather than presenting the correlation coefficient in a statistical table.
Now enter mycorrcoef( x, y ) and it should return a value of 0.7949, a positive correlation, which is what we would expect since we made our data to be roughly correlated.
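If you would rather have the exact coefficient for the observations where both values are present, one option is to drop the incomplete pairs first and hand the rest to the built-in corrcoef. This is a minimal sketch rather than part of the original one-liner: the variable names ok, R, and r are only illustrative, and it assumes x and y are vectors of the same length.

```matlab
% Same test data as above, with the last value of x missing
rng(5);
x = (1:10) + rand(1,10);
y = (1:10) + rand(1,10);
x(end) = NaN;

% Keep only the pairs where both x and y are present
ok = ~isnan(x) & ~isnan(y);

% corrcoef returns a 2-by-2 matrix; the off-diagonal entry is the x-y correlation
R = corrcoef( x(ok), y(ok) );
r = R(1,2);
```

Because this throws away the whole pair whenever either value is missing, r will generally differ from the value returned by the mycorrcoef approximation.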
Do the same thing for zscore
The built-in MATLAB function zscore suffers the same fate. A single NaN in the input turns every value of the result into NaN. Here’s how to get around it:
myzscore = @(x) (x-nanmean(x))./nanstd(x);
Try it with the x from the data above. You’ll see that z-scores are calculated for all of the values except the last one, which went in as NaN and came out as NaN! No disruption to your workflow.
One caveat: The anonymous functions shown here are intended to assist you in quickly getting a sense of correlations and outliers. If you are going to perform more advanced statistical analyses or you have a lot of NaN values, I highly recommend that you figure out where your NaNs are coming from and think of a strategy to safely exclude those observations.
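If nanmean and nanstd are not available on your machine (they ship with the Statistics and Machine Learning Toolbox), the same idea can be sketched with the 'omitnan' flag that the base MATLAB functions mean and std accept. This assumes release R2015a or later; myzscore2 and z are hypothetical names used only for this example.

```matlab
% Same test data as above, with the last value of x missing
rng(5);
x = (1:10) + rand(1,10);
x(end) = NaN;

% z-score that ignores NaNs when computing the mean and standard deviation
myzscore2 = @(x) (x - mean(x,'omitnan')) ./ std(x,'omitnan');

z = myzscore2(x);

% Only the entry that went in as NaN should come out as NaN
disp( isnan(z) )
```

The non-NaN entries of z have zero mean and unit standard deviation, since the NaN is left out of both statistics.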
Hope this will inspire you to use more NaN-friendly operations in your everyday work, and also quickly improve some of the other built-in MATLAB functions.
This story is a part of my series titled ‘30 Days of MATLAB tips I wish I had known doing graduate school in neuroscience’. Follow me here (Neurojojo) or on Twitter to stay updated with more tips.