Day 19: The easiest way to remove outliers from time series data
We’ve all dealt with outliers in our time series data. Here is one very simple function that you can use for removing them.
hampel( data )
This one’s super straightforward and usually does the trick.
Let’s generate some fake data and place some outliers into it:
rng(10)
mydata = normrnd(0,1,100,1);
mydata([25,50,75]) = 5;
Now we’ll apply hampel using its own default parameters and see how we do.
mydata_filtered = hampel(mydata);
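If you want to check the result numerically instead of eyeballing a plot, one quick option is to print the three planted points before and after filtering. This is only a sketch: the variable name outlier_idx is mine, and whether all three points actually get replaced depends on the random data around them.

```matlab
% Rebuild the fake data so this snippet runs on its own
rng(10)
mydata = normrnd(0,1,100,1);
mydata([25,50,75]) = 5;

mydata_filtered = hampel(mydata);

% outlier_idx is an illustrative name for the positions we planted
outlier_idx = [25 50 75];

% Column 1: original values, column 2: values after hampel
disp([mydata(outlier_idx), mydata_filtered(outlier_idx)])
```

Any row where the second column is no longer 5 is a point that hampel swapped out.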
Wow!
We can use patch to show us which data points were scrapped.
Restart with the artificial data.
rng(10)
mydata = normrnd(0,1,100,1);
mydata([25,50,75]) = 5;
Now the good thing about hampel is that you can use it to tell you which points it removed.
It won’t hand you a count directly, but its second output is a logical vector with 1’s where it replaced a point (with the local median) and 0’s everywhere else (you can use find to get the indices in the array, if that’s your thing). This works really well with patch, which I covered yesterday.
[mydata_filtered,filtered_logical] = hampel(mydata);
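Before plotting, here is a small sketch of how that second output can be turned into indices and a count. The names outlier_indices and num_replaced are just illustrative.

```matlab
% Rebuild the fake data so this snippet runs on its own
rng(10)
mydata = normrnd(0,1,100,1);
mydata([25,50,75]) = 5;

[mydata_filtered, filtered_logical] = hampel(mydata);

% Positions of the points hampel flagged
outlier_indices = find(filtered_logical);

% Number of flagged points
num_replaced = nnz(filtered_logical);

fprintf('hampel replaced %d point(s)\n', num_replaced)
disp(outlier_indices')
```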
filtered_logical = double(filtered_logical); % a logical array can't hold NaN
mydata_filtered(end+1) = nan;
filtered_logical(end+1) = nan;
x_values = (1:numel(mydata_filtered))';
patch(x_values, mydata_filtered, filtered_logical, ...
    'marker','o','markerfacecolor','flat')
The final output should look something like this:
But we can do better!
Fine tune hampel
For best results, you can adjust the parameters given to the hampel function.
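The second argument is the number of neighbors on each side of a point that hampel uses to compute the local median (the default is 3), and the third is how many standard deviations away from that median a point has to be before it counts as an outlier (the default is also 3). If you know, for example, that outliers occur every 20 points or so and they are at least 4 standard deviations away from the data around them, you can widen the window to 20 neighbors and raise the threshold to 4 to improve your filtering.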
Use the code below to adjust your hampel function and see what you get!
[mydata_,logical_] = hampel(mydata,20,4)
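To see what those two numbers actually control, here is a hand-rolled sketch of a Hampel-style filter. It is not MATLAB's own implementation: the variable names are mine, and shrinking the window near the ends of the series is an assumption on my part, so results may not match hampel exactly at the edges.

```matlab
% Rebuild the fake data so this snippet runs on its own
rng(10)
mydata = normrnd(0,1,100,1);
mydata([25,50,75]) = 5;

k = 20;       % neighbors on each side of the current point
nsigma = 4;   % threshold, in estimated standard deviations

n = numel(mydata);
sketch_filtered = mydata;
sketch_logical = false(n,1);

for i = 1:n
    % Assumption: near the ends, just use whatever neighbors exist
    window = mydata(max(1,i-k):min(n,i+k));
    local_median = median(window);

    % 1.4826 scales the median absolute deviation so that it estimates
    % the standard deviation of normally distributed data
    local_sigma = 1.4826 * median(abs(window - local_median));

    if abs(mydata(i) - local_median) > nsigma * local_sigma
        sketch_filtered(i) = local_median;  % replace the point, don't delete it
        sketch_logical(i) = true;
    end
end

% Indices flagged by the sketch
disp(find(sketch_logical)')
```

A bigger k makes the local median and spread less sensitive to nearby outliers, and a bigger nsigma makes the filter more forgiving, so fewer points get flagged.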