Sagitto recently developed calibration models for a PerkinElmer DA7250 at-line NIR instrument, as part of a pre-purchase evaluation exercise conducted by a large hops producer. We use this as a case study to describe how we detect outliers in our training data, and to compare this to using Hotelling's T2 and Q-Residuals.
Sanity Check - Look At The NIR Spectra
As a first step, we plotted the NIR spectra in the training set to see if any looked unusual. In this case, the spectra all had the general shape we expected.
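For readers who want to reproduce this step, here is a minimal sketch using numpy and matplotlib. The file name, column layout, and wavelength range are illustrative assumptions, not the customer's actual data.

```python
# A minimal sketch of the sanity-check plot, assuming the training spectra
# are in a hypothetical CSV file with one row per sample and one column
# per wavelength.
import numpy as np
import matplotlib.pyplot as plt

spectra = np.loadtxt("hops_training_spectra.csv", delimiter=",")  # (n_samples, n_wavelengths)
wavelengths = np.linspace(950, 1650, spectra.shape[1])            # DA7250-style range, illustrative

plt.plot(wavelengths, spectra.T, color="grey", alpha=0.4, linewidth=0.5)
plt.xlabel("Wavelength (nm)")
plt.ylabel("Absorbance")
plt.title("Training set spectra - look for anything unusual")
plt.show()
```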
Examine The Initial Cross-Validation Plot
Our next step was to build an initial multivariate calibration model using Sagitto's proprietary techniques, and examine its cross-validation plot.
We immediately noticed two unusual results.
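Sagitto's modelling techniques are proprietary, but the cross-validation check itself can be reproduced with any regressor. Below is a minimal sketch using scikit-learn's cross_val_predict, with a PLS model standing in for the proprietary one; X and y are assumed to hold the spectra and reference values, and the number of components is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

# X: (n_samples, n_wavelengths) spectra; y: reference values.
y_cv = cross_val_predict(PLSRegression(n_components=5), X, y, cv=10).ravel()

plt.scatter(y, y_cv, s=15)
plt.plot([y.min(), y.max()], [y.min(), y.max()], "k--")  # ideal 1:1 line
plt.xlabel("Reference value")
plt.ylabel("Cross-validated prediction")
plt.show()

# Samples far from the 1:1 line are the ones worth querying.
resid = np.abs(np.ravel(y) - y_cv)
print(np.argsort(resid)[-5:])  # indices of the five largest CV residuals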
Talk To The Customer
Having noticed these two anomalous results in our initial model, we checked with our customer. Sure enough, there were easy explanations: sample 22BP02003 should have had a reference value of 14.7, not 5.1, and the spectrum that we had been supplied for sample 22BP02016 was mislabelled. After correcting these two outliers, we rebuilt the model and got much better results.
Now another sample (22BP01909) seemed to be a potential outlier. However, our customer confirmed that the reference value for this sample was correct, and we chose not to remove it. It could simply be an unusual sample, and removing it could be a mistake.
We need to balance the desire to remove outliers in order to increase model accuracy against the risk of over-fitting the model to a cleaned training set that no longer represents the data it will see when deployed.
The Same Process Using PLS Regression
Just for comparison, we repeated our outlier detection process using the widely used Partial Least Squares (PLS) Regression technique, building the kind of model that might be created with Aspen Unscrambler X. This initial PLS model also highlighted our two outliers.
After correcting these two outliers, we rebuilt the PLS Regression model. As expected, this resulted in an improvement - although the revised PLS model is not as good as the model built using Sagitto’s proprietary machine learning techniques. (Incidentally, this illustrates why Sagitto rarely uses PLS Regression.)
Once again the new PLS Regression model suggests that sample 22BP01909 might be a potential outlier.
We want to be sure that any outlier that we remove is definitely an error, and not just an unusual sample that doesn't fit the model's notion of a 'good' sample.
The outlier detection method described above - eyeballing the NIR spectra, then reviewing cross-validation plots of initial calibration models - works well for small training sets where we have high confidence in the source of the data. But it may struggle to scale to large datasets in which the provenance of each individual data point is less certain. For that reason, it's worth reviewing some of the more automated outlier detection methods used in spectroscopy applications.
Outlier Detection Using Hotelling's T2 and Q-Residuals from PLS Regression Models
A common technique for identifying outliers in PLS models is to calculate two statistics for each sample - Hotelling's T2 and Q-Residuals. Usually these two statistics are visualised in a scatter plot, with a 95% confidence interval also plotted to give the values a sense of scale. Here's what we found when we used our original PLS model to calculate Hotelling's T2 and Q-Residuals for the hops data, prior to correcting samples 22BP02003 and 22BP02016.
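The statistics themselves are straightforward to compute from a fitted PLS model. Here is a minimal sketch in Python, loosely following the approach in Daniel Pelliccia's blogpost cited in the acknowledgements; the 95% limits use the common F-distribution formula for T2 and Box's chi-square approximation for Q, and the number of components is an assumption.

```python
import numpy as np
from scipy import stats
from sklearn.cross_decomposition import PLSRegression

def t2_q_stats(X, y, n_comp=5, alpha=0.95):
    pls = PLSRegression(n_components=n_comp, scale=False)
    pls.fit(X, y)

    T = pls.x_scores_    # sample scores, (n_samples, n_comp)
    P = pls.x_loadings_  # X loadings, (n_wavelengths, n_comp)

    # Hotelling's T2: squared scores, scaled by each component's variance
    t2 = np.sum((T / T.std(axis=0, ddof=1)) ** 2, axis=1)

    # Q-residuals: squared reconstruction error of each (centered) spectrum
    Xc = X - X.mean(axis=0)
    E = Xc - T @ P.T
    q = np.sum(E ** 2, axis=1)

    # 95% limits: F-distribution for T2, Box's chi-square approximation for Q
    n = X.shape[0]
    t2_lim = n_comp * (n - 1) / (n - n_comp) * stats.f.ppf(alpha, n_comp, n - n_comp)
    g = q.var(ddof=1) / (2 * q.mean())
    h = 2 * q.mean() ** 2 / q.var(ddof=1)
    q_lim = g * stats.chi2.ppf(alpha, h)

    return t2, q, t2_lim, q_lim
```

Plotting q against t2 with the two limit lines reproduces the scatter plot described above; samples beyond either limit are the candidates worth a closer look.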
Looking at the highest values in tabular form, we see that while sample 2016 stands out with a T2 value of 57.6, sample 2003 (our other known error) doesn't make it into the top tier of potential outliers using this method. However, a new candidate emerges: sample 1944, with a very high Q-Residual value.
This surprisingly high Q-Residual value prompted us to take a closer look at sample 1944. We concluded that it was not an error and should stay. Having cleaned the training set of our two genuine outliers - 2003 and 2016 - we calculated Hotelling's T2 and Q-Residual values for the revised PLS model. This generated more candidates for consideration as outliers.
When To Stop?
At some point in the hunt for outliers, a decision needs to be made about when to stop. Sagitto tends to err on the side of caution, removing data from a training set only when we're sure that it's an error and not just an unusual sample. However, Hotelling's T2 and Q-Residuals have a place for large, noisy datasets where a more automated approach is required.
To illustrate how this can work, the video below shows 329 NIR spectra being removed from a dataset of 10,243 scans of mango fruit measured for dry matter, with the 95% confidence limits marked in blue (T2) and yellow (Q-Residuals). As each sample is removed, a new PLS model is created and the T2 and Q-Residual values are recalculated on the remaining data. These values change with each iteration: just as you might think that sufficient samples have been excluded, new ones become candidates for removal! The decision on when to stop ultimately becomes a subjective one. In the paper that Sagitto used as the basis for this example, the authors (using a different process to the one shown here) chose to exclude 329 spectra (about 3% of the initial dataset).
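The removal loop shown in the video can be sketched as follows. This is a hypothetical re-implementation (not the authors' code, nor the exact process used in the paper), reusing the t2_q_stats() helper from the earlier sketch.

```python
import numpy as np  # t2_q_stats() is defined in the earlier sketch

def iterative_outlier_removal(X, y, n_comp=5, n_remove=329):
    keep = np.arange(X.shape[0])
    for _ in range(n_remove):
        t2, q, t2_lim, q_lim = t2_q_stats(X[keep], y[keep], n_comp)
        # how far each remaining sample sits beyond the two 95% limits
        excess = np.maximum(t2 / t2_lim, q / q_lim)
        worst = np.argmax(excess)
        if excess[worst] <= 1.0:  # nothing left outside the limits
            break
        keep = np.delete(keep, worst)  # drop the worst sample and refit
    return keep  # indices of retained samples
```

The stopping rule here (stop when nothing exceeds the 95% limits, or after a fixed budget) is just one choice among many, which is exactly the subjectivity noted above.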
Conclusion
Outlier detection is an important step in preparing spectroscopy data for machine learning models. Hotelling's T2 and Q-Residuals are two outlier detection methods commonly used in chemometrics. However, Sagitto has found that they need to be used with caution to avoid discarding unusual but valid data.
Acknowledgements
Special thanks to the following:
Daniel Pelliccia of NIRPY Research for his blogpost 'Outliers Detection with PLS Regression for NIR Spectroscopy in Python'
Anderson, N., Walsh, K., Flynn, J., & Walsh, J. (2020). Achieving robustness across season, location and cultivar for a NIRS model for intact mango fruit dry matter content. II. Local PLS and nonlinear models. Postharvest Biology and Technology, 171, 111358. doi:10.1016/j.postharvbio.2020.111358
Mishra, P., & Passos, D. (2021). A synergistic use of chemometrics and deep learning improved the predictive performance of near-infrared spectroscopy models for dry matter prediction in mango fruit. Chemometrics and Intelligent Laboratory Systems, 212, 104287. doi:10.1016/j.chemolab.2021.104287
My father Rowland Blackith Hill, who taught me many things including how to draft sheep - and the occasional goat.