Sagitto recently developed calibration models for a PerkinElmer DA7250 at-line NIR instrument, as part of a pre-purchase evaluation exercise conducted by a large hops producer. We use this as a case study to describe how we detect outliers in our training data, and to compare this to using Hotelling's T2 and Q-Residuals.
Sanity Check - Look At The NIR Spectra
As a first step, we plotted the NIR spectra in the training set to see if any looked unusual. In this case, the spectra all had the general shape we expected.
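For readers who want to reproduce this step, here is a minimal sketch using numpy and matplotlib. The file name, column layout, and wavelength range are illustrative assumptions, not the customer's actual data.

```python
# A minimal sketch of the sanity-check plot, assuming the training spectra
# are in a hypothetical CSV file with one row per sample and one column
# per wavelength.
import numpy as np
import matplotlib.pyplot as plt

spectra = np.loadtxt("hops_training_spectra.csv", delimiter=",")  # (n_samples, n_wavelengths)
wavelengths = np.linspace(950, 1650, spectra.shape[1])            # DA7250-style range, illustrative

plt.plot(wavelengths, spectra.T, color="grey", alpha=0.4, linewidth=0.5)
plt.xlabel("Wavelength (nm)")
plt.ylabel("Absorbance")
plt.title("Training set spectra - look for anything unusual")
plt.show()
```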
Examine The Initial Cross-Validation Plot
Our next step was to build an initial multivariate calibration model using Sagitto's proprietary techniques, and examine its cross-validation plot.
We immediately noticed two unusual results.
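Sagitto's modelling techniques are proprietary, but the cross-validation check itself can be reproduced with any regressor. Below is a minimal sketch using scikit-learn's cross_val_predict, with a PLS model standing in for the proprietary one; X and y are assumed to hold the spectra and reference values, and the number of components is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

# X: (n_samples, n_wavelengths) spectra; y: reference values.
y_cv = cross_val_predict(PLSRegression(n_components=5), X, y, cv=10).ravel()

plt.scatter(y, y_cv, s=15)
plt.plot([y.min(), y.max()], [y.min(), y.max()], "k--")  # ideal 1:1 line
plt.xlabel("Reference value")
plt.ylabel("Cross-validated prediction")
plt.show()

# Samples far from the 1:1 line are the ones worth querying.
resid = np.abs(np.ravel(y) - y_cv)
print(np.argsort(resid)[-5:])  # indices of the five largest CV residuals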
Talk To The Customer
Having noticed these two anomalous results in our initial model, we checked with our customer. Sure enough, there were easy explanations: sample 22BP02003 should have had a reference value of 14.7, not 5.1, and the spectrum that we had been supplied for sample 22BP02016 was mislabelled. After correcting these two outliers, we rebuilt the model and got much better results.
Now another sample (22BP01909) seemed to be a potential outlier. However, our customer confirmed that the reference value for this sample was correct, and we chose not to remove it. It could simply be an unusual sample, and removing it could be a mistake.
We need to balance the desire to remove outliers in order to increase model accuracy against the risk of over-fitting the model to a cleaned training set that no longer represents the data it will see when deployed.
The Same Process Using PLS Regression
Just for comparison, we repeated our outlier detection process using the widely used Partial Least Squares (PLS) Regression technique, building the kind of model that might be created with Aspen Unscrambler X. This initial PLS model also highlighted our two outliers.
After correcting these two outliers, we rebuilt the PLS Regression model. As expected, this resulted in an improvement - although the revised PLS model is not as good as the model built using Sagitto’s proprietary machine learning techniques. (Incidentally, this illustrates why Sagitto rarely uses PLS Regression.)
Once again the new PLS Regression model suggests that sample 22BP01909 might be a potential outlier.
We want to be sure that any outlier that we remove is definitely an error, and not just an unusual sample that doesn't fit the model's notion of a 'good' sample.
The outlier detection method described above - eyeballing the NIR spectra, then reviewing cross-validation plots of initial calibration models - works well for small training sets where we have high confidence in the source of the data. But it may struggle to scale to large datasets in which the provenance of each individual data point is less certain. For that reason, it's worth reviewing some of the more automated outlier detection methods used in spectroscopy applications.
Outlier Detection Using Hotelling's T2 and Q-Residuals from PLS Regression Models
A common technique for identifying outliers in PLS models is to calculate two statistics for each sample - Hotelling's T2 and Q-Residuals. Usually these two statistics are visualised in a scatter plot, with a 95% confidence interval also plotted to give the values a sense of scale. Here's what we found when we used our original PLS model to calculate Hotelling's T2 and Q-Residuals for the hops data, prior to correcting samples 22BP02003 and 22BP02016.
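The statistics themselves are straightforward to compute from a fitted PLS model. Here is a minimal sketch in Python, loosely following the approach in Daniel Pelliccia's blogpost cited in the acknowledgements; the 95% limits use the common F-distribution formula for T2 and Box's chi-square approximation for Q, and the number of components is an assumption.

```python
import numpy as np
from scipy import stats
from sklearn.cross_decomposition import PLSRegression

def t2_q_stats(X, y, n_comp=5, alpha=0.95):
    pls = PLSRegression(n_components=n_comp, scale=False)
    pls.fit(X, y)

    T = pls.x_scores_    # sample scores, (n_samples, n_comp)
    P = pls.x_loadings_  # X loadings, (n_wavelengths, n_comp)

    # Hotelling's T2: squared scores, scaled by each component's variance
    t2 = np.sum((T / T.std(axis=0, ddof=1)) ** 2, axis=1)

    # Q-residuals: squared reconstruction error of each (centered) spectrum
    Xc = X - X.mean(axis=0)
    E = Xc - T @ P.T
    q = np.sum(E ** 2, axis=1)

    # 95% limits: F-distribution for T2, Box's chi-square approximation for Q
    n = X.shape[0]
    t2_lim = n_comp * (n - 1) / (n - n_comp) * stats.f.ppf(alpha, n_comp, n - n_comp)
    g = q.var(ddof=1) / (2 * q.mean())
    h = 2 * q.mean() ** 2 / q.var(ddof=1)
    q_lim = g * stats.chi2.ppf(alpha, h)

    return t2, q, t2_lim, q_lim
```

Plotting q against t2 with the two limit lines reproduces the scatter plot described above; samples beyond either limit are the candidates worth a closer look.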
Looking at the highest values in tabular form, we see that while sample 2016 stands out with a T2 value of 57.6, sample 2003 (our other known error) doesn't make it into the top tier of potential outliers using this method. However, a new candidate emerges: sample 1944, with a very high Q-Residual value.
This surprisingly high Q-Residual value prompted us to take a closer look at sample 1944. We concluded that it was not an error and should stay. Having cleaned the training set of our two genuine outliers - 2003 and 2016 - we calculated Hotelling's T2 and Q-Residual values for the revised PLS model. This generated more candidates for consideration as outliers.
When To Stop?
At some point in the hunt for outliers, a decision needs to be made about when to stop. Sagitto tends to err on the side of caution, removing data from a training set only when we're sure that it's an error and not just an unusual sample. However, Hotelling's T2 and Q-Residuals have a place for large, noisy datasets where a more automated approach is required.
To illustrate how this can work, the video below shows 329 NIR spectra being removed from a dataset of 10,243 scans of mango fruit measured for dry matter, with the 95% confidence limits marked in blue (T2) and yellow (Q-Residuals). As each sample is removed, a new PLS model is created and the T2 and Q-Residual values are recalculated on the remaining data. These values change with each iteration: just as you might think that sufficient samples have been excluded, new ones become candidates for removal! The decision on when to stop ultimately becomes a subjective one. In the paper that Sagitto used as the basis for this example, the authors (using a different process to the one shown here) chose to exclude 329 spectra (about 3% of the initial dataset).
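The removal loop shown in the video can be sketched as follows. This is a hypothetical re-implementation (not the authors' code, nor the exact process used in the paper), reusing the t2_q_stats() helper from the earlier sketch.

```python
import numpy as np  # t2_q_stats() is defined in the earlier sketch

def iterative_outlier_removal(X, y, n_comp=5, n_remove=329):
    keep = np.arange(X.shape[0])
    for _ in range(n_remove):
        t2, q, t2_lim, q_lim = t2_q_stats(X[keep], y[keep], n_comp)
        # how far each remaining sample sits beyond the two 95% limits
        excess = np.maximum(t2 / t2_lim, q / q_lim)
        worst = np.argmax(excess)
        if excess[worst] <= 1.0:  # nothing left outside the limits
            break
        keep = np.delete(keep, worst)  # drop the worst sample and refit
    return keep  # indices of retained samples
```

The stopping rule here (stop when nothing exceeds the 95% limits, or after a fixed budget) is just one choice among many, which is exactly the subjectivity noted above.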
Conclusion
Outlier detection is an important step in preparing spectroscopy data for machine learning models. Hotelling's T2 and Q-Residuals are two outlier detection methods commonly used in chemometrics. However, Sagitto has found that they need to be used with caution to avoid discarding unusual but valid data.
Acknowledgements
Special thanks to the following:
Daniel Pelliccia of NIRPY Research for his blogpost 'Outliers Detection with PLS Regression for NIR Spectroscopy in Python'
Anderson, N., Walsh, K., Flynn, J., & Walsh, J. (2020). Achieving robustness across season, location and cultivar for a NIRS model for intact mango fruit dry matter content. II. Local PLS and nonlinear models. Postharvest Biology and Technology, 171, 111358. doi:10.1016/j.postharvbio.2020.111358
Mishra, P., & Passos, D. (2021). A synergistic use of chemometrics and deep learning improved the predictive performance of near-infrared spectroscopy models for dry matter prediction in mango fruit. Chemometrics and Intelligent Laboratory Systems, 212, 104287. doi:10.1016/j.chemolab.2021.104287
My father Rowland Blackith Hill, who taught me many things including how to draft sheep - and the occasional goat.