Data Science

Are We There Yet?

December 5, 2024 • 3 min read
George Hill
Sagitto Ltd

How much training data will we need? Let's look at a 'map'.

When building a machine learning model, our customers often ask "How much training data will we need?" It's rather like kids in the back seat on a long journey asking, "How much further until we get there?" Just as a map helps us work out how far we still have to travel, we can generate Learning Curves that show the direction of progress and the likely number of training samples needed to complete our model.

Learning Curves are like maps: they show how much more data is needed for the machine learning model.

Obtaining data to train machine learning models can be an expensive and time-consuming process, so asking how much data is required is a very reasonable question.

It is easy to say that "more data is always better", but there are diminishing returns to adding more training data as a dataset grows.

When there are only 50 datapoints to learn from, adding 10 more can make a big difference to the model. When there are 500 datapoints, the next 10 will make less of a difference, and when there are 5000 datapoints they might make almost no difference at all.

To answer the question of how much data is needed, we can use 'Learning Curves'. A learning curve is created by building our machine learning models on successively larger portions of the current dataset - for example, we might train models with 10% of the current data, then 20%, 30%, and so on. We evaluate the performance of each of these models, and plot those results on a graph (as shown below).

The model is evaluated multiple times for each dataset size. The solid line is the average error score across these evaluations, and the shaded area represents the standard deviation.
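
For readers who want to try this themselves, here is a minimal sketch of generating a learning curve with scikit-learn's learning_curve helper. The Ridge model, the synthetic regression data, and the RMSE scoring are placeholders chosen for illustration; they are not the models or data we use at Sagitto.

# A minimal sketch of building a learning curve with scikit-learn.
# The estimator and dataset here are illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)

# Train on successively larger portions of the data: 10%, 20%, ..., 100%.
train_sizes, _, test_scores = learning_curve(
    Ridge(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,                                    # evaluate each size 5 times
    scoring="neg_root_mean_squared_error",
)

error = -test_scores                         # convert scores back to errors
mean, std = error.mean(axis=1), error.std(axis=1)

# Solid line: average error; shaded band: standard deviation across folds.
plt.plot(train_sizes, mean)
plt.fill_between(train_sizes, mean - std, mean + std, alpha=0.3)
plt.xlabel("Number of training samples")
plt.ylabel("Error (RMSE)")
plt.title("Learning curve")
plt.show()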

The exact shape of the learning curve is different for every dataset, but they share a general trend: model performance flattens out as the effect of diminishing returns kicks in. There is also an element of chance in each curve's exact shape - for example, from the random selection of the specific 10% used for the first portion of the dataset. We therefore calculate the learning curve many times over and take the final curve as an average; the variation between the individual curves is shown with a shaded area on the graph.

The final slope of the learning curve can be used to estimate the expected benefit (in terms of improved accuracy) of collecting new data, which can then be weighed against the cost (in time or money) of collecting it. Alternatively, if the customer requires a specific level of accuracy from their model, we can extrapolate the learning curve to estimate how much more data we would need to reach that point (assuming we haven't passed it already!).
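
As an illustration of that extrapolation, here is a small sketch that fits an inverse power law - a common, though not universal, assumption for learning-curve shapes, and not necessarily the method we use in production - to a handful of invented (sample size, error) points, then solves for the dataset size needed to reach a target error.

# An illustrative sketch of extrapolating a learning curve, assuming the
# error follows an inverse power law: error(n) ~ a * n^(-b) + c.
# All numbers below are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Inverse power-law decay of error with training-set size n."""
    return a * np.power(n, -b) + c

# Hypothetical (sample size, error) points read off a learning curve.
sizes  = np.array([50, 100, 200, 400, 800])
errors = np.array([0.40, 0.31, 0.25, 0.21, 0.19])

params, _ = curve_fit(power_law, sizes, errors, p0=[1.0, 0.5, 0.1], maxfev=10000)
a, b, c = params

# Invert the fitted curve to estimate the data needed for a target error.
target = 0.17
if target > c:  # the target must sit above the fitted asymptote to be reachable
    n_needed = (a / (target - c)) ** (1.0 / b)
    print(f"Estimated samples needed for error {target}: ~{n_needed:.0f}")
else:
    print(f"Target error {target} is below the fitted asymptote ({c:.3f}).")

The fitted asymptote c also gives a rough ceiling on what more data alone can achieve: if the required accuracy lies beyond it, the answer is not more samples but a better model or better reference data.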

Of course, collecting a dataset is not just about the raw number of samples - it is also important to collect the right distribution of samples, to ensure the dataset is representative and robust (stay tuned for future posts!). At Sagitto, we work directly with our customers to understand their use cases and to ensure their datasets and their models are of the highest possible quality.


George Hill
Sagitto Ltd
Sagitto's founder, George Hill, first started working with artificial intelligence during the 1980s, while developing 'expert systems' within Bank of America in London. On returning to New Zealand, he undertook part-time study with the University of Waikato's Machine Learning Group while working for Hill Laboratories, a well-known New Zealand commercial testing laboratory. This led to the formation of Sagitto Limited, dedicated to combining the power of artificial intelligence and machine learning with spectroscopy.
