Is your data hiding a secret?
- Elise Hampton

- Jan 23
- 4 min read

There’s a joke that I love:
There are two types of people in the world:
those who can extrapolate from incomplete data.
It’s funny! But it also hits a little close to home.
I am of course referring to the data used to train Machine Learning (ML) models, and how it’s not always what it should be. According to Statista, the global volume of data created, captured, copied, and consumed is more than 150 zettabytes, with 90% of it generated in the last few years! To make this a more palatable number: 1 zettabyte is equivalent to roughly 250 billion DVDs. The amount of data available is only going to increase with growing social media usage, continued digitisation, the increasing number of people using the internet for content creation, and AI-generated data. You’d think that more data would mean more information and knowledge, but that depends on the data.
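If you want to sanity-check that DVD comparison yourself, it’s one line of arithmetic. A quick sketch in Python, assuming a round 4 GB per single-layer disc (the exact capacity varies):

```python
# Back-of-the-envelope check of the DVD comparison.
# Assumption: one single-layer DVD holds roughly 4 GB.
ZETTABYTE = 10**21   # bytes
DVD = 4 * 10**9      # bytes per disc

print(f"1 zettabyte ≈ {ZETTABYTE / DVD / 1e9:.0f} billion DVDs")
# -> 1 zettabyte ≈ 250 billion DVDs
```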
Let’s backtrack a little. Around 2010 the buzzword “Big Data” started gaining traction, and by 2015 it was at its height. At that time in my career I was working with astronomical survey data (the Sloan Digital Sky Survey, SAMI, and S7, if anyone is interested), so to me big data was a well-organised catalogue of millions of astronomical images. Was I in for a shock later! While I was calculating clustering and star formation rates of galaxies, the industrial world was collecting data. Data has always been collected, but with the adoption of the Big Data paradigm and cheaper technology, even more of it could be collected and stored cheaply and accessibly, all in the hope that it would be useful!
“Big Data” transformed into “Big Data Analytics”, “Actionable Intelligence”, and more recently “Artificial Intelligence”. Everyone wanted to use this magnificent collection of data, mostly to make our lives better. Well, that’s always been the intention. Enter Data Scientists: programmers with the unique skillset to dig in and understand what the data actually says. This is where my career went in 2017, when I moved out of Academia and into Industry, where all the “Big Data” was, right? What I found among the industrial datasets was… less than organised and not always useful. But there was value in there!
The main reason a company would bring on a Data Scientist was to take their data and create something actionable, which mostly translated to building a machine learning model to predict, prescribe, or classify. Creating these models is a lot of fun, and there are so many different types you can use, but the model itself isn’t even half of the solution. What is the majority of the solution? The data you use to train it. If your data isn’t right for the task, then your model will produce some very interesting results.
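To make that concrete, here’s a minimal sketch (entirely synthetic data, using scikit-learn) of the same model producing “interesting” results simply because its training data doesn’t match where it gets used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(centre, n=500):
    # Two classes split along one feature, centred at `centre`.
    X = rng.normal(centre, 1.0, (n, 1))
    y = (X[:, 0] > centre).astype(int)
    return X, y

X_train, y_train = make_data(centre=0.0)    # the data we happened to collect
X_deploy, y_deploy = make_data(centre=3.0)  # the data the model actually sees

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on data like the training set:", model.score(*make_data(0.0)))
print("accuracy where the model is deployed:  ", model.score(X_deploy, y_deploy))
# The model is near-perfect on familiar data and barely better than a coin
# flip in deployment, without a single line of the model code changing.
```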
There are different kinds of data we need to keep an eye out for. The kind that causes models to act unusually, or just plain wrongly, is data that doesn’t represent how the model is expected to be used. This could be in the form of a missing subset of data.
Do you remember when digital cameras first started putting boxes around people’s faces so you could see them? One of the early facial recognition systems (built in a similar vein to those digital cameras) couldn’t detect the faces of people of colour. Why? Because the data being used was made up of Caucasian men and a smaller number of Caucasian women. It was good data; it was Big Data; it represented all the people on the research team. It didn’t, however, represent all the people who would be using the facial recognition model. They missed a large subset of data in training their model. But it’s also a simple correction: retrain with data that includes people of colour. This isn’t always the case.
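One way to catch a missing subset before it ships is to break your evaluation down by group instead of trusting a single overall score. A minimal sketch, assuming a labelled test set with a demographic column (the column names and values here are hypothetical):

```python
import pandas as pd

# Per-image results from a hypothetical face detector on a held-out test set.
results = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "detected": [True, True, True, True, True, False, False, False],
})

# The overall score looks respectable...
print("overall detection rate:", results["detected"].mean())

# ...but the per-group breakdown exposes who the model fails for.
print(results.groupby("group")["detected"].mean())
```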
There was a group in the US trying to create a prescriptive model to assign appropriate sentences to people convicted of crimes. This was a well-intentioned use case, but the data they used had a social bias built in: people of colour had previously been given harsher sentences. You can see where this is going, right? The model, having been trained with this data, learnt the social bias and perpetuated it. So even though the intention had been to remove the bias caused by humans setting the sentences, the model was just copying what it had learnt, and it was harmful. It had learnt that someone’s ethnicity meant there should be a harsher sentence. The training data wasn’t necessarily missing a subset, but it didn’t represent the outcomes the model was intended to produce.
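This kind of bias survives even when you never show the model the protected attribute, because other features can act as proxies for it. A minimal sketch with synthetic data (the feature names and numbers are invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000

ethnicity = rng.integers(0, 2, n)              # protected attribute (0 or 1)
postcode = ethnicity + rng.normal(0, 0.3, n)   # proxy: strongly correlated
severity = rng.normal(5, 1, n)                 # legitimate feature

# Historical sentences carry a built-in bias of +2 years for group 1.
sentence = 2 * severity + 2 * ethnicity + rng.normal(0, 0.5, n)

# Train WITHOUT the ethnicity column: only severity and the proxy.
X = np.column_stack([severity, postcode])
model = LinearRegression().fit(X, sentence)

# The proxy soaks up the bias: its coefficient comes out large and positive,
# so the model still hands group 1 longer sentences without ever seeing
# ethnicity directly.
print(dict(zip(["severity", "postcode"], model.coef_.round(2))))
```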
Unbiasing a dataset can be very tricky, if not impossible. In the case of the sentencing model, you could remove the bias indicator, ethnicity, and all associated indicators like home address or location of crime, which can also indicate ethnicity. But this then leaves the model with a larger range of prediction values for the same conditions, which poses a new problem: the model isn’t very precise in its answers and will still have a large range, just distributed more randomly. Perhaps this use case is better suited to a standard set of rules that don’t include ethnicity.
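Sticking with the same synthetic setup, here’s a minimal sketch of that trade-off: dropping the proxy closes the bias pathway, but the spread of predictions for otherwise identical cases gets wider:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000
ethnicity = rng.integers(0, 2, n)
postcode = ethnicity + rng.normal(0, 0.3, n)
severity = rng.normal(5, 1, n)
sentence = 2 * severity + 2 * ethnicity + rng.normal(0, 0.5, n)

# Fit once with the proxy included and once with severity alone, then compare
# how widely the predictions miss for the same inputs.
for name, X in [("with proxy", np.column_stack([severity, postcode])),
                ("severity only", severity.reshape(-1, 1))]:
    model = LinearRegression().fit(X, sentence)
    residual_sd = (sentence - model.predict(X)).std()
    print(f"{name}: residual spread ≈ {residual_sd:.2f} years")
# Removing the proxy roughly doubles the spread: less biased, less precise.
```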
We know how important the data we use is to whether a model can be used correctly. So how do we ensure that models are trained on representative, complete, and unbiased datasets? Dig in! And not just into the data, but also into every possible use of your model. You need to ensure that the data used in training covers all the possibilities, and you need to ensure you understand all those possibilities. With these two ends sorted, the model in the middle will be the easiest part of the solution.
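Digging in can start with something as simple as comparing how a key attribute is distributed in your training set against who you actually expect to use the model. A minimal sketch, with hypothetical attributes and target shares:

```python
import pandas as pd

# Training set with a skew towards younger users (attribute is hypothetical).
train = pd.DataFrame({"age_band": ["18-30"] * 70 + ["31-50"] * 25 + ["51+"] * 5})

# Who we expect to use the model in production (assumed shares).
expected_share = {"18-30": 0.35, "31-50": 0.40, "51+": 0.25}
actual_share = train["age_band"].value_counts(normalize=True)

# Flag any group with less than half its expected representation.
for band, expected in expected_share.items():
    actual = actual_share.get(band, 0.0)
    flag = "  <-- under-represented" if actual < 0.5 * expected else ""
    print(f"{band}: train {actual:.0%} vs expected {expected:.0%}{flag}")
```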
The data used in training ML models is very important. It should be complete, organised, and unbiased. We can ensure that our models are only trained on the best of data. This might mean building new datasets, updating them when they are missing subsets, or not using a dataset at all, so we don’t create something harmful. The intention of ML and AI has always been to help us live better lives. So let’s make sure that stays the case.

