What is Data Bias?

The Basics

The Basics is a series exploring the concepts and individuals essential to purposeful business.


28th Jul 2023

In 2014, Amazon was using experimental machine learning software to hire its top tech talent. The AI would observe historical patterns in relevant CVs, take notes, and build its own version of an ideal candidate.

However, not long after its launch, the AI started penalising women’s applications. Why? It had developed a preference for men’s applications. The AI was simply doing what it was taught: because most of the CVs it had been trained to recognise patterns from came from men, it began to penalise any CV that included the word ‘women’s’, whether that was “women’s chess club captain” or the name of a women’s college. Amazon swiftly scrapped the recruiting tool, and with it went people’s trust.
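To make the mechanism concrete, here is a minimal sketch in Python. It is not Amazon's system, whose internals are not described here. It is a toy scorer built on made-up data, and the names `HISTORY`, `hire_rate_by_word` and `score_cv` are hypothetical. It only shows how a model that learns from skewed past outcomes can end up marking down a single word.

```python
from collections import defaultdict

# Entirely made-up toy history: (words on the CV, was the candidate hired?).
# The skew is deliberate: most past hires never mention "women's".
HISTORY = [
    ({"python", "chess"}, True),
    ({"python", "rugby"}, True),
    ({"java", "chess"}, True),
    ({"java", "rugby"}, True),
    ({"python", "women's", "chess"}, False),
    ({"java", "women's", "rugby"}, False),
    ({"python", "women's", "rugby"}, True),
    ({"java", "golf"}, False),
]


def hire_rate_by_word(history):
    """For each word, return the share of past CVs containing it that led to a hire."""
    seen = defaultdict(int)
    hired = defaultdict(int)
    for words, was_hired in history:
        for word in words:
            seen[word] += 1
            hired[word] += int(was_hired)
    return {word: hired[word] / seen[word] for word in seen}


def score_cv(words, rates):
    """Score a CV as the average historical hire rate of the words it contains."""
    known = [rates[word] for word in words if word in rates]
    return sum(known) / len(known) if known else 0.0


rates = hire_rate_by_word(HISTORY)

# Two CVs with identical skills; the second adds one extra word.
baseline = score_cv({"python", "chess"}, rates)
with_term = score_cv({"python", "chess", "women's"}, rates)

print(f"without the word: {baseline:.2f}")
print(f"with the word:    {with_term:.2f}")
```

The second CV scores lower even though nothing about the candidate's skills has changed. The scorer has no concept of gender; it is only repeating the pattern in the history it was given.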

This isn’t just a damning story about the dangers of AI. It points to a much older – and more prevalent – affliction that affects the way we consume information every single day: data bias.

What is data bias?

Data bias occurs when data or information is limited in some way, so that it paints an inaccurate picture of the population or doesn’t tell the full story. It can affect individuals, businesses and even societies at large.

Take the Amazon example above: the data produced an ideal candidate for the role who, among other traits, was a man. This reduced the likelihood of women being hired for the role, which perpetuated gender inequality within the industry at large and, as a side effect, damaged the company’s reputation.


And it’s not the AI’s fault; human biases were baked into Amazon’s machine learning software. AI will only be as biased as the data from which it feeds. Through the process of collecting, processing and analysing data, biases can arise, but the danger comes when data is presented as impartial or neutral – an undeniable truth.

What are some examples of data bias?

While data bias is most visible today in AI, it has been with us throughout history.


Data bias was potentially first described 400 years ago by Francis Bacon, when he wrote: “today’s theories were derived from scant data and few, frequently occurring cases, after which they were modelled. It is thus no wonder that they fail to predict new occurrences.” These words set the groundwork for what we understand to be an accurate – or as accurate as humanly possible – scientific experiment.

And once you know it, you start seeing it everywhere.

It could be visual-recognition algorithms that detect black people less reliably than white people, or people driving into lakes because their Sat Nav told them to. Maybe a researcher halts their experiment once they’ve got the outcome that they were hoping for. It’s likely to be at work whenever you see a sensationalist statistic from a tabloid, or when you come across what Darrell Huff calls a ‘gee whiz graph’, where an axis is distorted to make the claim seem more extreme than it is.


Data bias can even have implications for whether people survive a car crash. Crash test dummies mimic the average man: around 5 foot 8, weighing 76kg. A woman, who is likely to be shorter and to sit in a different position, so that her head doesn’t reach the safest part of the headrest, is more likely than a man to be injured in a car crash.

What are some of the different types?

The number of ways that data can be misrepresented is enormous, and the full list is far longer than can be included here. The following types of bias are the ones Google includes in its machine learning course to show developers how to mitigate the effect of bias when building machine learning models:

Reporting bias. Reporting bias occurs when the frequency of events, properties, and/or outcomes captured in a data set does not accurately reflect their real-world frequency.

Selection bias. When a sample population is not suitably selected. This can be ‘coverage bias’, where data isn’t selected in a representative fashion, ‘non-response bias’, where participation gaps influence the data, or ‘sampling bias’, where proper randomisation isn’t used during data collection.

Group attribution bias. When what is true of some individuals is generalised to the whole group they belong to, on the assumption that everyone’s behaviour and characteristics are the same. This can manifest as ‘in-group bias’, where you may have a preference for members of a group to which you also belong, or for characteristics that you also share, or ‘out-group homogeneity bias’, which is ‘a tendency to stereotype individual members of a group to which you do not belong, or to see their characteristics as more uniform’.

Automation bias. When people tend to favour information generated by automated systems over human-generated sources.

Implicit bias. When individuals make assumptions and respond based on their own experiences, often without being consciously aware of the bias.
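Of the types above, coverage bias (a form of selection bias) is perhaps the easiest to see in numbers. The following is a minimal Python sketch using a made-up workforce and made-up survey scores. The group names, scores and sample size are all assumptions chosen purely for illustration.

```python
import random
from statistics import mean

# Made-up population: 600 office staff who rate a policy 8/10,
# and 400 remote staff who rate it 3/10.
population = [("office", 8)] * 600 + [("remote", 3)] * 400

# What everyone actually thinks, on average.
true_mean = mean(score for _, score in population)

# Coverage bias: the survey is only handed out in the office,
# so remote staff never have the chance to answer.
covered = [score for group, score in population if group == "office"]
coverage_mean = mean(covered)

# A simple random sample drawn from the whole population instead.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(population, 200)
random_mean = mean(score for _, score in random_sample)

print(f"true average:          {true_mean:.2f}")
print(f"office-only survey:    {coverage_mean:.2f}")
print(f"random sample of 200:  {random_mean:.2f}")
```

The office-only survey reports a much rosier average than the workforce as a whole, while the random sample lands close to the true figure. Nothing in the office-only number looks wrong on its own, which is what makes this kind of bias easy to miss.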

What can business leaders do to avoid data bias?

Business leaders should ask at every stage, whether collecting, processing, analysing or sharing data, if the data they are working with is truly representative, and should make sure their teams are aware of these biases. This could be when conducting internal surveys, considering whether to incorporate machine learning into the organisation, or using statistics in marketing collateral.
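One simple way to ask whether data is representative is to compare who is in the data with who is in the population it is meant to describe. The sketch below assumes you already know the make-up of the wider population. The function names, the 5 percentage point tolerance and all of the figures are hypothetical.

```python
def representation_gaps(sample_counts, population_shares):
    """Return (share in sample - share in population) for each group."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


def flag_gaps(gaps, tolerance=0.05):
    """List the groups whose share is off by more than the tolerance."""
    return sorted(group for group, gap in gaps.items() if abs(gap) > tolerance)


# Made-up figures: the make-up of the workforce, and who answered a survey.
workforce_shares = {"women": 0.48, "men": 0.52}
survey_respondents = {"women": 30, "men": 70}

gaps = representation_gaps(survey_respondents, workforce_shares)
flagged = flag_gaps(gaps)

for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%} compared with the workforce")
print("groups to investigate:", flagged)
```

A check like this will not say why a group is under-represented, only that it is. The tolerance is a judgment call, and a sample can match on one characteristic while still being skewed on another.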

Just because you’re working with data doesn’t mean that the information is unbiased. Whether bias creeps in consciously or not, data must be handled honestly. Because really, it’s all “lies, damned lies, and statistics”.
