Big Tech's Misguided Use of Big Data: The Illusion of Perfect Prediction
The ethical implications of major tech firms leveraging extensive data to forecast human behavior have been a topic of debate for years. A notable incident from 2012 serves as a prime example: a father entered a Target store to express his frustration over the retailer sending promotional materials for pregnancy products to his teenage daughter.
Initially, the store manager was taken aback and apologized for the mistake. However, the father later contacted the manager to apologize himself: his daughter was indeed pregnant. This story highlights how Target had effectively used data analytics to anticipate the girl’s pregnancy before her own father was aware of it (source: Forbes).
In 2022, anecdotes about targeted advertisements appearing shortly after casual conversations are commonplace. The rise of interconnected AI systems like Alexa, Cortana, and Siri has intensified concerns about big tech's ability to glean insights about individuals, often surpassing their own self-awareness.
In this article, we will delve into the foundational mathematics that underpin these data analytics algorithms, scrutinizing their effectiveness. It’s important to clarify that while concerns about these technologies are valid, the algorithms may not be as sophisticated as many believe. Let’s dive in.
Big Tech's Data-Driven Predictions for Pandemic Infections
Imagine a scenario where a major tech company has devised a data analytics algorithm designed to forecast pandemic infection rates. While this is a fictional example, any similarity to real-world events is purely coincidental.
You might wonder why such a model would be necessary. In a hypothetical world where individuals are reluctant to disclose pandemic infections due to fears of penalties like quarantines, a regulatory body might fund such a project. Alternatively, the tech firm could intend to market the analytics data to pharmaceutical companies producing vaccines or medications.
Now, assume the model analyzes a population of 200 million people. This model would utilize a concept known as "degree of confidence" to indicate potential infections. Contrary to popular belief, these algorithms do not predict outcomes with complete certainty. They operate with a statistical confidence level, indicating the likelihood of a prediction being accurate. For our example involving 200 million individuals, the output might resemble the following:
The left section of the four-quadrant matrix displays the total predicted infections, while the right side shows those not predicted to be infected. The upper half represents the actual infected individuals, whereas the lower half depicts those who are not infected.
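To make that layout concrete, here is a minimal Python sketch that fills in the four quadrants using the counts analyzed in the next section (100,000 predicted infections, 10,000 actual infections, and an overlap of only 10); the variable names are my own.

```python
# Hypothetical counts used throughout this example:
population = 200_000_000        # total people the model scores
predicted_infected = 100_000    # people the model flags as infected
actually_infected = 10_000      # people who really are infected
true_positives = 10             # flagged AND actually infected

false_positives = predicted_infected - true_positives   # 99,990
false_negatives = actually_infected - true_positives    # 9,990
true_negatives = population - true_positives - false_positives - false_negatives  # 199,890,010

print(f"Upper-left  (predicted, infected):         {true_positives:,}")
print(f"Lower-left  (predicted, not infected):     {false_positives:,}")
print(f"Upper-right (not predicted, infected):     {false_negatives:,}")
print(f"Lower-right (not predicted, not infected): {true_negatives:,}")
```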
Analyzing the Data Analytics Model
Let’s say you work for this tech company and discover that a neighbor appears on the predicted infection list. Naturally, you would be concerned about their potential infection status, seeking to confirm if they belong to the upper-left section of the matrix.
Upon reviewing the data, you notice that only 0.01% (10 out of 100,000) of those predicted to be infected are actually confirmed cases, essentially none. Conversely, of the confirmed infected individuals, only 0.1% (10 out of 10,000) appear on the predicted list. In other words, given that your neighbor is on the list, there is a 99.99% probability that they are not infected.
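Those two percentages, and the 99.99% figure, follow from simple arithmetic on the same counts; here is a quick sketch (variable names are my own):

```python
true_positives = 10
predicted_infected = 100_000
actually_infected = 10_000

# P(actually infected | on the predicted list)
p_infected_given_listed = true_positives / predicted_infected   # 0.0001 -> 0.01%

# P(on the predicted list | actually infected)
p_listed_given_infected = true_positives / actually_infected    # 0.001  -> 0.1%

# P(not infected | on the predicted list)
p_not_infected_given_listed = 1 - p_infected_given_listed       # 0.9999 -> 99.99%

print(f"{p_infected_given_listed:.2%}  {p_listed_given_infected:.2%}  {p_not_infected_given_listed:.2%}")
```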
At this point, you might think this model is inadequate. However, let’s consider another angle. If we assume the null hypothesis that any individual is not infected, what is the probability that they would appear on the list purely by chance?
From the gathered data, it’s evident that 99,990 individuals appeared on the list by chance. The total number of uninfected individuals was 199,990,000. Therefore, the probability that any individual would make it to the list by chance is calculated as follows:
Probability that any individual would appear on the list by chance = 99,990/199,990,000 ≈ 0.05%.
An uninfected individual thus has roughly a 1 in 2,000 chance of being misclassified. Applying R.A. Fisher’s conventional threshold of 1 in 20 (5%) for statistical significance (for further details, see my essay on understanding statistical significance), 0.05% falls far below the cutoff, so we could reject the null hypothesis. In other words, we could assert that your neighbor is at risk of infection, with only a 0.05% probability that the prediction arose by chance.
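Here is a short sketch of that null-hypothesis arithmetic, again using the example's counts and Fisher's conventional 5% cutoff:

```python
false_positives = 99_990          # uninfected people who still appear on the list
uninfected_total = 199_990_000    # everyone in the population who is not infected

# Probability that an uninfected person lands on the list purely by chance
p_listed_by_chance = false_positives / uninfected_total   # ~0.0005, i.e. ~0.05%

fisher_threshold = 0.05           # the conventional 1-in-20 significance cutoff

print(f"P(on the list | not infected) = {p_listed_by_chance:.4%}")               # 0.0500%
print(f"Below the 1-in-20 threshold?  {p_listed_by_chance < fisher_threshold}")  # True
```

The striking part is that both statements, "there is a 99.99% chance your neighbor is not infected" and "the null hypothesis of no infection is rejected," are computed from the very same four-quadrant table.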
Surely, this flawed model doesn’t reflect how genuine tech companies operate, right? Let’s explore a real-world instance.
Real-World Example of Big Tech's Data Usage
In 2006, Netflix launched a competition offering a $1 million prize for a recommendation algorithm that outperformed its existing system by 10%. Participants were given access to a massive dataset containing roughly 100 million anonymized ratings for 17,770 films.
It took three years for anyone to surpass Netflix’s algorithm by that margin. To achieve this, multiple teams collaborated, combining their individual models. Remarkably, even after all this effort, Netflix chose not to implement the winning algorithm.
Why not? By the time the new algorithm was developed, Netflix was shifting its focus from DVDs to online streaming. With streaming, a subpar recommendation costs the viewer only a few clicks to skip past, whereas with mailed DVDs it meant days of waiting for a disc they didn't enjoy.
This scenario sheds light on how big tech leverages data analytics to anticipate human behavior. First, why was Netflix willing to pay $1 million for an algorithm just 10% better than its own? Because a 10% improvement in recommendation quality is worth far more to the business than the $1 million prize.
Second, why did competitors take three years to solve the problem? The complexity of the issue demands substantial resources, and even after improvements, inherent limitations exist regarding how accurately such models can predict outcomes using statistical methods.
Connecting the Dots: Target, Hypotheticals, and Netflix
The key distinction between our hypothetical example and the Target/Netflix cases is that the former involves a relatively rare event, while the latter deal with frequent, everyday behavior. Without getting into excessive technicalities: the rarer the event being predicted, the less useful big data analytics becomes.
Suppose the hypothetical tech firm invests considerable resources to enhance the model’s accuracy. Doubling a very small number still yields a small number, merely converting a ‘very poor’ model into a ‘poor’ model.
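As a rough illustration with hypothetical numbers: suppose the improved model doubles its true positives from 10 to 20 while the predicted list stays at 100,000 names.

```python
predicted_infected = 100_000   # size of the predicted list stays the same

p_before = 10 / predicted_infected   # original model: 0.01% of listed people are infected
p_after = 20 / predicted_infected    # hypothetical doubling of true positives: 0.02%

print(f"Before: {p_before:.2%}   After: {p_after:.2%}")
# A person on the list is still ~99.98% likely to be uninfected.
```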
One innovative company tackling these types of challenges is Palantir, which initially aimed to predict criminal activity using big data analytics. According to Peter Thiel, Palantir achieved breakthroughs by integrating a ‘big-data-plus-human-expert’ methodology.
Given the sensitive nature of these issues, verifying the effectiveness of Palantir’s models is challenging. Even assuming they meet the claims of their developers, it’s clear that big data has inherent limitations. So, what does all this imply about big tech's surveillance of our everyday lives?
The Fallacy of Big Tech's Predictive Prowess
It’s important to note that our hypothetical example does not accurately represent the cutting-edge of big data analytics. With advancements in machine learning, algorithms continue to evolve. Nevertheless, this fictional scenario effectively illustrates the fundamental challenges associated with big data analytics, particularly in terms of their difficulties with fat-tailed rare events.
“So what? These models are generally accurate most of the time, so they must be good at prediction.”
If you’re thinking this way, you may be mistaken. Rare events occur more frequently than we might intuitively believe. While a single rare occurrence is, by definition, rare, some rare event is happening to someone somewhere at all times.
Under identical circumstances, any individual can break an observed pattern at any time, and each such break is, by definition, a rare event. Such is the nature of human behavior: inherently unpredictable and chaotic.
If you want real-world verification, speak with any business owner who has invested in advertising services from Google or Facebook and ask them how effective those models have proven to be. You may be surprised by the absurdities they encounter. When was the last time you felt frustrated by a recommendation on Netflix or Amazon? I suspect it happens more often than you’d care to admit.
Final Thoughts
In conclusion, your ‘smart’ AI assistants—Alexa, Siri, Cortana, and others—are primarily focused on slightly outperforming their rivals (perhaps by 1%) to capture a larger share of the advertising market. These services are not designed to accurately predict every aspect of human behavior.
Even if they aspired to do so, current methodologies are inadequate for predicting all facets of human conduct. The idea that big tech can utilize data for flawless prediction or spying is a fallacy!
Reference and credit: Jordan Ellenberg.
For additional reading, consider: How To Really Understand The Philosophy Of Inferential Statistics? and How To Really Benefit From Curves Of Constant Width?
If you appreciate my work as an author, consider supporting me on Patreon.