The Day Our Models Got Too Smart For Their Own Good: Fighting Data Leakage at Factal
- Ritvik Mandyam
- Mar 10
- 4 min read

Here at Factal, we love data. We love machine learning. And, let’s be honest, we love seeing validation scores go up. There’s a distinct dopamine hit when you look at a metric dashboard and see accuracy hovering around 97%, 98%, or even the mythical 99%.
For a brief, shining moment last week, we thought we were gods. Our latest models - designed to automate some internal categorizations - weren't just performing well; they were operating at an "Oracle" tier of intelligence. We were moments away from popping champagne, retiring early, and letting the AI run the company.
Then, the cynicism set in. We aren't just engineers; we are suspicious engineers. We’ve been hurt before. If a machine learning model seems too good to be true, it’s usually because it is.
We stopped the party planning and started digging into why the models were so incredibly accurate.
The Cheat Sheet in the Data
It turns out we weren't building geniuses; we were building excellent test-takers who had smuggled the answer key into the exam.
Let’s look at a simplified example. Suppose we have a model trying to predict the State where an order originated.
If the input features look like this:
Order_ID: 1234
Item: Widget A
Timestamp: 2023-10-26 14:00
State_Code: "PA"
...and the target we are trying to predict is Pennsylvania.
Our model looked at that data and immediately "learned" that "PA" always meant "Pennsylvania." It was achieving 100% accuracy on that training set because it was essentially a glorified look-up table, not a pattern-recognition engine.
This, dear readers, is Data Leakage. It is the insidious phenomenon where information about the target variable is included in the training data, allowing the model to "predict" the target during testing without actually learning anything useful about how the world works.
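To make the shortcut concrete, here's a minimal sketch of the toy example above. The column names come from the example; the code itself is our illustration, not Factal's actual pipeline:

```python
# A leaky feature turns the model into a look-up table: State_Code
# is just an abbreviation of the target it's supposed to predict.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

orders = pd.DataFrame({
    "Item":       ["Widget A", "Widget B", "Widget A", "Widget C"],
    "State_Code": ["PA", "NY", "PA", "CA"],
    "State":      ["Pennsylvania", "New York", "Pennsylvania", "California"],
})

X = OrdinalEncoder().fit_transform(orders[["Item", "State_Code"]])
y = orders["State"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.score(X, y))  # 1.0 -- "Oracle tier", or just cheating
```

The tree never has to learn anything about vendors or shipping routes; it simply memorizes the State_Code-to-State mapping.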
Why You, The User, Should Care (A Lot)
You might be thinking, "Hey, Factal, why are you complaining? A model that always gets it right is a good model!"
This is the central lie of machine learning. A leaky model is perfectly accurate on past data, and perfectly useless on future data.
If we shipped a model that learned that "PA" = "Pennsylvania," and then you, the user, submitted a new form where the State_Code field was empty or formatted differently, the model would completely collapse. It didn't learn why an order was from Pennsylvania (location of vendor, shipping routes, historical patterns, etc.). It just learned a shortcut.
When we deliver a feature to you that involves AI, our contract is that it will provide value in the real world, not just look good on a leaderboard during testing. Fighting leakage is how we ensure that our automation actually saves you time, rather than requiring you to babysit it when it encounters a single new input.
We prefer "honest, slightly lower accuracy" over "fake 99% accuracy" because the honest model will still work next Tuesday.
How We Solved It: The Two-Step Leakage Filter
Once we realized our models were cheating, we decided to get proactive. We implemented a two-part automated heuristic system to violently reject leaky features before they can corrupt a final model.
Here is how we now protect your data (and our own sanity):
Defense Layer 1: The Mutual Information Filter
First, we apply some statistics. We calculate the Mutual Information between every single input feature and the target variable.
Without getting too math-heavy, Mutual Information measures how much knowing one variable tells you about another. If a feature and the target have a mutual information score that is obscenely high, we flag it. It’s highly likely that the feature is just a proxy for the target (like State Code and State, or Zip Code and City).
If the score passes a certain threshold (our "suspicion limit"), that feature is automatically rejected. It doesn't even get a chance to see the inside of the main training loop.
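As a rough sketch of how such a filter can work (the threshold value here is our assumption, not Factal's production "suspicion limit"), scikit-learn's `mutual_info_classif` makes this a few lines:

```python
# Illustrative mutual-information filter. A feature that is a perfect
# proxy for the target scores near the target's own entropy.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
target = rng.integers(0, 5, size=1000)          # e.g. 5 possible states
leaky_code = target.copy()                      # perfect proxy, like State_Code
honest_feature = rng.integers(0, 5, size=1000)  # unrelated signal
X = np.column_stack([leaky_code, honest_feature])

scores = mutual_info_classif(X, target, discrete_features=True, random_state=0)
SUSPICION_LIMIT = 1.0  # nats; a copy of a 5-class target scores ~log(5) ≈ 1.6

kept = [i for i, s in enumerate(scores) if s <= SUSPICION_LIMIT]
print(kept)  # only the honest column (index 1) survives
```

The leaky column gets an obscenely high score and is rejected before any real training happens; the unrelated column scores near zero and passes.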
Defense Layer 2: The "Bouncer" Model
Sometimes, leakage isn't straightforward. It might take a combination of a few features to leak the target, or the statistical relationship might be subtle enough to bypass the Mutual Information filter.
For this, we deploy what we call the Bouncer Model. This is a very simple, incredibly lightweight model (think a shallow decision tree) that we can train in seconds.
We give the Bouncer a batch of features and tell it to predict the target. Then, we look at the resulting Feature Importances.
If the Bouncer comes back and says, "Feature X is responsible for 95% of my accuracy," we know exactly what is going on. Feature X is a snitch.
We treat this feature like someone trying to use a fake ID at a very serious club. The Bouncer ejects it immediately.
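A minimal sketch of the Bouncer idea (the 0.95 cutoff and the data are our illustration, not Factal's exact values):

```python
# A shallow "bouncer" tree whose feature importances expose a feature
# that single-handedly explains the target.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
target = rng.integers(0, 3, size=500)
X = np.column_stack([
    target,                # feature 0 leaks the target outright
    rng.normal(size=500),  # feature 1: noise
    rng.normal(size=500),  # feature 2: noise
])

bouncer = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, target)
snitches = [i for i, imp in enumerate(bouncer.feature_importances_)
            if imp > 0.95]
print(snitches)  # [0] -- feature 0 gets ejected
```

Because the tree is tiny, this check costs seconds, and a feature importance near 1.0 is an unambiguous tell: the Bouncer needed nothing else to ace the prediction.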
The Result: Trustworthy Automation
After implementing these defenses, our models stopped being magically 99% accurate. They dropped down to much more realistic numbers (like 85-90% accuracy).
But while the scores went down, our confidence went way up.
By iterating through this process - detecting leakage, rejecting features, and retraining until we don't find any "cheat codes" - we are forcing the models to actually learn the difficult patterns in the data. We know that the models we build today will continue to provide accurate, reliable results for Factal users tomorrow, next month, and when the input data changes.
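Sketched end to end, the loop above might look like the following. The thresholds, the accuracy gate on the bouncer, and the toy data are all illustrative assumptions, not Factal's production values:

```python
# Detect leakage, reject features, retrain -- repeat until clean.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

def leaky_columns(X, y, mi_limit=1.0, imp_limit=0.95, acc_limit=0.9):
    """Union of both filters: obscene mutual information, or a near-perfect
    bouncer leaning on a single feature. All limits are assumed values."""
    flagged = {i for i, s in enumerate(
        mutual_info_classif(X, y, discrete_features=True, random_state=0))
        if s > mi_limit}
    bouncer = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    if bouncer.score(X, y) > acc_limit:  # only trust importances if it "aced" it
        flagged |= {i for i, v in enumerate(bouncer.feature_importances_)
                    if v > imp_limit}
    return flagged

rng = np.random.default_rng(1)
y = rng.integers(0, 4, size=800)
X = np.column_stack([y,                              # leaky proxy
                     rng.integers(0, 4, size=800),   # honest
                     rng.integers(0, 4, size=800)])  # honest

kept = list(range(X.shape[1]))
while (bad := leaky_columns(X[:, kept], y)):
    kept = [c for j, c in enumerate(kept) if j not in bad]
print(kept)  # the leaky column 0 is gone; honest columns survive
```

Each pass strips whatever cheat codes it finds and re-runs the checks on what's left, so combinations that only leak together still get caught on a later iteration.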
It turns out that being a suspicious bastard pays off in the end.



