Why Fraud Detection Models Block the Wrong 1% of Transactions

The false positive next to the missed fraud. The one the model caught is the one that mattered least.

I trained a LightGBM model on the public Kaggle credit card fraud dataset. 284,807 transactions. The fraud rate was 0.17%, a 577 to 1 class imbalance. After tuning, it caught 84% of fraud while flagging just 0.28% of all transactions. ROC-AUC over 0.95. PR-AUC over 0.90.

Then I asked a different question. Not “is the model good” but “what does this cost a real payments business.”

At a million transactions a month, 0.28% is 2,800 flagged transactions. With 84% of roughly 1,700 fraud cases caught, about 1,430 of those flags are real fraud. The other 1,370 or so are good customers at checkout watching their card decline.

The model is working. The business is bleeding.

Why does the wrong 1% cost more than the fraud you stop?

A blocked legitimate customer costs you their acquisition spend plus their lifetime value. A chargeback costs you the disputed amount plus a fee. For a fintech, that means 100 to 300 dollars of CAC on the line versus 50 to 100 dollars on a typical chargeback. LTV widens the gap. The math rarely favors blocking aggressively.

The asymmetry compounds over time. A chargeback is a single event with a known dollar value. A blocked customer is a lifetime of revenue gone. They tell people. They never come back. The cost lands in marketing’s CAC report two quarters later, far from the fraud dashboard that created it.

This is one reason chasing 95% accuracy is the wrong fight. The accuracy number hides the asymmetry.
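To see how little the accuracy number says on data this imbalanced, here is a minimal sketch of the do-nothing baseline. It uses only the dataset size and the 0.17% fraud rate quoted above; the function name `score_all_legit_baseline` is mine, for illustration.

```python
def score_all_legit_baseline(n_total, fraud_rate):
    """Metrics for a 'model' that approves every transaction."""
    n_fraud = round(n_total * fraud_rate)
    n_legit = n_total - n_fraud
    # Nothing is flagged: every legitimate transaction is a true negative,
    # every fraudulent one is a false negative.
    tp, fp, fn, tn = 0, 0, n_fraud, n_legit
    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "recall": recall, "missed_fraud": fn}


baseline = score_all_legit_baseline(n_total=284_807, fraud_rate=0.0017)
print(f"accuracy={baseline['accuracy']:.4f}  recall={baseline['recall']:.0%}")
```

The baseline clears 99.8% accuracy while catching no fraud at all, which is why accuracy cannot be the metric you tune against.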

What does the precision-recall trade-off actually look like?

Figure: Precision-recall curve from the portfolio model. Each point is a threshold. Each threshold is a different business outcome.

A precision-recall curve shows what you give up at each threshold. Move the threshold down, you catch more fraud and flag more good customers. Move it up, you flag fewer transactions but miss more fraud. The curve is the whole conversation.

On the Kaggle dataset I worked with, the model hit 84% recall at a 0.28% overall alert rate. Push recall to 90% and the alert rate climbs. Push it to 95% and you are flagging closer to 1.5% of all traffic. At a million transactions a month, the difference between 0.28% and 1.5% is roughly 12,200 extra customers held up at checkout.
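One way to read those operating points off a scored held-out set, sketched in plain Python. The labels and scores below are toy values, not the Kaggle results, and `alert_rate_for_recall` is a hypothetical helper name.

```python
def alert_rate_for_recall(y_true, y_scores, target_recall):
    """Return (threshold, alert_rate) for the strictest cutoff that
    reaches at least `target_recall` on the given labels and scores."""
    n_fraud = sum(y_true)
    if n_fraud == 0:
        raise ValueError("need at least one fraud label")
    # Highest scores first, so walking the list lowers the threshold.
    ranked = sorted(zip(y_scores, y_true), reverse=True)
    caught = 0
    for i, (score, label) in enumerate(ranked):
        caught += label
        # Only stop at the end of a run of tied scores, because a
        # threshold flags every transaction with score >= threshold.
        end_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] < score
        if end_of_tie and caught >= target_recall * n_fraud:
            return score, (i + 1) / len(ranked)
    raise ValueError("target recall is not reachable")


# Toy data: 4 fraud cases among 10 transactions (illustrative only).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.3, 0.1, 0.1, 0.05, 0.05]

for target in (0.5, 0.75, 1.0):
    threshold, alert_rate = alert_rate_for_recall(toy_labels, toy_scores, target)
    print(f"recall>={target:.0%}: threshold={threshold}, alert rate={alert_rate:.0%}")
```

Multiply the alert rate by monthly volume to get the number of transactions held up at each recall target.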

That is the trade. Not “is the model accurate” but “how many good customers can you afford to block to catch a few more fraudsters.”

The full LightGBM model and PR analysis are on my GitHub, and the full walkthrough of the model behind these numbers covers the imbalanced training setup.

Why do most teams default to high recall and lose customers?

Risk teams report up to compliance or to a CFO worried about fraud loss. Their dashboards track fraud caught and chargebacks prevented. Those are the numbers the board sees.

Nobody on the risk side owns the count of good customers blocked. Marketing might track it as declined checkout abandonment, and support sees it as card-declined tickets. The growth team watches CAC creep up and blames the ad channel.

The data exists. It just sits in different tools owned by different people. So the threshold moves one way only: down, toward blocking more. The team paying the cost has no hand on the lever.

This is the gap most fraud detection conversations miss. The model is rarely the problem. The org chart is.

If your risk team can quote fraud caught but can’t quote good customers blocked, that’s the gap. Book a call.

How do you pick the right threshold?

You weight each error by what it costs your business.

Define two costs. Let C_fp be the cost of one false positive (blocked good customer) and C_fn the cost of one false negative (missed fraud). For each candidate threshold, the model produces a false positive rate and a false negative rate on a held-out set. Total expected loss per transaction is:

Expected cost per transaction = C_fp * FPR * P(legit) + C_fn * FNR * P(fraud)

Here P(fraud) is the base fraud rate and P(legit) = 1 - P(fraud).

You pick the threshold that minimizes the total cost.

Figure: Cost-weighted threshold diagram. Two cost columns. One slider. The optimum is where total cost is lowest, not where recall is highest.

Here is the math with real numbers. Suppose you process one million transactions a month at a 0.17% fraud rate. Average fraud transaction is 150 dollars. Average blocked customer costs you 80 dollars in lost LTV plus support time.

At threshold A: FPR is 0.28%, FNR is 16%. Expected monthly cost is 2,795 blocked customers times 80 dollars plus 272 missed fraud transactions times 150 dollars. That is 223,600 plus 40,800. Total: 264,400 dollars.

At threshold B (more lenient): FPR is 0.1%, FNR is 25%. Expected monthly cost is 998 blocked times 80 plus 425 missed times 150. That is 79,840 plus 63,750. Total: 143,590 dollars.

Threshold B costs less. The model did not change.
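The arithmetic for the two thresholds can be checked in a few lines. This is a sketch that rounds counts to whole transactions the way the text does; `monthly_cost` is an illustrative name, and every input comes from the example above.

```python
def monthly_cost(n_transactions, fraud_rate, fpr, fnr, cost_fp, cost_fn):
    """Monthly cost of blocked good customers plus missed fraud."""
    n_fraud = round(n_transactions * fraud_rate)
    n_legit = n_transactions - n_fraud
    blocked = round(n_legit * fpr)   # false positives
    missed = round(n_fraud * fnr)    # false negatives
    return blocked * cost_fp + missed * cost_fn


common = dict(n_transactions=1_000_000, fraud_rate=0.0017, cost_fp=80, cost_fn=150)
cost_a = monthly_cost(fpr=0.0028, fnr=0.16, **common)
cost_b = monthly_cost(fpr=0.001, fnr=0.25, **common)
print(f"Threshold A: ${cost_a:,}")
print(f"Threshold B: ${cost_b:,}")
```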

Here is the snippet that does the search:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Probabilities from your trained LightGBM model on a held-out set
y_true = np.array([...])       # 0 = legitimate, 1 = fraud
y_scores = np.array([...])     # model.predict_proba(X_test)[:, 1]

# Business costs (set these from your finance and growth teams)
cost_fp = 80     # blocked good customer: LTV loss plus support time
cost_fn = 150    # missed fraud transaction: chargeback plus fees

# Candidate thresholds: the distinct scores sklearn returns,
# plus +inf so that "flag nothing" is also considered
_, _, thresholds = precision_recall_curve(y_true, y_scores)
candidates = np.append(thresholds, np.inf)

best_t, best_cost = None, float("inf")
for t in candidates:
    flagged = y_scores >= t
    fp = (flagged & (y_true == 0)).sum()     # good customers blocked
    fn = (~flagged & (y_true == 1)).sum()    # fraud let through
    total_cost = fp * cost_fp + fn * cost_fn
    if total_cost < best_cost:
        best_cost, best_t = total_cost, t

print(f"Optimal threshold: {best_t:.4f}")
print(f"Cost on the held-out set at this threshold: ${best_cost:,.0f}")

Run this on a representative held-out set. Update cost_fp and cost_fn as your business changes.
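The loop above sums costs over the held-out set, so the total reflects that sample's size rather than a month of traffic. A rough way to project it, assuming the held-out set has the same fraud rate and customer mix as production (`scale_to_monthly` is a hypothetical helper, and the figures in the call are placeholders):

```python
def scale_to_monthly(held_out_cost, n_held_out, monthly_volume):
    """Project a held-out-set cost onto a month of transactions."""
    if n_held_out <= 0:
        raise ValueError("held-out set must contain at least one transaction")
    return held_out_cost * monthly_volume / n_held_out


# Placeholder numbers: a $15,000 cost on 50,000 held-out transactions,
# projected to one million transactions a month.
projected = scale_to_monthly(15_000, 50_000, 1_000_000)
print(f"Projected monthly cost: ${projected:,.0f}")
```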

What changes when your fraud loss tolerance changes?

The cost weights change with your business model. A bank pays the chargeback. A marketplace eats some of it. A payment processor charges the merchant and walks away. A B2B SaaS fintech might rarely see fraud at all but cares deeply about onboarding friction. Each one has a different optimal threshold on the same model.

Figure: Tolerance comparison across four business types. Same model, four businesses, four cutoffs. The tolerance is the lever, not the algorithm.

| Business type | Fraud loss tolerance | Customer experience priority | Typical FPR target |
| --- | --- | --- | --- |
| High-volume marketplace | Higher (1 to 2% of GMV) | High (repeat purchase economy) | 0.5% to 1.5% |
| Consumer fintech app | Low (regulator scrutiny) | High (acquisition is expensive) | Under 0.3% |
| Card-issuing bank | Very low (liability) | Medium (sticky customers) | 0.3% to 0.5% |
| B2B payment processor | Moderate (merchant absorbs cost) | Medium (contract-driven) | 0.5% to 1% |

The takeaway is not a number. It is that the same LightGBM model works for all four with a different cutoff. If your fraud team is using a default threshold across all customer segments, ask why.
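A small sketch of that point: the same two operating points, scored under two different cost profiles. The FPR and FNR pairs are thresholds A and B from the worked example. The second cost profile and both profile names are made up for illustration, not benchmarks for any business type.

```python
FRAUD_RATE = 0.0017

# Thresholds A and B from the worked example earlier in the article.
OPERATING_POINTS = {
    "A": {"fpr": 0.0028, "fnr": 0.16},
    "B": {"fpr": 0.001, "fnr": 0.25},
}

# Hypothetical cost profiles in dollars per error.
COST_PROFILES = {
    "expensive_customers": {"cost_fp": 80, "cost_fn": 150},
    "expensive_fraud": {"cost_fp": 5, "cost_fn": 500},
}


def expected_cost_per_txn(point, costs, fraud_rate=FRAUD_RATE):
    """C_fp * FPR * P(legit) + C_fn * FNR * P(fraud)."""
    return (costs["cost_fp"] * point["fpr"] * (1 - fraud_rate)
            + costs["cost_fn"] * point["fnr"] * fraud_rate)


def best_operating_point(costs):
    """Name of the operating point with the lowest expected cost."""
    return min(OPERATING_POINTS,
               key=lambda name: expected_cost_per_txn(OPERATING_POINTS[name], costs))


for profile, costs in COST_PROFILES.items():
    print(profile, "->", best_operating_point(costs))
```

When a blocked customer is the expensive error, the lenient threshold B wins. When missed fraud is the expensive error, the stricter threshold A wins. The model never changes.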

What should a founder ask the risk team this week?

Four questions. Each one isolates a piece of the gap.

  1. What is our false positive rate, expressed as a percentage of all transactions and as a count of blocked customers per million? If the team cannot say, the metric is not owned.
  2. What threshold are we using, and when did we last review it? Models drift, and so does customer mix. A threshold set 18 months ago is almost certainly wrong now.
  3. What is the dollar value of legitimate transactions we blocked last month? This is the line item that closes the loop with finance. If it is missing, you cannot compare it against fraud loss.
  4. If we lowered the threshold by 10%, what would happen to recall and to the false positive count? This is a one-afternoon analysis on historical data. There is no excuse for not having an answer.
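Question 4 can be sketched as a before-and-after comparison on historical scores. I am reading "lowered by 10%" as a relative change to the current cutoff, which is an assumption. The data is a toy sample and the function names are illustrative.

```python
def threshold_report(y_true, y_scores, threshold):
    """Recall and false positive count at a single threshold."""
    n_fraud = sum(y_true)
    if n_fraud == 0:
        raise ValueError("need at least one fraud label")
    flagged = [s >= threshold for s in y_scores]
    tp = sum(1 for y, f in zip(y_true, flagged) if f and y == 1)
    fp = sum(1 for y, f in zip(y_true, flagged) if f and y == 0)
    return {"threshold": threshold, "recall": tp / n_fraud, "false_positives": fp}


def compare_thresholds(y_true, y_scores, current, relative_change=-0.10):
    """Report metrics at the current cutoff and at the shifted one."""
    proposed = current * (1 + relative_change)
    return (threshold_report(y_true, y_scores, current),
            threshold_report(y_true, y_scores, proposed))


# Toy historical sample: 4 fraud cases among 10 transactions.
hist_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
hist_scores = [0.9, 0.8, 0.47, 0.2, 0.7, 0.46, 0.1, 0.1, 0.05, 0.05]

before, after = compare_thresholds(hist_labels, hist_scores, current=0.5)
print("before:", before)
print("after: ", after)
```

On real data, run the same comparison on last month's scored transactions and put the two rows side by side.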

If you would rather have someone tune this for you, I work on this with founders directly.

Running a payments or marketplace platform and want a second opinion on your fraud threshold? I work with fintech founders on this exact problem. Book a 20-minute call.

Frequently Asked Questions

What is a false positive in fraud detection?

A legitimate transaction the model flags as fraud. The customer gets declined or held up at checkout even though the purchase was real. False positives are the hidden cost of aggressive fraud thresholds.

Why does false positive rate matter more than accuracy?

Fraud datasets are heavily imbalanced. At a 0.17% fraud rate, a model can score over 99.8% accuracy by predicting everything as not-fraud. False positive rate captures how often the model wrongly blocks legitimate customers, which is the metric a payment platform actually feels.

What is a good false positive rate for a fraud model?

It depends on your fraud loss tolerance and customer acquisition cost. A marketplace can usually tolerate 0.5 to 1.5% false positives, while a high-trust fintech app may need under 0.3%. There is no universal target.

How do I lower false positives without missing fraud?

Tune the threshold, not the model. Most teams overweight model accuracy and underweight threshold selection. A 5% recall reduction at the right threshold often saves more in retained customers than it costs in missed fraud.

What metrics should I ask my fraud team to report?

Three numbers at minimum: recall on fraud, false positive rate, and dollar value of blocked legitimate transactions per million processed. The third one is the metric most teams do not track.

Is high precision or high recall better for fraud detection?

Neither alone. The right answer is the cost-weighted blend. If a missed fraud costs 50 USD and a blocked customer costs 200 USD in lifetime value, precision wins. If the ratio inverts, recall wins.