1 Introduction
Most “AI-powered” lead scoring ships as a black box: a confident number, no evidence, no way to check it. We think that’s backwards.
Agents and team leads make real money decisions on these scores — who to call first, which deal to save, where the next listing is likely hiding. A score that drives money decisions should be held to the same standard as any other business number: show your work. This report is that work, in plain language.
It covers what each of Trove’s five models predicts (Section 2), how we measure them and why the measurement method matters more than the headline figure (Section 3), the current held-out results with a plain-English reading of each (Section 4), the calibration property that makes a probability trustworthy (Section 5), how we cover a whole database rather than a flattering sample (Section 6), and — just as important — the limits we publish on purpose (Section 7). A glossary (Section 9) defines every technical term the first time it is worth defining.
The same discipline runs inside the product. Trove’s Book of Business includes a Model Quality view that grades our own predictions on your account: how many predictions have been checked against real outcomes, and how predicted probabilities compared to what actually happened. A number you can audit is worth more than a bigger number you can’t. That is the whole philosophy.
2 The five models
Trove’s model suite answers five distinct questions about every contact in your book. Each is trained on labeled outcomes from real CRM activity — what contacts actually did, not what a rule guessed they would do.
| Model | The question it answers |
|---|---|
| Win (conversion) | Will this contact convert to a closed deal? Powers the calibrated 0–100 Ace Win Score. |
| Response | Will this contact respond to outreach? Helps order the daily queue and time reach-outs. |
| Churn risk | Will this engaged contact go cold? Powers the Ace Churn Risk read and save-lists. |
| Appointment propensity | Will this contact book an appointment? The mid-funnel signal — who is likely to get on a calendar. |
| Full-book fallback (conversion) | A conversion read for the widest slice of the book — including contacts with very thin history the headline models can’t score confidently. |
All five are trained under a strict rule: a training example may only see information that existed before its outcome cutoff, so a model can never “predict” something it was quietly told. The models are retrained weekly on fresh outcomes, the full scored population is refreshed on a rolling weekly pass, and updated scores land on your contacts and in Follow Up Boss fields nightly. The rest of this report is about how we know any of that actually works.
3 Evaluation methodology
Every number in Section 4 is a held-out evaluation. That phrase carries the entire weight of whether these figures mean anything, so it is worth spelling out.
- A slice of history is set aside before training. The model never sees it — not during training, not for tuning.
- The model predicts on that slice. For each held-out contact it produces a probability, blind to what actually happened.
- Predictions are compared to recorded reality. The held-out examples have known outcomes — the contact converted or didn’t, responded or didn’t, went cold or didn’t — so the comparison is against facts, not assumptions.
Because the model never saw those examples, held-out results measure genuine predictive skill rather than memorization. A model can ace its own training data and be useless in the wild; held-out evaluation is how you catch that before your agents do.
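A minimal sketch of the set-aside step, under stated assumptions. This report does not specify how the held-out slice is chosen; a stable hash of a contact ID is one common approach and is assumed here purely for illustration. The `is_held_out` helper, the 20% share, and the `contact-N` IDs are hypothetical.

```python
import hashlib

def is_held_out(contact_id: str, holdout_percent: int = 20) -> bool:
    """Assign a contact to the held-out slice using a stable hash of its ID.

    The same ID always lands on the same side of the split, so a contact
    cannot drift from held-out into training between runs.
    """
    digest = hashlib.sha256(contact_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_percent

# Hypothetical contact IDs, for illustration only.
contact_ids = [f"contact-{i}" for i in range(1000)]

held_out = [c for c in contact_ids if is_held_out(c)]
train = [c for c in contact_ids if not is_held_out(c)]

# The two slices never overlap: the model trains on one and is graded on the other.
print(len(train), len(held_out), set(train).isdisjoint(held_out))
```

The point of the design is that the assignment is decided before training and never depends on the outcome, so the held-out contacts stay genuinely unseen.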
Why leakage is the silent killer of vendor metrics
The most impressive AI numbers in this industry are often the least trustworthy, and the reason is almost always leakage: the model was accidentally allowed to see a scrap of the future when it made its “prediction.” Imagine training a model to predict whether a deal closes, and one of the columns it learns from is the date the deal closed. It will look nearly perfect in testing and fall apart in production, because in production that column is empty at the moment you need the prediction.
Leakage is what lets a vendor honestly report a spectacular figure that their product can never reproduce for you. Guarding against it is unglamorous and it is the whole game. Our defense is the time rule from Section 2 — an example may only learn from information that existed before its outcome was known — enforced at training time, plus held-out slices that a model has genuinely never touched. A figure earned this way is lower than a leaky one and it is the only kind worth publishing.
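The time rule can be pictured in a few lines. This is a sketch under stated assumptions, not a description of the production pipeline: the `visible_before_cutoff` helper, the `type` and `recorded_at` field names, and the sample dates are all hypothetical. It reuses the closing-date example above to show a leaky signal being excluded.

```python
from datetime import datetime

def visible_before_cutoff(events, cutoff):
    """Keep only events recorded strictly before the outcome cutoff."""
    return [event for event in events if event["recorded_at"] < cutoff]

# Hypothetical activity history for one training example.
events = [
    {"type": "call", "recorded_at": datetime(2026, 5, 1)},
    {"type": "reply", "recorded_at": datetime(2026, 5, 20)},
    # Recorded after the cutoff: this is the "scrap of the future".
    {"type": "deal_closed", "recorded_at": datetime(2026, 6, 15)},
]
cutoff = datetime(2026, 6, 1)

training_view = visible_before_cutoff(events, cutoff)
print([event["type"] for event in training_view])  # ['call', 'reply']
```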
The loop, not the lab result
Just as important: this is not a one-time lab number. Evaluation here is a weekly loop, illustrated in Figure 1.
Three parts of that loop deserve a sentence each:
- The champion/challenger guard. A retrained candidate does not ship just because it is newer. It has to evaluate at least as well as the model currently deployed, or the incumbent stays. A candidate that quietly regressed never reaches your account.
- Continuous outcome labeling. Live predictions are compared to what actually happened as it happens — did the contact respond or book within 24 hours? Within 7 days? Did they go cold? — with closed-deal results feeding a longer-horizon track. Those labels are what the next retrain and the next recalibration learn from.
- Per-account calibration verification. The loop is not just internal. The Model Quality view inside Trove runs the predicted-versus-actual comparison on your own account’s predictions, so the claim in Section 5 is checkable on your data, not just ours.
Figure 2 is the part of the loop most vendors skip: every prediction is kept and later checked against the outcome it claimed.
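A sketch of what checking a stored prediction against its outcome can look like, assuming each prediction keeps a timestamp and each response is timestamped when it arrives. The `label_window` helper and its parameters are hypothetical. The 24-hour and 7-day windows come from this report; the three-way result (yes, no, not mature yet) is an illustrative assumption.

```python
from datetime import datetime, timedelta

def label_window(predicted_at, responded_at, window, now):
    """Label one prediction for one outcome window.

    Returns True if the response arrived inside the window, False if the
    window closed without one, and None while the window is still open.
    """
    deadline = predicted_at + window
    if responded_at is not None and predicted_at <= responded_at <= deadline:
        return True
    if now >= deadline:
        return False
    return None  # outcome has not matured yet

predicted_at = datetime(2026, 7, 1, 9, 0)
responded_at = datetime(2026, 7, 3, 14, 0)  # a reply two days later
now = datetime(2026, 7, 10, 9, 0)

print(label_window(predicted_at, responded_at, timedelta(hours=24), now))  # False
print(label_window(predicted_at, responded_at, timedelta(days=7), now))    # True
```

Returning `None` for an open window keeps immature outcomes from being counted as failures before the clock has run out.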
4 Results: held-out AUC
The standard measure for models like these is AUC — area under the ROC curve. Table 1 gives the held-out AUC from the July 2, 2026 evaluation of the five production models, with a plain reading of what each figure means for the work.
| Model | Held-out AUC | What this means operationally |
|---|---|---|
| Win (conversion) | 0.9654 | The strongest ranker in the suite. Sorts the book so eventual closers sit near the top — a call-from-the-top list is dense with real deals. |
| Response | 0.9243 | Orders who will reply, so outreach and the daily queue lead with the contacts most likely to answer today. |
| Churn risk | 0.9092 | Separates engaged contacts about to go quiet from those who will stay warm, so save-lists target the right people before they cool. |
| Appointment propensity | 0.8890 | Flags who is close to getting on a calendar — the mid-funnel nudge between “replied” and “under contract.” |
| Full-book fallback (conversion) | 0.7473 | Covers the thin-history slice the headline models can’t score, ranking roughly three of four pairs correctly where the alternative is no read at all (see Section 6). |
Table 1. Held-out AUC by model, July 2, 2026 evaluation. 0.50 is coin-flip ranking; 1.00 is a perfect ranking.
How to read an AUC: the ROC walkthrough
AUC answers one specific question — how well does the model rank? Here is the whole idea in one sentence you can hold onto:
Pick a random contact who actually converted and a random contact who didn’t. The Win model ranks the converter higher than the non-converter 96.5% of the time — that is exactly what an AUC of 0.9654 means.
A coin flip scores 0.5, which is no ranking skill at all. A perfect ranker scores 1.0. Every figure in Table 1 is this same head-to-head win rate, computed across every positive / negative pair in the held-out data for that model’s outcome (converter versus non-converter for the Win model). It is a measure of order, and order is what a prioritized call list is made of.
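The head-to-head definition can be computed directly. The sketch below is a plain pairwise implementation on a tiny made-up sample, not the evaluation code behind Table 1; the `pairwise_auc` name and the scores are hypothetical. Ties are counted as half a win, which is the usual convention.

```python
def pairwise_auc(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical held-out sample: 1 = converted, 0 = did not.
scores = [0.90, 0.80, 0.35, 0.30, 0.10]
labels = [1, 1, 0, 1, 0]

# Six pairs in total; the converter scored 0.30 loses to the 0.35 non-converter.
print(round(pairwise_auc(scores, labels), 3))  # 0.833
```

The double loop is fine for a demonstration; on large evaluation sets a rank-based formula gives the same number much faster.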
What AUC is not — read this before quoting a number
An AUC of 0.9654 does not mean the model is “96.5% accurate.” AUC says nothing about the percentage of predictions that are “correct.” It measures ranking quality — whether the model puts likelier converters ahead of less likely ones — not the share of individual calls it gets right.
We deliberately avoid quoting “accuracy” at all, because for rare outcomes it is a broken metric. In a book where 1% of contacts convert, a useless model that predicts “won’t convert” for everyone is 99% “accurate” and helps you with nothing. Ranking quality (AUC) and calibration (Section 5) are the honest measures for this job.
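The arithmetic behind that claim is short enough to run. This sketch uses a hypothetical book of 10,000 contacts with a 1% conversion rate and a "model" that always predicts no conversion.

```python
total_contacts = 10_000
converters = 100  # 1% of the book

labels = [1] * converters + [0] * (total_contacts - converters)
predictions = [0] * total_contacts  # always "won't convert"

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = correct / total_contacts
converters_found = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(accuracy, converters_found)  # 0.99 0
```

The same do-nothing model gives every contact an identical score, so every converter / non-converter pair is a tie and its AUC is 0.5, which is the coin-flip floor.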
One more caveat you should hear from us: AUC depends on the population you evaluate on. A book with many clearly inactive contacts is easier to rank than a set of look-alike warm leads, so the same model can post different AUCs on different slices of data (Section 7 puts a number on this). Read Table 1 as ranking skill on our held-out evaluation data, a dated and checkable snapshot, not a universal grade that applies identically everywhere.
5 Calibration: the claim that matters most
AUC tells you the ranking is good. It does not tell you that a score of 30 means a 30% chance — a model can rank perfectly while its probabilities run systematically hot or cold. That second property is calibration, and for an agent deciding how to spend a morning, it is the one that matters most. AUC orders your book; calibration is what lets you trust the number attached to each name.
Here is the claim, stated plainly: a Win Score of 30 means roughly a 30% chance of converting. And here is why we can make it. Every prediction the system makes is recorded and then labeled against real outcomes — did the contact respond or book within 24 hours? Within 7 days? Did they go cold? Closed-deal results feed a longer-horizon track. The win-score probability mapping is then re-fit against those observed rates: a monotone recalibration that never changes the order of contacts, only pulls the stated probabilities into line with what your market actually produced. Because it is re-fit as outcomes mature, the probabilities move with the market instead of drifting away from it.
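To make "monotone recalibration" concrete, here is a sketch using isotonic regression (the pool-adjacent-violators algorithm), one standard monotone method. This report does not say which method is used, so treat the choice as an assumption; the `fit_monotone_map` and `calibrate` names and the sample data are hypothetical. The fitted map is a non-decreasing step function, so it can create ties but never reverses the order of two contacts.

```python
from itertools import groupby

def fit_monotone_map(raw_scores, outcomes):
    """Fit a non-decreasing step function from raw score to observed rate.

    Returns a list of (upper_score, calibrated_rate) steps sorted by score.
    """
    pairs = sorted(zip(raw_scores, outcomes))
    blocks = []  # each block is [sum_of_outcomes, count, upper_score]
    for score, group in groupby(pairs, key=lambda pair: pair[0]):
        values = [outcome for _, outcome in group]
        blocks.append([float(sum(values)), len(values), score])
        # Pool neighbours while an earlier block has a higher rate than a later one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def calibrate(raw_score, steps):
    """Look up the calibrated rate for a raw score."""
    for upper, rate in steps:
        if raw_score <= upper:
            return rate
    return steps[-1][1]

# Hypothetical labeled predictions: raw model score and what actually happened.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 0, 1, 0, 1, 1]

steps = fit_monotone_map(raw_scores, outcomes)
print(calibrate(0.35, steps))  # 0.5
```

In the sample, the contacts scored 0.3 and 0.4 are out of order (one converted, the higher-scored one did not), so they are pooled into a single step at their shared observed rate of 0.5.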
The reliability curve, in concept
The standard way to see calibration is a reliability curve: group predictions into bands, and for each band plot the average predicted probability against the rate the outcome actually occurred. A perfectly calibrated model lands on the diagonal — predicted equals observed. Figure 4 shows the shape you are looking for.
The practical consequence: a Smart List of contacts scored around 30 should convert at roughly half the rate of a list scored around 60. Both the ordering and the magnitude are meaningful — which is what separates a probability from a vanity score. And you do not have to take this section on faith: the Model Quality view inside Trove plots exactly this curve on your own account’s predictions, band by band, so calibration is something you can check rather than something we assert.
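The reliability curve described above is a simple grouping exercise. The sketch below assumes ten equal-width bands and uses made-up predictions; the `reliability_curve` name and the data are hypothetical, and this is not the code behind the Model Quality view.

```python
def reliability_curve(probabilities, outcomes, n_bands=10):
    """Group predictions into equal-width bands and compare predicted to observed.

    Returns one (mean_predicted, observed_rate, count) row per non-empty band.
    """
    bands = [[] for _ in range(n_bands)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bands), n_bands - 1)  # p == 1.0 goes in the top band
        bands[index].append((p, y))
    rows = []
    for members in bands:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_predicted, observed_rate, len(members)))
    return rows

# Hypothetical predictions clustered around 0.35 and 0.65.
probabilities = [0.32, 0.35, 0.38, 0.33, 0.37, 0.62, 0.65, 0.68, 0.63, 0.67]
outcomes = [0, 1, 0, 0, 0, 1, 0, 1, 1, 0]

for mean_predicted, observed_rate, count in reliability_curve(probabilities, outcomes):
    print(f"{mean_predicted:.2f} {observed_rate:.2f} {count}")
```

Each printed row is one point on the curve: the closer the first two numbers are, the closer that band sits to the diagonal. With only five contacts per band, as here, gaps are mostly noise; real checks need far larger bands.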
6 Coverage tiers: why we publish a 0.7473 model
The number in Table 1 that tells you the most about how we operate isn’t the 0.9654. It’s the 0.7473.
The four headline models do their best work where there is engagement history to read — calls, texts, replies, appointments. But a big share of any real database is thin: old imports, leads that never engaged, contacts with barely any recorded activity. A model tuned for rich signal cannot score those contacts confidently, and there are two dishonest ways to handle that: leave the slice silently unscored, or pretend the strong models cover it. We do neither.
Instead, contacts are scored in coverage tiers. A contact with enough history is read by the headline models. A contact too thin for a confident headline read falls to a full-book fallback conversion model, built to cover the widest possible slice of the book with the thinnest available signal — so the whole database gets a read, not just the flattering part of it.
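The routing idea can be sketched as a single decision. How "enough history" is actually measured is not stated in this report, so the event-count threshold, the `MIN_EVENTS_FOR_HEADLINE` constant, and the `pick_conversion_model` name below are all hypothetical stand-ins.

```python
MIN_EVENTS_FOR_HEADLINE = 5  # hypothetical threshold, for illustration only

def pick_conversion_model(recorded_event_count: int) -> str:
    """Route a contact to the headline Win model or the full-book fallback."""
    if recorded_event_count >= MIN_EVENTS_FOR_HEADLINE:
        return "win"
    return "full_book_fallback"

print(pick_conversion_model(12))  # win
print(pick_conversion_model(1))   # full_book_fallback
```

Whatever the real criterion is, the property that matters is that every contact gets routed somewhere, so no slice of the book is silently left unscored.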
The fallback’s held-out AUC is 0.7473. That is meaningfully below the headline models — and still far better than chance: a coin flip is 0.5, and 0.7473 means the fallback puts the eventual converter ahead of a random non-converter roughly three times out of four. For a portion of your database that would otherwise have no model read at all, that is a genuinely useful ranking. When a contact accumulates richer history, the stronger models take over.
We publish the lower number on purpose. Widest coverage, thinnest signal, honestly labeled. A vendor that only shows you its best number is telling you something about all its numbers.
7 Limitations
A report that only lists strengths is marketing. Here is what these models cannot do, stated as plainly as the results.
They rank and estimate. They don’t guarantee. A contact at 80 fails to convert one time in five; a contact at 10 occasionally closes. A calibrated probability is a planning tool for allocating attention across a whole book — it is not a verdict on any single person.
Population dependence is real, and here is the size of it. Because AUC depends on the evaluation population (Section 4), the same win model that posts 0.9654 on our held-out slice ranks in the high 0.80s on a harder, look-alike slice of warm leads where every contact already looks promising. Both are honest; they are simply different questions. Treat the headline figures as ranking skill on the stated evaluation data, and expect your own account’s numbers — visible in Model Quality — to sit somewhere in that range depending on how your book is composed.
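A toy example shows why the evaluation population moves the number even when the model and its scores stay the same. All scores below are invented for illustration and the `auc` helper is a plain pairwise implementation; none of it reproduces the figures quoted above.

```python
def auc(positive_scores, negative_scores):
    """Pairwise AUC: share of pairs where the positive outscores the negative."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical scores from one model.
converters = [0.7, 0.5]
warm_non_converters = [0.6, 0.4]  # look-alike leads, hard to separate
inactive_non_converters = [0.05, 0.03, 0.02, 0.01]  # clearly inactive, easy to separate

warm_only = auc(converters, warm_non_converters)
whole_book = auc(converters, warm_non_converters + inactive_non_converters)

print(round(warm_only, 3), round(whole_book, 3))  # 0.75 0.917
```

Adding the easy-to-rank inactive contacts raises the AUC without the model getting any better at the hard pairs, which is why a headline figure should always be read alongside the population it was measured on.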
They measure behavior, not intent. The models learn largely from engagement behavior, a strong but imperfect proxy. The serious buyer who never replies to anything will score low. Use the scores to decide where your personal attention goes first, not as an exclusion filter that writes people off.
Rare outcomes are hard, and honesty about them is the point. Conversion is uncommon in most books, which is exactly why we report ranking and calibration rather than “accuracy,” and why the fallback tier exists rather than a single number stretched over everyone.
Markets shift, so a snapshot is a snapshot. The models retrain weekly and calibration re-fits as outcomes mature, so held-out results move as markets and data change. The figures here are from the July 2, 2026 evaluation; we would rather show a dated, checkable number than an undated, permanent-looking one.
Modeled dollars are always labeled “(est.).” Anywhere a probability is turned into money — expected commission, dollars at risk, weighted forecasts — the figure is computed, not observed, and carries an “(est.)” label everywhere it appears. If we calculated it rather than counted it, we say so.
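As an illustration of the labeling rule, the sketch below computes an expected commission as probability times commission and attaches the label at formatting time. The formula is the ordinary expected-value calculation and is an assumption about how such figures are derived; the function names and dollar amounts are hypothetical.

```python
def expected_commission(win_probability: float, commission_if_closed: float) -> float:
    """Probability-weighted commission: a computed figure, not an observed one."""
    return win_probability * commission_if_closed

def format_modeled_dollars(amount: float) -> str:
    """Format a modeled dollar amount so it always carries the '(est.)' label."""
    return f"${amount:,.0f} (est.)"

# Hypothetical contact: Win Score of 30 on a deal worth $12,000 in commission.
print(format_modeled_dollars(expected_commission(0.30, 12_000)))  # $3,600 (est.)
```

Putting the label inside the one formatting function means a modeled figure cannot be displayed without it.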
Scope note. The Seller Radar 0–100 seller-propensity score is a separate, transparent weighted-signal score — its honesty mechanism is showing you the evidence behind every score. It is not one of the five models measured in this report, and these AUC figures apply to the predictive model suite only.
8 Frequently asked questions
Q. Does a 0.9654 AUC mean the model is 96.5% accurate?
No. AUC is not accuracy. It means that when you pick one contact who converted and one who didn’t, the model ranks the converter higher about 96.5% of the time. It says nothing about the share of individual predictions that are “correct” — and for rare outcomes, “accuracy” is a misleading metric anyway, which is why we report ranking quality and calibration instead.
Q. What is a held-out evaluation, and why does it matter?
A slice of labeled history is set aside before training and never shown to the model. The model then predicts on that slice, and its predictions are compared to the real recorded outcomes. Because the model never saw those examples, the result measures genuine predictive skill rather than memorization — and it is the discipline that prevents leakage, where a model is accidentally allowed to see the future and posts a number it can’t reproduce in production.
Q. How often are the models re-evaluated?
The models retrain weekly with fresh held-out evaluations, and a candidate that evaluates worse than the deployed model is never promoted. Between retrains, live predictions are continuously labeled against real 24-hour and 7-day results, calibration is re-fit against those observed rates, and scores sync to Follow Up Boss fields nightly. The evaluation is a living loop, not a launch-day certificate.
Q. What does “calibrated” mean for the Ace Win Score?
The stated probability matches observed reality: contacts scored 30 convert at roughly a 30% rate over time. Predictions are labeled against real outcomes and the probability mapping is re-fit against those observed rates as markets shift — and the Model Quality view inside Trove shows the predicted-versus-actual comparison on your own account.
Q. Why publish a 0.7473 model next to ones above 0.90?
Because a real database has a large slice of thin-history contacts the strongest models can’t score confidently. Rather than leave them unscored, a full-book fallback covers that slice with the thinnest available signal. Its 0.7473 is well below the headline models and still far better than a coin flip — it ranks the eventual converter first about three times in four, for contacts that would otherwise get no read at all.
Q. Do these AUC figures apply to the Seller Radar score?
No. The Seller Radar 0–100 seller-propensity score is a separate, transparent weighted-signal score whose honesty mechanism is showing you the evidence behind every score. It is not one of the five predictive models measured here, and these AUC figures apply only to the predictive model suite.
9 Glossary
AUC (area under the ROC curve)
A measure of ranking quality on a scale from 0 to 1. It equals the probability that the model ranks a randomly chosen positive example (e.g. a converter) above a randomly chosen negative one. 0.5 is a coin flip and 1.0 is a perfect ranking; a value below 0.5 would mean the ranking is worse than chance. AUC is not accuracy.
Held-out evaluation
Scoring a model on data set aside before training and never shown to it. Because the model never saw those examples, the result measures real predictive skill rather than memorization — and it is the primary defense against leakage.
Leakage
When a model is accidentally allowed to learn from information that wouldn’t exist at prediction time (a scrap of the future). It inflates test numbers and collapses in production. Enforcing that examples only learn from information predating their outcome is how we prevent it.
Calibration
The property that a stated probability matches the observed rate: contacts scored 30 convert about 30% of the time. A model can rank well (high AUC) yet be poorly calibrated; calibration is re-fit against real outcomes so the number itself is trustworthy.
Outcome labeling
Retaining each prediction and later attaching what actually happened — responded or not, booked or not, went cold or not, closed or not — as the real result matures. These labels feed the weekly retrain and the calibration re-fit.
Coverage tier
Which model scores a given contact, based on how much history that contact has. Rich-history contacts are read by the headline models; thin-history contacts fall to the full-book fallback so the whole database gets a read, not just the flattering part of it.
Champion / challenger guard
The promotion rule: a retrained candidate (challenger) replaces the deployed model (champion) only if it evaluates at least as well on held-out data. A candidate that quietly regressed never reaches your account.
Want the product story around these models? Read the Ace Trove launch announcement for what the suite does day to day, the predictive lead scores guide for what each field means on a contact, or the complete Ace Trove overview for the whole account-wide layer.