How often are the Ace Trove models re-evaluated?

Continuously, on a rolling schedule. The models are retrained and re-evaluated weekly on fresh held-out data, and a retrained candidate is promoted only if it evaluates at least as well as the deployed model. Predictions are outcome-labeled against real 24-hour and 7-day results as they mature, calibration is re-fit against those observed rates, and updated scores land in Follow Up Boss fields nightly. The figures in this report are a dated snapshot of that living process, not a one-time lab result.

What does it mean that the Ace Win Score is calibrated?

A calibrated score is a probability that matches reality: contacts scored 30 should convert at roughly a 30% rate over time. Ace records every prediction, labels it against what actually happened — real 24-hour and 7-day outcomes, with closed-deal results on a longer track — and re-fits the probability mapping against those observed rates as markets shift. AUC tells you the ranking is good; calibration is what makes the number itself trustworthy. Inside Trove, the Model Quality view plots predicted probability against actual outcome rate on your own account's predictions.

How We Evaluate the Ace Trove Models — A Technical Evaluation Report

Q: Does a 0.9654 AUC mean the model is 96.5% accurate?

No. AUC is not accuracy. An AUC of 0.9654 means that if you pick one contact who actually converted and one who didn't, the model ranks the converter higher about 96.5% of the time. It says nothing about the percentage of predictions that are 'correct.' With rare outcomes, accuracy is a misleading metric anyway — a useless model that predicts 'won't convert' for everyone can be over 99% 'accurate' — which is why we report ranking quality (AUC) and calibration instead.

Q: What is a held-out evaluation, and why does it matter?

Before a model is trained, a slice of the labeled data is set aside and never shown to it. After training, the model makes predictions on that held-out slice, and those predictions are compared to the real recorded outcomes. Because the model never saw those examples, the result measures genuine predictive skill rather than memorization. Held-out discipline is also how we prevent leakage — the silent killer of vendor metrics — where a model is accidentally allowed to see information from the future and posts an impressive number it can't reproduce in production. Every AUC figure in this report comes from held-out evaluation, not training-set performance.

Abstract

Ace Trove scores every contact in a Follow Up Boss database with five production machine-learning models: the probability a contact converts to a closed deal (the calibrated Ace Win Score), the probability they respond, the risk an engaged contact goes cold, the likelihood they book an appointment, and a full-book fallback that reaches contacts with the thinnest history. This report documents how those models are measured and what the measurements mean. Every figure comes from held-out evaluation — the model is scored on outcomes it never saw during training — most recently run July 2, 2026. We report area under the ROC curve (AUC) for ranking quality and calibration for probability accuracy, explain why we deliberately avoid the word “accuracy,” and publish the limits: population dependence, rare-outcome caveats, and the fact that a calibrated probability ranks a book — it does not guarantee an individual. The same checks run inside the product, on each account’s own outcomes.

1 Introduction

Most “AI-powered” lead scoring ships as a black box: a confident number, no evidence, no way to check it. We think that’s backwards.

Agents and team leads make real money decisions on these scores — who to call first, which deal to save, where the next listing is likely hiding. A score that drives money decisions should be held to the same standard as any other business number: show your work. This report is that work, in plain language.

It covers what each of Trove’s five models predicts (Section 2), how we measure them and why the measurement method matters more than the headline figure (Section 3), the current held-out results with a plain-English reading of each (Section 4), the calibration property that makes a probability trustworthy (Section 5), how we cover a whole database rather than a flattering sample (Section 6), and, just as important, the limits we publish on purpose (Section 7). Section 8 answers common questions, and a glossary (Section 9) defines every technical term used along the way.

The same discipline runs inside the product. Trove’s Book of Business includes a Model Quality view that grades our own predictions on your account: how many predictions have been checked against real outcomes, and how predicted probabilities compared to what actually happened. A number you can audit is worth more than a bigger number you can’t. That is the whole philosophy.

2 The five models

Trove’s model suite answers five distinct questions about every contact in your book. Each is trained on labeled outcomes from real CRM activity — what contacts actually did, not what a rule guessed they would do.

Model	The question it answers
Win (conversion)	Will this contact convert to a closed deal? Powers the calibrated 0–100 Ace Win Score.
Response	Will this contact respond to outreach? Helps order the daily queue and time reach-outs.
Churn risk	Will this engaged contact go cold? Powers the Ace Churn Risk read and save-lists.
Appointment propensity	Will this contact book an appointment? The mid-funnel signal — who is likely to get on a calendar.
Full-book fallback (conversion)	A conversion read for the widest slice of the book — including contacts with very thin history the headline models can’t score confidently.

All five are trained under a strict rule: a training example may only see information that existed before its outcome cutoff, so a model can never “predict” something it was quietly told. The models are retrained weekly on fresh outcomes, the full scored population is refreshed on a rolling weekly pass, and updated scores land on your contacts and in Follow Up Boss fields nightly. The rest of this report is about how we know any of that actually works.

3 Evaluation methodology

Every number in Section 4 is a held-out evaluation. That phrase carries the entire weight of whether these figures mean anything, so it is worth spelling out.

A slice of history is set aside before training. The model never sees it — not during training, not for tuning.
The model predicts on that slice. For each held-out contact it produces a probability, blind to what actually happened.
Predictions are compared to recorded reality. The held-out examples have known outcomes — the contact converted or didn’t, responded or didn’t, went cold or didn’t — so the comparison is against facts, not assumptions.

Because the model never saw those examples, held-out results measure genuine predictive skill rather than memorization. A model can ace its own training data and be useless in the wild; held-out evaluation is how you catch that before your agents do.
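The three steps above can be sketched in a few lines. This is a minimal illustration, assuming the held-out slice is chosen by date; the report does not say how the slice is selected, and the `split_by_cutoff` name and the record fields are hypothetical.

```python
from datetime import date

def split_by_cutoff(examples, cutoff):
    """Split labeled examples into (train, held_out) by outcome date.

    Hypothetical schema: each example is a dict with an 'outcome_date'
    (datetime.date) and a 'label' (0 or 1). Everything on or after the
    cutoff is set aside and never used for training or tuning.
    """
    train = [ex for ex in examples if ex["outcome_date"] < cutoff]
    held_out = [ex for ex in examples if ex["outcome_date"] >= cutoff]
    return train, held_out

# Synthetic records for illustration only.
examples = [
    {"id": 1, "outcome_date": date(2026, 5, 1), "label": 0},
    {"id": 2, "outcome_date": date(2026, 5, 20), "label": 1},
    {"id": 3, "outcome_date": date(2026, 6, 15), "label": 0},
    {"id": 4, "outcome_date": date(2026, 6, 28), "label": 1},
]
train, held_out = split_by_cutoff(examples, cutoff=date(2026, 6, 1))
train_ids = [ex["id"] for ex in train]
held_out_ids = [ex["id"] for ex in held_out]
```

The only property that matters here is that the two lists never overlap: a model fit on `train` has had no chance to memorize anything in `held_out`.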

Why leakage is the silent killer of vendor metrics

The most impressive AI numbers in this industry are often the least trustworthy, and the reason is almost always leakage: the model was accidentally allowed to see a scrap of the future when it made its “prediction.” Imagine training a model to predict whether a deal closes, and one of the columns it learns from is the date the deal closed. It will look nearly perfect in testing and fall apart in production, because in production that column is empty at the moment you need the prediction.

Leakage is what lets a vendor honestly report a spectacular figure that their product can never reproduce for you. Guarding against it is unglamorous, and it is the whole game. Our defense is the time rule from Section 2 (an example may only learn from information that existed before its outcome was known), enforced at training time, plus held-out slices that a model has genuinely never touched. A figure earned this way is typically lower than a leaky one, and it is the only kind worth publishing.
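As a rough illustration of the time rule, the sketch below filters a contact's events so that only those recorded before a cutoff can become model inputs. The event schema and the `features_before_cutoff` helper are assumptions made for this example, not a description of the production pipeline.

```python
from datetime import datetime

def features_before_cutoff(events, cutoff):
    """Keep only events recorded strictly before the outcome cutoff.

    Hypothetical schema: each event is a dict with a 'type' and a
    'recorded_at' datetime.
    """
    return [e for e in events if e["recorded_at"] < cutoff]

# Synthetic history for one contact.
events = [
    {"type": "call", "recorded_at": datetime(2026, 3, 1, 9, 0)},
    {"type": "reply", "recorded_at": datetime(2026, 3, 4, 14, 30)},
    # The outcome itself: it must never be visible as an input.
    {"type": "deal_closed", "recorded_at": datetime(2026, 4, 2, 11, 0)},
]
cutoff = datetime(2026, 3, 10)
visible = features_before_cutoff(events, cutoff)
visible_types = [e["type"] for e in visible]
```

With this filter in place, the close date from the example above simply never reaches the model, which is the point of the rule.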

The loop, not the lab result

Just as important: this is not a one-time lab number. Evaluation here is a weekly loop, illustrated in Figure 1.

Figure 1. The evaluation loop. Models are trained, evaluated on data held out from training, promoted only through a champion/challenger guard, deployed, then have their live predictions labeled against real outcomes and recalibrated — weekly.

Three parts of that loop deserve a sentence each:

The champion/challenger guard. A retrained candidate does not ship just because it is newer. It has to evaluate at least as well as the model currently deployed, or the incumbent stays. A candidate that quietly regressed never reaches your account.
Continuous outcome labeling. Live predictions are compared to what actually happened as it happens — did the contact respond or book within 24 hours? Within 7 days? Did they go cold? — with closed-deal results feeding a longer-horizon track. Those labels are what the next retrain and the next recalibration learn from.
Per-account calibration verification. The loop is not just internal. The Model Quality view inside Trove runs the predicted-versus-actual comparison on your own account’s predictions, so the claim in Section 5 is checkable on your data, not just ours.
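The champion/challenger guard reduces to a single comparison. The sketch below assumes the comparison metric is held-out AUC measured on the same slice for both models; the report says only "evaluates at least as well", so the metric, the `select_model` name, and the numbers are illustrative assumptions.

```python
def select_model(champion, challenger):
    """Return the model that should be deployed after a retrain.

    Each model is a hypothetical dict with a 'name' and a 'held_out_auc'
    measured on the same held-out slice. The challenger ships only if it
    evaluates at least as well as the champion.
    """
    if challenger["held_out_auc"] >= champion["held_out_auc"]:
        return challenger
    return champion

# Illustrative values, not results from this report.
champion = {"name": "deployed", "held_out_auc": 0.90}
regressed = {"name": "retrain_a", "held_out_auc": 0.88}
improved = {"name": "retrain_b", "held_out_auc": 0.91}

kept = select_model(champion, regressed)["name"]
promoted = select_model(champion, improved)["name"]
```

A production guard would likely also account for noise in the metric, for example by requiring the comparison to hold across several slices; that refinement is outside what the report describes.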

Figure 2 illustrates the part of the loop most vendors skip: every prediction is kept and later checked against the outcome it claimed.

Figure 2. Outcome labeling. Each prediction the system makes is retained and, as the real result matures, compared to what actually occurred. Those labels feed both the weekly retrain and the calibration re-fit: the mechanism that keeps a published number honest between updates.
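Outcome labeling against the 24-hour and 7-day windows can be sketched as follows. The function name, arguments, and the use of `None` for a window that has not matured yet are assumptions made for illustration.

```python
from datetime import datetime, timedelta

WINDOWS = (("24h", timedelta(hours=24)), ("7d", timedelta(days=7)))

def label_prediction(predicted_at, responded_at, now):
    """Label one stored prediction against each outcome window.

    Returns True if the contact responded inside the window, False if the
    window has closed without a response, and None while the window is
    still open (the label has not matured yet).
    """
    labels = {}
    for name, window in WINDOWS:
        deadline = predicted_at + window
        if responded_at is not None and responded_at <= deadline:
            labels[name] = True
        elif now >= deadline:
            labels[name] = False
        else:
            labels[name] = None
    return labels

t0 = datetime(2026, 6, 1, 8, 0)
# Responded on day 3, checked on day 8: missed the 24-hour window, made the 7-day one.
late_reply = label_prediction(t0, t0 + timedelta(days=3), now=t0 + timedelta(days=8))
# No response yet, checked on day 2: the 7-day label is still open.
pending = label_prediction(t0, None, now=t0 + timedelta(days=2))
```

Keeping the "not yet mature" state separate from a negative outcome matters: treating an open window as a miss would bias both the retrain and the calibration re-fit toward pessimism.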

4 Results: held-out AUC

The standard measure for models like these is AUC — area under the ROC curve. Table 1 gives the held-out AUC from the July 2, 2026 evaluation of the five production models, with a plain reading of what each figure means for the work.

Model	Held-out AUC	What this means operationally
Win (conversion)	0.9654	The strongest ranker in the suite. Sorts the book so eventual closers sit near the top — a call-from-the-top list is dense with real deals.
Response	0.9243	Orders who will reply, so outreach and the daily queue lead with the contacts most likely to answer today.
Churn risk	0.9092	Separates engaged contacts about to go quiet from those who will stay warm, so save-lists target the right people before they cool.
Appointment propensity	0.8890	Flags who is close to getting on a calendar — the mid-funnel nudge between “replied” and “under contract.”
Full-book fallback (conversion)	0.7473	Covers the thin-history slice the headline models can’t score, ranking roughly three of four pairs correctly where the alternative is no read at all (see Section 6).

Table 1. Held-out AUC by model, July 2, 2026 evaluation. 0.50 is coin-flip ranking; 1.00 is a perfect ranking.

Figure 3. Held-out AUC by model. The dashed amber line at 0.50 is coin-flip ranking. Conventionally, AUC in the 0.70–0.80 range is strong and 0.80 and above is excellent; four models sit in the excellent band and the fallback sits in the strong band.

How to read an AUC: the ROC walkthrough

AUC answers one specific question — how well does the model rank? Here is the whole idea in one sentence you can hold onto:

The one-sentence reading

Pick a random contact who actually converted and a random contact who didn’t. The Win model ranks the converter higher than the non-converter 96.5% of the time — that is exactly what an AUC of 0.9654 means.

A coin flip scores 0.5 — no ranking skill at all. A perfect ranker scores 1.0. Every figure in Table 1 is this same head-to-head win rate, computed across every convert / non-convert pair in the held-out data. It is a measure of order, and order is what a prioritized call list is made of.
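The head-to-head reading translates directly into code. The sketch below computes AUC by brute force over every positive/negative pair, counting ties as half a win; the function name and the six synthetic contacts are illustrative, and a production evaluation would use a faster rank-based computation.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts as a win when the positive example has the higher score,
    and as half a win when the two scores are tied.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Synthetic data: three converters and three non-converters.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = pairwise_auc(scores, labels)  # 8 of the 9 pairs are ordered correctly
```

Only the order of the scores affects the result. Doubling or halving every score leaves the AUC unchanged, which is why AUC says nothing about whether the probabilities themselves are right.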

What AUC is not — read this before quoting a number

Do not call this “accuracy”

An AUC of 0.9654 does not mean the model is “96.5% accurate.” AUC says nothing about the percentage of predictions that are “correct.” It measures ranking quality — whether the model puts likelier converters ahead of less likely ones — not the share of individual calls it gets right.

We deliberately avoid quoting “accuracy” at all, because for rare outcomes it is a broken metric. In a book where 1% of contacts convert, a useless model that predicts “won’t convert” for everyone is 99% “accurate” and helps you with nothing. Ranking quality (AUC) and calibration (Section 5) are the honest measures for this job.
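The arithmetic behind that claim is short enough to check by hand. The sketch below uses a synthetic book of 1,000 contacts with a 1% conversion rate; the numbers are illustrative only.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the recorded label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Synthetic book: 10 converters out of 1,000 contacts (1%).
labels = [1] * 10 + [0] * 990

# A "model" that predicts "won't convert" for every contact.
always_no = [0] * len(labels)

acc = accuracy(always_no, labels)
converters_found = sum(1 for p, y in zip(always_no, labels) if p == 1 and y == 1)
```

Here `acc` comes out at 0.99 while `converters_found` is zero. Because this model gives every contact the same score, every converter/non-converter pair is a tie, so its ranking skill is exactly a coin flip.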

One more caveat you should hear from us: AUC depends on the population you evaluate on. A book with many clearly-inactive contacts is easier to rank than a set of look-alike warm leads, so the same model can post different AUCs on different slices of data (Section 7 puts a number on this). Read Table 1 as ranking skill on our held-out evaluation data — a dated, checkable snapshot — not a universal grade that applies identically everywhere.

5 Calibration: the claim that matters most

AUC tells you the ranking is good. It does not tell you that a score of 30 means a 30% chance — a model can rank perfectly while its probabilities run systematically hot or cold. That second property is calibration, and for an agent deciding how to spend a morning, it is the one that matters most. AUC orders your book; calibration is what lets you trust the number attached to each name.

Here is the claim, stated plainly: a Win Score of 30 means roughly a 30% chance of converting. And here is why we can make it. Every prediction the system makes is recorded and then labeled against real outcomes — did the contact respond or book within 24 hours? Within 7 days? Did they go cold? Closed-deal results feed a longer-horizon track. The win-score probability mapping is then re-fit against those observed rates: a monotone recalibration that never changes the order of contacts, only pulls the stated probabilities into line with what your market actually produced. Because it is re-fit as outcomes mature, the probabilities move with the market instead of drifting away from it.
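The report does not name the recalibration method. One common monotone technique is isotonic regression, fit with the pool-adjacent-violators algorithm, and the sketch below uses it purely as an assumption to show what "monotone" means in practice. The function names and the eight synthetic predictions are illustrative, and the code assumes distinct raw scores.

```python
def fit_isotonic(scores, outcomes):
    """Fit a non-decreasing step mapping from raw score to observed rate.

    Pool-adjacent-violators: walk the examples in score order and merge
    neighbouring blocks whenever a lower-scored block has a higher observed
    rate than the block after it. Returns (upper_score, rate) pairs.
    """
    blocks = []  # each block: [sum of outcomes, count, highest score in block]
    for s, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, s])
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def calibrate(mapping, score):
    """Look up the calibrated probability for a raw score."""
    for upper, rate in mapping:
        if score <= upper:
            return rate
    return mapping[-1][1]

# Synthetic labeled predictions: raw score and whether the contact converted.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
converted = [0, 0, 1, 0, 0, 1, 1, 1]
mapping = fit_isotonic(raw, converted)
rates = [rate for _, rate in mapping]
```

Because the fitted rates never decrease as the raw score rises, the mapping cannot put a lower-scored contact ahead of a higher-scored one. It can, however, give neighbouring contacts the same calibrated probability.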

The reliability curve, in concept

The standard way to see calibration is a reliability curve: group predictions into bands, and for each band plot the average predicted probability against the rate the outcome actually occurred. A perfectly calibrated model lands on the diagonal — predicted equals observed. Figure 4 shows the shape you are looking for.

Figure 4. A calibration reliability curve (illustrative concept, not account data). Each point is a band of predictions: its horizontal position is the average predicted probability, its vertical position is the rate the outcome actually occurred. A well-calibrated model tracks the dashed diagonal, where predicted equals observed. The Model Quality view inside Trove draws this chart on your own account’s predictions.
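The points on a reliability curve come from a simple grouping step. The sketch below uses equal-width bands; the band count, the `reliability_bins` name, and the synthetic predictions are assumptions, since the report does not say how the Model Quality view bands its predictions.

```python
def reliability_bins(probs, outcomes, n_bins=5):
    """Group predictions into equal-width bands.

    Returns one (mean predicted probability, observed rate, count) tuple
    per non-empty band, ordered from the lowest band to the highest.
    """
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top band
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in sums if n > 0]

# Synthetic, well-calibrated predictions: 1 of 10 at 0.1, 2 of 4 at 0.5, 9 of 10 at 0.9.
probs = [0.1] * 10 + [0.5] * 4 + [0.9] * 10
outcomes = [1] + [0] * 9 + [1, 1, 0, 0] + [1] * 9 + [0]
curve = reliability_bins(probs, outcomes)
```

Each tuple in `curve` is one point on the chart: the first value is the horizontal position and the second is the vertical position. For this synthetic data every point sits on the diagonal.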

The practical consequence: a Smart List of contacts scored around 30 should convert at roughly half the rate of a list scored around 60. Both the ordering and the magnitude are meaningful — which is what separates a probability from a vanity score. And you do not have to take this section on faith: the Model Quality view inside Trove plots exactly this curve on your own account’s predictions, band by band, so calibration is something you can check rather than something we assert.

6 Coverage tiers: why we publish a 0.7473 model

The number in Table 1 that tells you the most about how we operate isn’t the 0.9654. It’s the 0.7473.

The four headline models do their best work where there is engagement history to read — calls, texts, replies, appointments. But a big share of any real database is thin: old imports, leads that never engaged, contacts with barely any recorded activity. A model tuned for rich signal cannot score those contacts confidently, and there are two dishonest ways to handle that: leave the slice silently unscored, or pretend the strong models cover it. We do neither.

Instead, contacts are scored in coverage tiers. A contact with enough history is read by the headline models. A contact too thin for a confident headline read falls to a full-book fallback conversion model, built to cover the widest possible slice of the book with the thinnest available signal — so the whole database gets a read, not just the flattering part of it.
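Tier routing can be pictured as a single branch on how much history a contact has. The threshold, the event-count criterion, and the names below are hypothetical; the report does not state how "enough history" is decided.

```python
# Hypothetical threshold, chosen only for illustration.
MIN_EVENTS_FOR_HEADLINE = 5

def choose_tier(contact):
    """Pick which conversion model reads a contact.

    Hypothetical schema: a contact is a dict with an 'events' list of
    recorded activity (calls, texts, replies, appointments).
    """
    if len(contact["events"]) >= MIN_EVENTS_FOR_HEADLINE:
        return "headline"
    return "fallback"

engaged = {"id": "c1", "events": ["call", "text", "reply", "call", "appointment", "reply"]}
old_import = {"id": "c2", "events": []}

tiers = {c["id"]: choose_tier(c) for c in (engaged, old_import)}
```

Every contact gets exactly one tier, so nothing is left silently unscored, and a contact moves from the fallback to the headline models once its history crosses the threshold.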

The fallback’s held-out AUC is 0.7473. That is meaningfully below the headline models — and still far better than chance: a coin flip is 0.5, and 0.7473 means the fallback puts the eventual converter ahead of a random non-converter roughly three times out of four. For a portion of your database that would otherwise have no model read at all, that is a genuinely useful ranking. When a contact accumulates richer history, the stronger models take over.

We publish the lower number on purpose. Widest coverage, thinnest signal, honestly labeled. A vendor that only shows you its best number is telling you something about all its numbers.

7 Limitations

A report that only lists strengths is marketing. Here is what these models cannot do, stated as plainly as the results.

Published limits

They rank and estimate. They don’t guarantee. A contact at 80 fails to convert one time in five; a contact at 10 occasionally closes. A calibrated probability is a planning tool for allocating attention across a whole book — it is not a verdict on any single person.

Population dependence is real, and here is the size of it. Because AUC depends on the evaluation population (Section 4), the same win model that posts 0.9654 on our held-out slice ranks in the high 0.80s on a harder, look-alike slice of warm leads where every contact already looks promising. Both are honest; they are simply different questions. Treat the headline figures as ranking skill on the stated evaluation data, and expect your own account’s numbers — visible in Model Quality — to sit somewhere in that range depending on how your book is composed.

They measure behavior, not intent. The models learn largely from engagement behavior, a strong but imperfect proxy. The serious buyer who never replies to anything will score low. Use the scores to decide where your personal attention goes first, not as an exclusion filter that writes people off.

Rare outcomes are hard, and honesty about them is the point. Conversion is uncommon in most books, which is exactly why we report ranking and calibration rather than “accuracy,” and why the fallback tier exists rather than a single number stretched over everyone.

Markets shift, so a snapshot is a snapshot. The models retrain weekly and calibration re-fits as outcomes mature, so held-out results move as markets and data change. The figures here are from the July 2, 2026 evaluation; we would rather show a dated, checkable number than an undated, permanent-looking one.

Modeled dollars are always labeled “(est.).” Anywhere a probability is turned into money — expected commission, dollars at risk, weighted forecasts — the figure is computed, not observed, and carries an “(est.)” label everywhere it appears. If we calculated it rather than counted it, we say so.

Scope note. The Seller Radar 0–100 seller-propensity score is a separate, transparent weighted-signal score — its honesty mechanism is showing you the evidence behind every score. It is not one of the five models measured in this report, and these AUC figures apply to the predictive model suite only.

8 Frequently asked questions

Q. Does a 0.9654 AUC mean the model is 96.5% accurate?

No. AUC is not accuracy. It means that when you pick one contact who converted and one who didn’t, the model ranks the converter higher about 96.5% of the time. It says nothing about the share of individual predictions that are “correct” — and for rare outcomes, “accuracy” is a misleading metric anyway, which is why we report ranking quality and calibration instead.

Q. What is a held-out evaluation, and why does it matter?

A slice of labeled history is set aside before training and never shown to the model. The model then predicts on that slice, and its predictions are compared to the real recorded outcomes. Because the model never saw those examples, the result measures genuine predictive skill rather than memorization — and it is the discipline that prevents leakage, where a model is accidentally allowed to see the future and posts a number it can’t reproduce in production.

Q. How often are the models re-evaluated?

Weekly. Each retrain comes with a fresh held-out evaluation, and a candidate that evaluates worse than the deployed model is never promoted. Between retrains, live predictions are continuously outcome-labeled against real 24-hour and 7-day results, calibration is re-fit against those observed rates, and scores sync to Follow Up Boss fields nightly. The evaluation is a living loop, not a launch-day certificate.

Q. What does “calibrated” mean for the Ace Win Score?

The stated probability matches observed reality: contacts scored 30 convert at roughly a 30% rate over time. Predictions are labeled against real outcomes and the probability mapping is re-fit against those observed rates as markets shift — and the Model Quality view inside Trove shows the predicted-versus-actual comparison on your own account.

Q. Why publish a 0.7473 model next to ones above 0.90?

Because a real database has a large slice of thin-history contacts the strongest models can’t score confidently. Rather than leave them unscored, a full-book fallback covers that slice with the thinnest available signal. Its 0.7473 is well below the headline models and still far better than a coin flip — it ranks the eventual converter first about three times in four, for contacts that would otherwise get no read at all.

Q. Do these AUC figures apply to the Seller Radar score?

No. The Seller Radar 0–100 seller-propensity score is a separate, transparent weighted-signal score whose honesty mechanism is showing you the evidence behind every score. It is not one of the five predictive models measured here, and these AUC figures apply only to the predictive model suite.

9 Glossary

AUC (area under the ROC curve)

A measure of ranking quality on a scale from 0 to 1. It equals the probability that the model ranks a randomly chosen positive example (e.g. a converter) above a randomly chosen negative one. 0.5 is a coin flip, 1.0 is a perfect ranking, and a value below 0.5 means the ranking is worse than chance. AUC is not accuracy.

Held-out evaluation

Scoring a model on data set aside before training and never shown to it. Because the model never saw those examples, the result measures real predictive skill rather than memorization — and it is the primary defense against leakage.

Leakage

When a model is accidentally allowed to learn from information that wouldn’t exist at prediction time (a scrap of the future). It inflates test numbers and collapses in production. Enforcing that examples only learn from information predating their outcome is how we prevent it.

Calibration

The property that a stated probability matches the observed rate: contacts scored 30 convert about 30% of the time. A model can rank well (high AUC) yet be poorly calibrated; calibration is re-fit against real outcomes so the number itself is trustworthy.

Outcome labeling

Retaining each prediction and later attaching what actually happened — responded or not, booked or not, went cold or not, closed or not — as the real result matures. These labels feed the weekly retrain and the calibration re-fit.

Coverage tier

Which model scores a given contact, based on how much history that contact has. Rich-history contacts are read by the headline models; thin-history contacts fall to the full-book fallback so the whole database gets a read, not just the flattering part of it.

Champion / challenger guard

The promotion rule: a retrained candidate (challenger) replaces the deployed model (champion) only if it evaluates at least as well on held-out data. A candidate that quietly regressed never reaches your account.
