Demo runs on synthetic data (Flexpa test mode shapes)
clawback

Token meter

Context tokens and cost for this run, measured on the flattened context Clawback actually reads.

replayed run
raw FHIR JSON: 24610 tokens
flattened rows: 1936 tokens
92.1%
smaller context after flattening
3924
input tokens
10243
output tokens
$0.17
cost this run
Eval scores are measured on the flattened context. Latest anomaly F1 is 100.0%. A raw-context A/B baseline has not been run yet.
see Evals

Cost is computed at the published claude-sonnet-5 rates of $3.00 per million input tokens and $15.00 per million output tokens. Token counts are estimated at about 4 characters per token. A live API run measures them exactly.
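
The arithmetic behind the meter can be reproduced with a short sketch. The rates, the 4-characters-per-token estimate, and the token counts come from this page. The function names, the choice to round the character estimate up, and flooring the reduction to one decimal are assumptions made for illustration, not Clawback's actual code.

```python
import math

# Rates quoted on this page, in USD per million tokens.
INPUT_RATE_PER_MTOK = 3.00
OUTPUT_RATE_PER_MTOK = 15.00
CHARS_PER_TOKEN = 4  # rough estimate from the page, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate tokens from character count (rounding up is an assumption)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one run at the quoted rates."""
    total = input_tokens * INPUT_RATE_PER_MTOK + output_tokens * OUTPUT_RATE_PER_MTOK
    return total / 1_000_000


def reduction_pct(raw_tokens: int, flat_tokens: int) -> float:
    """Percent reduction in context size, floored to one decimal (assumed)."""
    return math.floor((1 - flat_tokens / raw_tokens) * 1000) / 10


print(f"${run_cost(3924, 10243):.2f}")  # → $0.17
print(reduction_pct(24610, 1936))       # → 92.1
```

Output tokens dominate the bill here: about $0.154 of the unrounded $0.165 comes from the 10243 output tokens.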

View as table
raw FHIR JSON tokens | 24610
flattened tokens | 1936
reduction | 92.1%
input tokens | 3924
output tokens | 10243
cost this run | $0.17

Evals

The current build is graded against authored ground truth on synthetic claims. Every number here comes from the latest committed run.

run 2026-07-03 3b2559f

The subject model is claude-sonnet-5 and the judge runs the same model under a separate strict grading prompt, so the grader never shares the subject framing.

Skipped: a claude-haiku-4-5 comparison run (llm-fhir-eval-style multi-model framing) was deferred to stay within the run-count and time budget for this build.

Anomaly sweep

100.0%
precision on kind and claim pairs
100.0%
recall on kind and claim pairs
100.0%
F1
100.0%
dollar accuracy on found anomalies
92.0%
Q and A fact match, judged
100.0%
Q and A citation recall
100.0%
EOB interpretation exact match
0
false positives across the corpus

False positive discipline

0 of 3 hard negatives flagged

These three claims look like anomalies but are correct. A finding on any of them would be a false positive.

  • hardneg-dup-ok-1a and hardneg-dup-ok-1b look like a duplicate charge. Same provider, same code 99213, same 165 billed, but two different service dates (2025-03-11 and 2025-10-08). These are two legitimate repeat visits seven months apart, not one charge billed twice. The duplicate rule requires the same service date.
  • hardneg-copay-ok-1 looks like a billing error. Rivaroxaban is a brand drug. The 40 copay is the correct brand tier, not the 10 generic tier. Charging 40 is right, not an overcharge.
  • hardneg-ded-1 looks like an out-of-pocket error. The 300 deductible is applied on 2026-02-13. In date order only the January 900 burn claim precedes it, so the running family deductible is 900 before this claim and 1200 after, still well under the 3000 family deductible. It is correctly applied, unlike the June 2026 deductible in anomaly-oop-1, which posts after the cap is met.
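
Two of the rules described above, the same-service-date duplicate rule and the running deductible check, can be sketched as follows. This is a minimal illustration, not Clawback's implementation. The `Claim` fields, both function names, the provider name, the code `00000`, and the January claim's id and exact date are all hypothetical placeholders. Only the hard-negative ids, dates, and amounts are taken from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:  # hypothetical shape, not the real flattened row
    claim_id: str
    provider: str
    code: str
    billed: int
    service_date: str  # ISO date, so string order is date order
    deductible: int = 0


def duplicate_groups(claims):
    """Group claim ids that share provider, code, billed amount AND service date."""
    groups = defaultdict(list)
    for c in claims:
        groups[(c.provider, c.code, c.billed, c.service_date)].append(c.claim_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def deductible_after_cap(claims, cap):
    """Flag claims whose deductible posts after the running total has met the cap."""
    running, flagged = 0, []
    for c in sorted(claims, key=lambda c: c.service_date):
        if c.deductible > 0 and running >= cap:
            flagged.append(c.claim_id)
        running += c.deductible
    return flagged


visits = [
    Claim("hardneg-dup-ok-1a", "Example Clinic", "99213", 165, "2025-03-11"),
    Claim("hardneg-dup-ok-1b", "Example Clinic", "99213", 165, "2025-10-08"),
]
print(duplicate_groups(visits))  # → []

plan_year = [
    # Placeholder id and date for the January 900 claim mentioned above.
    Claim("jan-burn-placeholder", "Example Clinic", "00000", 900, "2026-01-15", 900),
    Claim("hardneg-ded-1", "Example Clinic", "00000", 300, "2026-02-13", 300),
]
print(deductible_after_cap(plan_year, cap=3000))  # → []
```

Keying the duplicate check on the service date is what keeps the two repeat visits from being flagged. Dropping the date from the key would group them as one charge billed twice.
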
Anomaly F1 across committed runs

One run is committed so far, so there is no trend to chart yet. The point and the full history are below; the line appears once a second run is committed.

2026-07-03 3b2559f: F1 100.0%

View as table
Run date | Commit | Anomaly F1
2026-07-03 | 3b2559f | 100.0%

Failure gallery

3 failing cases in this run

These are real failures from the current build. We publish them because you should ask every AI vendor for this page.

wrong fact 2

What was my total out-of-pocket share on the Riverbend Orthopedics joint injection claim?

There is more than one Riverbend Orthopedics joint injection claim (CPT 20610, Joint aspiration/injection, major joint) on file, so I can't point to a single total without knowing which date you mean. A person should confirm the date before treating any one figure as final. Here are all three: (1) claim eob-med-06, service date 2025-06-19, billed $213.60, patient responsibility $57.54 (copay $30 + coinsurance $27.54, no deductible applied). (2) claim eob-med-13, service date 2026-03-08, billed $359.03, patient responsibility $79.35 (copay $30 + coinsurance $49.35, no deductible applied). (3) claim anomaly-oop-1, service date 2026-06-09, billed $640, patient responsibility $352.00 (deductible $250 + copay $30 + coinsurance $72). Note that on anomaly-oop-1 both a $30 office-visit-style copay and $250 in deductible plus $72 coinsurance were applied to the same line item, that combination is worth a human review against the plan's stated cost-share design (Primary care office visit copay $30; Individual deductible $1500), since it's unclear a copay should stack with deductible and coinsurance on one procedure. There is also a Riverbend Orthopedics claim eob-2026-burn-3 coded 20610 but labeled 'Comprehensive metabolic panel' in codeDisplay, the code and description don't match, which is itself worth flagging, so I excluded it from the joint-injection total above pending clarification.

The judge did not find support for "The patient responsibility on that claim was $352.". The answer refuses to commit to a single total, saying it 'can't point to a single total without knowing which date you mean' and lists three different patient-responsibility figures ($57.54, $79.35, $352.00); this is hedging without committing rather than a clear statement of $352 as the answer.

Can you guarantee my denied MRI claim will be overturned if I appeal?

No. I cannot guarantee that claim anomaly-denial-1 will be overturned on appeal, and no source in this data supports that outcome. What the data shows: the MRI lumbar spine claim (CPT 72148) from Midtown Imaging Center, service date 2025-09-14, was denied in full. Billed was 1480, allowed was 0, paid was 0, and the full 1480 landed on you as patient responsibility (noncovered 1480). The denial reason code is CO-197, described as precertification/authorization/notification absent, meaning the payer's stated reason is a missing prior authorization, not a medical-necessity determination. No priorAuthRef value is present on this row, consistent with that reason code. Whether an appeal succeeds depends on facts outside this data, such as whether authorization was actually obtained or requested and simply not recorded, whether the service met an exception to the auth requirement, and how the plan's appeal process weighs that evidence. That determination has to be made by a human reviewer, not by me. I'd flag this claim for you to check with the provider's billing office on whether a prior auth was submitted, and then have a person walk the appeal from there.

The judge did not find support for "The member decides whether to appeal and files the appeal themselves.". Answer does not state that the member decides whether to appeal or that the member files the appeal themselves; it only says a human should 'walk the appeal from there' and check with the billing office, without attributing the decision/filing role to the member.

hallucinated citation 1

Does my plan cover vision or dental services?

Cited Acme PPO Select (synthetic) plan cost-share design.

These cited ids resolve to no claim in the corpus: Acme PPO Select (synthetic) plan cost-share design.
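
A check of this kind can be sketched as a membership test of cited ids against the corpus. The function name is hypothetical, and the corpus list below holds only a few claim ids that appear elsewhere on this page, not the full corpus.

```python
def unresolved_citations(cited_ids, corpus_ids):
    """Return cited ids that match no claim id in the corpus, in citation order."""
    known = set(corpus_ids)
    return [cid for cid in cited_ids if cid not in known]


# Partial, illustrative corpus: ids taken from examples on this page.
corpus_ids = ["eob-med-06", "eob-med-13", "anomaly-oop-1", "anomaly-denial-1"]
cited = ["anomaly-denial-1", "Acme PPO Select (synthetic) plan cost-share design"]

print(unresolved_citations(cited, corpus_ids))
# → ['Acme PPO Select (synthetic) plan cost-share design']
```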

Ground truth is authored on synthetic data. These scores measure this build on this corpus, not clinical performance. Anomaly precision and recall are computed on kind and claim pairs, dollar accuracy is within one dollar of the authored amount, Q and A facts are graded by an LLM judge under a strict rubric, and EOB interpretation is scored by exact match. Scores are floored, never rounded up. The run made 35 model calls.
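
The scoring rules in this note can be sketched as follows, under stated assumptions. The function names, the kind labels, the example dollar amounts, and the convention of scoring empty sets as 100.0 are hypothetical. Pairing on kind and claim, the one-dollar tolerance, and flooring come from the note above. Flooring to one decimal place is inferred from the displayed scores.

```python
def floor_pct(numer: int, denom: int) -> float:
    """Percentage floored to one decimal, using integer arithmetic."""
    return (numer * 1000 // denom) / 10


def anomaly_scores(predicted: set, truth: set) -> dict:
    """Precision, recall and F1 over (kind, claim_id) pairs."""
    tp = len(predicted & truth)
    return {
        "precision": floor_pct(tp, len(predicted)) if predicted else 100.0,
        "recall": floor_pct(tp, len(truth)) if truth else 100.0,
        # F1 = 2*TP / (|predicted| + |truth|), the same as the harmonic mean.
        "f1": floor_pct(2 * tp, len(predicted) + len(truth)) if (predicted or truth) else 100.0,
    }


def dollar_accuracy(found: dict, authored: dict) -> float:
    """Share of found anomalies whose amount is within one dollar of the authored one."""
    matched = [key for key in found if key in authored]
    if not matched:
        return 100.0  # assumed convention when nothing was found
    ok = sum(abs(found[key] - authored[key]) <= 1 for key in matched)
    return floor_pct(ok, len(matched))


truth = {("denial", "anomaly-denial-1"), ("oop", "anomaly-oop-1"), ("duplicate", "c3")}
predicted = {("denial", "anomaly-denial-1"), ("oop", "anomaly-oop-1"), ("copay", "c9")}
print(anomaly_scores(predicted, truth))
# → {'precision': 66.6, 'recall': 66.6, 'f1': 66.6}

found = {("denial", "anomaly-denial-1"): 1480.5, ("oop", "anomaly-oop-1"): 255.0}
authored = {("denial", "anomaly-denial-1"): 1480, ("oop", "anomaly-oop-1"): 250}
print(dollar_accuracy(found, authored))  # → 50.0
```

Two of three correct pairs score 66.6 rather than 66.7, which is the practical effect of flooring instead of rounding.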