Reviewer queue
When the verification agent's confidence is below your threshold, the task moves to human_needed and shows up in your reviewer queue at review.tobeverified.com. A human picks a verdict; your verify() call returns with the result.
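The routing rule above can be sketched in a few lines of Python. This is only an illustration, not the service's implementation: `AgentResult`, `route`, and the 0.85 default are hypothetical, and treating a confidence exactly equal to the threshold as "not below" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Hypothetical container for the agent's first-pass output."""
    verdict: str
    confidence: float


def route(result: AgentResult, confidence_threshold: float = 0.85) -> str:
    """Return the next task status under the rule described above.

    Confidence strictly below the threshold goes to a human; anything
    else is treated as completed by the agent (an assumption).
    """
    if result.confidence < confidence_threshold:
        return "human_needed"
    return "completed"
```

With these assumptions, `route(AgentResult("approve", 0.60))` yields `human_needed`, while a confidence of 0.90 completes without human review.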
Two modes
| Mode | Who sees it | Tasks shown |
|---|---|---|
| demo | Anyone (no sign-in) | A curated set of seeded examples (refund, fraud, PII, etc.). Verdicts go to a separate demo_verdicts table and never affect any real account. |
| authed | Signed in as the account owner | Only your account's human_needed tasks. Verdicts are attributed via reviewer_user_id. |
What the reviewer sees
For each task in the queue, the reviewer page shows:
- The original prompt
- The full context (JSON-rendered)
- The agent's first-pass verdict, confidence, and rationale
- A button per allowed verdict, plus an optional rationale field
After a verdict is submitted
1. The task transitions to completed with final_tier='human' and final_verdict set to the chosen value.
2. A new attempt row is recorded with the reviewer's user id, the verdict, and the rationale.
3. The next poll from the SDK returns the completed task.
4. The verdict feeds the calibration loop (planned: per-account threshold auto-tune).
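As a rough model of the state change described in the steps above, the sketch below applies a reviewer's verdict to an in-memory task. The `Task` and `Attempt` classes, the `submit_verdict` function, and the example verdict names are all hypothetical. Only the field names `final_tier`, `final_verdict`, and `reviewer_user_id` and the statuses `human_needed` and `completed` come from this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attempt:
    """Hypothetical shape of one recorded attempt row."""
    reviewer_user_id: str
    verdict: str
    rationale: Optional[str]


@dataclass
class Task:
    """Hypothetical in-memory task; the real schema may differ."""
    status: str = "human_needed"
    final_tier: Optional[str] = None
    final_verdict: Optional[str] = None
    allowed_verdicts: tuple = ("approve", "reject")  # example values only
    attempts: list = field(default_factory=list)


def submit_verdict(task: Task, reviewer_user_id: str, verdict: str,
                   rationale: Optional[str] = None) -> Task:
    """Apply a human verdict: record the attempt, then complete the task."""
    if task.status != "human_needed":
        raise ValueError("task is not awaiting human review")
    if verdict not in task.allowed_verdicts:
        raise ValueError(f"verdict {verdict!r} is not allowed for this task")
    # Record who decided what, and why.
    task.attempts.append(Attempt(reviewer_user_id, verdict, rationale))
    # Mark the task as completed by the human tier.
    task.status = "completed"
    task.final_tier = "human"
    task.final_verdict = verdict
    return task
```

Rejecting a second submission on an already completed task is a design choice in this sketch, not documented behavior.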
Default reviewer pool
The default reviewer is you (and anyone else signed in to your account). This is useful for evaluation, demos, and small-scale development. For production-scale review, route to a sink (GitProduct, Fiverr, your own Slack, an internal team); see Task lifecycle for sink callbacks.
Demo mode
Curious how the reviewer experience works without sending any real tasks? Open the reviewer queue in an incognito window. The demo queue has 6 illustrative examples (refund classification, PII leak, fraud screen, content moderation, intent routing, tool-call validation). Grade them; verdicts persist as eval signal but don't affect any real account.
Threshold tuning
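Your confidence_threshold controls how aggressively tasks route to humans. Because tasks route to a human when the agent's confidence falls below the threshold, a higher (tighter) threshold means more human review: higher quality, but higher cost and slower. A lower (looser) threshold means more agent autonomy: lower cost and faster, but more risk.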
Recommended starting point: 0.85. Track the human-rate at app.tobeverified.com and tune from there. Per-account auto-tuning is on the roadmap.
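To get a feel for the trade-off before changing the setting, you could replay past agent confidences against candidate thresholds. The sketch below is a local illustration: `human_rate` is a hypothetical helper, the sample confidences are made up, and it assumes a task routes to a human when its confidence is strictly below the threshold.

```python
def human_rate(confidences: list[float], threshold: float) -> float:
    """Fraction of tasks that would route to human review at this threshold."""
    if not confidences:
        return 0.0
    routed = sum(1 for c in confidences if c < threshold)
    return routed / len(confidences)


# Made-up sample of first-pass confidences, for illustration only.
sample = [0.95, 0.90, 0.80, 0.60]

for candidate in (0.70, 0.85, 0.99):
    print(f"threshold={candidate:.2f} human_rate={human_rate(sample, candidate):.2f}")
```

On this sample, raising the threshold from 0.70 to 0.85 doubles the share of tasks sent to humans, from 0.25 to 0.50.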