2025

Rebutly

AI-refereed debate platform

Role: Founder & engineer
Timeline: 2025
Stack: Next.js · Hono · TypeScript · Zod · DeepSeek · Perplexity Sonar · PostgreSQL · LiveKit
Links: Live (rebutly.com)

Context

The problem

Online arguments have no referee. Rebutly scores debate claims and rebuttals with an LLM — but raw LLM output is unreliable as application data: free-text scores drift, formats break, and a UI can't be built on top of prose. The hard problem wasn't prompting; it was making model output trustworthy enough to render, rank, and persist — at a unit cost that survives a free tier.

Ownership

My role

Solo build — product, scoring pipeline, realtime layer, payments, and frontend.

Solution

Approach

Schema-first LLM pipeline

Every model call is contracted by a Zod schema: scores are clamped 0–100 with defaults, responses must parse as pure JSON, and a validation failure throws — which hands the call to a job queue that retries with exponential backoff. User content is embedded in XML-structured prompts with angle brackets sanitized to fullwidth Unicode, so debate text can't inject instructions. The application layer only ever sees typed, validated scoring objects — never raw model text.
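A minimal, dependency-free sketch of that contract, for illustration only. The production pipeline uses Zod; the hand-rolled validator here is a stand-in that mimics clamp-with-default behavior. The names `TurnScore`, `sanitizeForPrompt`, `parseTurnScore`, and `backoffDelayMs`, the default score of 50, and the backoff timings are all assumptions, not values from the real codebase.

```typescript
// Hypothetical shape of one validated scoring object.
interface TurnScore {
  tone: number;
  logic: number;
  relevance: number;
}

const DEFAULT_SCORE = 50; // assumed default for a missing or malformed score

// Swap ASCII angle brackets for fullwidth look-alikes (U+FF1C, U+FF1E)
// so user text cannot open or close XML-style tags in the prompt.
function sanitizeForPrompt(text: string): string {
  return text.replace(/</g, "\uFF1C").replace(/>/g, "\uFF1E");
}

// Clamp to 0..100, falling back to the default for non-numeric input.
function clampScore(value: unknown): number {
  if (typeof value !== "number" || !Number.isFinite(value)) {
    return DEFAULT_SCORE;
  }
  return Math.min(100, Math.max(0, value));
}

// Accept only pure JSON objects; anything else throws so the caller
// (a job queue, in the text above) can schedule a retry.
function parseTurnScore(raw: string): TurnScore {
  const data: unknown = JSON.parse(raw); // throws SyntaxError on prose
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("expected a JSON object");
  }
  const rec = data as Record<string, unknown>;
  return {
    tone: clampScore(rec["tone"]),
    logic: clampScore(rec["logic"]),
    relevance: clampScore(rec["relevance"]),
  };
}

// Exponential backoff: base * 2^attempt, capped at a maximum delay.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

With these assumed defaults, a response such as `{"tone": 140, "logic": -5}` would come back as tone 100, logic 0, and relevance 50, while a reply that wraps the JSON in prose would throw and be retried.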

Two models, three verification tiers

DeepSeek handles scoring and claim extraction, with thinking mode switched on for reasoning calls and off for generation because the two modes are priced differently. Perplexity Sonar fact-checks extracted claims against live web sources, returning a verdict (supported / mixed / disputed / unverifiable), confidence, and citations. Debaters choose a verification tier (tone, logic, or full fact-check), and a completed verification at a higher tier is reused instead of re-billed.
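The tier-reuse rule can be sketched as a simple rank comparison. This assumes the three tiers are strictly ordered and that a higher tier's result covers everything a lower tier would return; the `Tier`, `Verification`, and `findReusableVerification` names are hypothetical.

```typescript
type Tier = "tone" | "logic" | "fact";

// Assumed ordering: tone < logic < full fact-check.
const TIER_RANK: Record<Tier, number> = { tone: 0, logic: 1, fact: 2 };

interface Verification {
  tier: Tier;
  status: "pending" | "completed" | "failed";
}

// Return a completed verification at the requested tier or higher,
// so the caller can skip a new (billed) model call.
function findReusableVerification(
  existing: Verification[],
  requested: Tier,
): Verification | undefined {
  return existing.find(
    (v) => v.status === "completed" && TIER_RANK[v.tier] >= TIER_RANK[requested],
  );
}
```

Only completed runs qualify here: a pending or failed verification at a higher tier is ignored, so the request falls through to a fresh call.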

Scoring that changes incentives

Each turn gets tone, logic, and relevance scores plus detected logical fallacies. Relevance explicitly penalizes statements that are factually true but dodge the opponent's argument. Ratings use a non-zero-sum Elo variant with split K-factors across tone, logic, and fact dimensions: both debaters can gain rating if both argue well, and both can lose it in a mud fight. Winning multiplies positive gains by 1.5×, so the system rewards quality first and victory second.
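One way to picture the non-zero-sum update is a simplified model, not the production formula. Each debater's rating change depends only on their own dimension scores relative to a baseline, so both sides can move in the same direction. The K-factor values, the baseline of 50, and the `ratingDelta` function are assumptions; only the 1.5× winner multiplier on positive gains comes from the description above. This also drops the expected-score term of classic Elo.

```typescript
interface DimensionScores {
  tone: number; // 0..100
  logic: number; // 0..100
  fact: number; // 0..100
}

// Assumed per-dimension K-factors; the real values are not published here.
const K_FACTORS: DimensionScores = { tone: 8, logic: 16, fact: 16 };
const BASELINE = 50; // assumed "neutral" score
const WIN_MULTIPLIER = 1.5; // from the text: applies to positive gains only

// Rating change for one debater, independent of the opponent's change.
function ratingDelta(scores: DimensionScores, won: boolean): number {
  const raw =
    (K_FACTORS.tone * (scores.tone - BASELINE) +
      K_FACTORS.logic * (scores.logic - BASELINE) +
      K_FACTORS.fact * (scores.fact - BASELINE)) /
    100;
  return won && raw > 0 ? raw * WIN_MULTIPLIER : raw;
}
```

Under these assumed constants, two debaters who both score 80 across the board gain 18 (winner) and 12 (loser), while two who both score 20 each lose 12, winner included.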

Cost engineering for a free tier

An entire debate is scored in one batched model call — every turn in a single request with the token budget scaled to turn count — instead of one call per turn. A daily budget circuit-breaker hard-stops AI spend at a configured cap, and every call lands in a usage ledger with per-provider token and cost breakdowns down to fractions of a cent. A full multi-model verification costs about a cent.
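A rough sketch of the two cost controls, under stated assumptions. Costs are stored as integer micro-dollars so six-decimal amounts add up without floating-point drift, which is my choice for the example and not necessarily how the real ledger stores them. `UsageEntry`, `DailyBudgetBreaker`, `maxTokensForDebate`, and every numeric constant are hypothetical.

```typescript
// Hypothetical ledger row; the real ledger tracks more token categories.
interface UsageEntry {
  day: string; // e.g. "2025-06-01"
  provider: string;
  model: string;
  totalTokens: number;
  costMicroUsd: number; // 1 USD = 1,000,000 micro-dollars
}

class DailyBudgetBreaker {
  private readonly capMicroUsd: number;
  private readonly ledger: UsageEntry[] = [];
  private readonly spentByDay = new Map<string, number>();

  constructor(capMicroUsd: number) {
    this.capMicroUsd = capMicroUsd;
  }

  // True once the day's recorded spend has reached the cap.
  isOpen(day: string): boolean {
    return (this.spentByDay.get(day) ?? 0) >= this.capMicroUsd;
  }

  // Append to the ledger and update the in-memory daily total.
  record(entry: UsageEntry): void {
    this.ledger.push(entry);
    const spent = this.spentByDay.get(entry.day) ?? 0;
    this.spentByDay.set(entry.day, spent + entry.costMicroUsd);
  }

  // Daily spend formatted to six decimals.
  spentUsd(day: string): string {
    return ((this.spentByDay.get(day) ?? 0) / 1e6).toFixed(6);
  }
}

// Token budget for one batched call: fixed overhead plus a per-turn
// allowance, capped at a ceiling. All three numbers are assumptions.
function maxTokensForDebate(
  turnCount: number,
  perTurn = 300,
  overhead = 500,
  ceiling = 8000,
): number {
  return Math.min(ceiling, overhead + perTurn * turnCount);
}
```

A caller would check `isOpen(today)` before each model call and skip or queue the work when it returns true, then call `record` with the provider-reported usage afterwards.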

System

Architecture

A Turborepo monorepo: Next.js frontend, Hono API, shared Zod packages, PostgreSQL with Drizzle, and a PostgreSQL-backed job queue for verification work. Self-hosted LiveKit powers live audio debates, with data channels pushing score updates to spectators in realtime.
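As an illustration of the realtime score push, here is a hedged sketch of a message a spectator client might receive. The `ScoreUpdate` shape and both function names are hypothetical. The example stops at the JSON string: converting it to bytes and publishing it over a LiveKit data channel is left out.

```typescript
// Hypothetical payload for one scored turn; field names are illustrative.
interface ScoreUpdate {
  type: "score_update";
  debateId: string;
  turnIndex: number;
  tone: number;
  logic: number;
  relevance: number;
}

function encodeScoreUpdate(update: ScoreUpdate): string {
  return JSON.stringify(update);
}

// Validate on receipt as well: spectators should never render a
// payload that does not match the expected shape.
function decodeScoreUpdate(payload: string): ScoreUpdate | null {
  let data: unknown;
  try {
    data = JSON.parse(payload);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { type, debateId, turnIndex, tone, logic, relevance } =
    data as Record<string, unknown>;
  if (type !== "score_update") return null;
  if (typeof debateId !== "string") return null;
  if (
    typeof turnIndex !== "number" ||
    typeof tone !== "number" ||
    typeof logic !== "number" ||
    typeof relevance !== "number"
  ) {
    return null;
  }
  return { type, debateId, turnIndex, tone, logic, relevance };
}
```

Returning `null` instead of throwing keeps a single malformed message from breaking the spectator view; the receiving side simply ignores it.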

Zod schemas as the single contract between model and application — validation failure triggers queued retry
Batched scoring: all debate turns evaluated in one model call
Per-call AI usage ledger: provider, model, cached/reasoning/citation tokens, cost to 6 decimals
Daily AI budget circuit-breaker with in-memory caching
Live audio debates over self-hosted LiveKit with realtime score pushes
Paddle subscriptions with per-tier usage caps

Results

Outcomes

100%

of rendered scores schema-validated

1 call

scores an entire debate, every turn batched

~1¢

for a full two-model fact-checked verification

0

free-text model output reaching the UI

Live at rebutly.com.
The schema-first pattern generalizes to any LLM-backed product — it's the same discipline I bring to client AI work.
Cost ledger + circuit breaker means the free tier can't bankrupt the platform.

Rebutly landing page headlined Debate anything, Get a verdict, with a claim input and how-it-works steps — rebutly.com — post a claim, challenge anyone, and let AI call the verdict

Rebutly claim page for a free-will thesis with a comment thread and a verified debate in the sidebar — Claim page — open theses gather comments and one-tap challenges

Debate turn scored 73 of 100 with relevance, logic and tone bars and two detected fallacies — Every turn scored 0–100 on relevance, logic and tone — fallacies flagged inline

Debate result screen showing a 181 to 212 score, a declared winner, and split tone and logic Elo bars. Verdict: non-zero-sum Elo split across tone, logic and fact dimensions

Rebutly landing page on a phone with the Debate anything headline and claim input — On a phone — the pitch and claim input

Phone screen scoring a debate turn 82 of 100 with relevance, logic and tone bars and a flagged hasty-generalization fallacy — AI scores every turn — fallacies flagged inline

Phone screen showing a fact-check verdict marked supported with high confidence and a list of cited web sources — Fact-check — claims verified against cited web sources


Contact

Let's build something great.

Open to Senior, Staff, and Founding Engineer roles — and select freelance engagements. Reply within 24 hours, CET timezone.
