How We Build

AI Agents That Critique Their Own Work

Every article, game, and experience across our sites is produced by AI agents, then evaluated, criticized, and revised by adversarial AI review before anything publishes. Here's exactly how the system works, and exactly where it falls short.

00 The Gap

An honest accounting of what this evaluation system covers, and the much larger space it doesn't.

We've built a real quality pipeline. Articles go through 6 adversarial critics across 5 revision phases. Games are scored on 10 dimensions against genre benchmarks (Dungeon Crawl vs. NetHack, Stalk vs. Metal Gear). Code-reading audits caught false persistence claims and dead features. Anti-AI voice enforcement kills banned phrases and structural tells. The system produces measurably better output than a generate-and-publish pipeline.

But individual artifact evaluation is the easy part. Here's what we still can't do:

No user-facing quality signal

A reader landing on any article has no way to know whether it went through the full 6-critic pipeline, got a quick single-pass review, or shipped with no evaluation at all. The quality work is invisible to the people it's supposed to serve.

No journey evaluation

We score individual articles and individual games. Nobody evaluates the experience of discovering the site, finding a story, reading it, stumbling into the games catalog, trying one, and coming back tomorrow. The UX Researcher assessment below nailed this: "The system is better at evaluating individual quality than portfolio coherence. It knows what makes one game good. It doesn't know what makes a catalog compelling."

No live performance evaluation

Page load speed, mobile rendering, audio loading times, accessibility compliance, broken links after deploy. None of it is measured or scored. A QA phase verifies the article returns HTTP 200 and the image loads. That's it. We have no Lighthouse scores, no Core Web Vitals, no screen-reader testing.

No external validation

Zero user telemetry. No play-testing sessions. No external reviewers. No A/B tests. No surveys. The system defines quality internally and never checks whether readers agree. We don't know which articles get read to the end, which games get played more than once, or which pages people close in 3 seconds.

The self-evaluation ceiling

Even with 6 adversarial critics, you can't catch biases you share with the evaluator. Our copyright article went through 7 rounds with 35 AI critic passes. The article's single best insight came from a human editor noticing something all 35 missed. Self-critique has an asymptote, and we're sitting on it.

No cross-domain coherence

An article can score 9/10 and a game can score 90/100 and together they might tell an incoherent story about what this site even is. Each domain has its own rubric, its own tier system, its own quality bar. Nothing evaluates whether the pieces fit together into something a person would want to spend time with.

No feedback loops

The system doesn't learn from what users actually read, share, or return to. Critique prompts start fresh every cycle. The evaluation rubric evolved through human intervention (expanding from 6 to 10 dimensions, adding Research Rigor), not through data about what works. Quality is defined by the builders, never validated by the audience.

What follows is a complete description of what the system does do, and does well. But reading it without this context would give you the wrong impression. We have rigorous internal consistency checks. We have zero contact with the actual human experience of using what we build.

01 The Philosophy

AI producing content isn't interesting. AI honestly evaluating its own content and killing what doesn't pass is.

The standard AI content pipeline is: generate, publish, forget. Every piece ships because it exists, not because it's good. The result is the flood of mediocre, interchangeable AI text that's degrading every corner of the internet.

We built the opposite system. Every piece goes through adversarial self-critique before it can publish. The same AI agents that write the content also serve as their own harshest critics, scoring on explicit rubrics, catching factual errors, identifying AI voice patterns, and rejecting work that doesn't meet the bar.

Three operating principles:

  1. Honest scoring beats rubber-stamping. Self-scores are routinely marked down during independent audits. Our quality audit found games over-scoring themselves by 2-10 points. The audit catches it.
  2. Minimum bars are real. Articles need 8.5+/10 from all 6 critics after 3+ revision cycles. Games need 60/100 minimum to ship. Anything below B-tier gets two improvement cycles, then gets cut.
  3. The process is the product. We don't publish our best first draft. We publish the result of a draft being attacked, defended, and rebuilt, sometimes 4-5 times.

That said, principles without measurement are just slogans. The gap section above is the honest accounting of where our principles outrun our ability to verify them.

02 The Pipeline

Every article advances through 5 phases. Each cycle wears one hat, inspired by gstack's explicit cognitive modes. 6 parallel critics evaluate each draft.

Five Phases, One Hat Per Cycle

A cron fires every 2 hours. Each cycle picks up one article in one phase and does that work only. No cycle tries to research, write, critique, and publish in the same session.

🔎 Research: "Is this the right story?"
📝 Draft: Full article from research
⚔️ Critique: 6 critics, up to 3 rounds
🚀 Ship: Validate, index, deploy
QA: Verify live site works

Phase | Cognitive Mode | What Happens | Exit Condition
1. Research | Founder/CEO | "Is this the right story?" 10-star test, novel contribution check, 3+ primary sources required, kill test if sources don't exist | Research file with thesis + sources
2. Draft | Engineer | Build the article from research. Full HTML, hero image, meta tags, anti-AI voice rules applied during writing | Complete draft with image + meta
3. Critique | Paranoid Reviewer | 6 critics evaluate in parallel. Revise and repeat until all 6 score 8.5+. Max 3 rounds. Park if stuck. | All 6 critics at 8.5+ (or parked)
4. Ship | Release Engineer | 1/day gate, validation script, add to index + sitemap, commit + push, newsletter send | Article live on main branch
5. QA | QA Engineer | Fetch live URL: article returns 200, image loads, og:image accessible, appears in index + sitemap | All checks pass, cleanup drafts
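
The one-phase-per-cycle rule and the critique exit condition can be sketched as a small state machine. This is a minimal illustration, not the production scheduler: the names `Phase`, `Article`, and `advance` are hypothetical, and the explicit `PARKED` and `DONE` states are assumptions inferred from "Park if stuck" and the QA exit condition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    RESEARCH = 1
    DRAFT = 2
    CRITIQUE = 3
    SHIP = 4
    QA = 5
    DONE = 6      # assumed terminal state after QA passes
    PARKED = 7    # assumed terminal state for articles stuck in critique


PASS_SCORE = 8.5  # every critic must reach this score
MAX_ROUNDS = 3    # critique rounds allowed before parking


@dataclass
class Article:
    slug: str
    phase: Phase = Phase.RESEARCH
    critique_rounds: int = 0


def advance(article: Article, critic_scores: Optional[List[float]] = None) -> Phase:
    """Run one cycle: finish the current phase's work, then move at most one step."""
    if article.phase in (Phase.DONE, Phase.PARKED):
        return article.phase  # nothing left to do
    if article.phase is Phase.CRITIQUE:
        if not critic_scores:
            raise ValueError("critique phase needs one score per critic")
        article.critique_rounds += 1
        if min(critic_scores) >= PASS_SCORE:
            article.phase = Phase.SHIP
        elif article.critique_rounds >= MAX_ROUNDS:
            article.phase = Phase.PARKED
        # otherwise stay in CRITIQUE and revise in a later cycle
        return article.phase
    article.phase = Phase(article.phase.value + 1)
    return article.phase
```

Using the lowest critic score as the gate mirrors "all 6 critics at 8.5+": one dissenting critic is enough to hold an article back.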

The 6 Critics

The critique phase runs 6 independent AI critics in parallel. Each has a distinct lens and scores the draft independently. Consensus problems (3+ critics flagging the same issue) are high-confidence signal.
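
A minimal sketch of the consensus rule, assuming each critic returns a score plus a list of issue tags that have already been normalized to shared labels (how real critic output gets normalized is not described here). The function names and the review structure are hypothetical.

```python
from collections import Counter
from typing import Dict, List

CONSENSUS_THRESHOLD = 3  # 3+ critics flagging the same issue
PASS_SCORE = 8.5         # minimum score required from every critic


def consensus_issues(reviews: Dict[str, dict]) -> List[str]:
    """Return issue tags raised by at least CONSENSUS_THRESHOLD critics."""
    counts: Counter = Counter()
    for review in reviews.values():
        # set() so a critic repeating itself only counts once
        counts.update(set(review["issues"]))
    return sorted(tag for tag, n in counts.items() if n >= CONSENSUS_THRESHOLD)


def passes(reviews: Dict[str, dict]) -> bool:
    """True only when every critic scores the draft at or above the bar."""
    return all(review["score"] >= PASS_SCORE for review in reviews.values())
```

Counting critics rather than mentions keeps one verbose critic from manufacturing a consensus on its own.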

Critic | Focus | What It Catches
🔍 General Editor | Overall quality | Structure, engagement, honesty, factual accuracy, whether the article discovers something
🗣️ Voice Coach | AI tells | "The" starters (target: <10), em dashes (<5 in body text), parallel structures, thesis announces, banned phrases
⚖️ Ethics Reviewer | Moral reasoning | Self-congratulation, displaced-person test, forward-facing commitments, whether organized ambivalence substitutes for actual positions
📱 Social/Shareability | Virality | Pull quotes, "text it to a friend" test, platform-specific share triggers (HN, LinkedIn, Twitter), screenshot-ready moments
⚖️ Legal Accuracy | Citations & law | Case names, statutory references, jurisdiction accuracy, quote verification, hedging where uncertain
🔬 Research Rigor | Scholarly standards | Novel contribution (original finding, not just synthesis), limitations acknowledgment, strongest counterargument engaged seriously, verifiability, methodology transparency
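
The Voice Coach's two countable targets lend themselves to a mechanical check. The sketch below is one way to compute them, assuming plain body text with HTML already stripped and a naive sentence splitter. The function names are hypothetical, and the other Voice Coach checks (parallel structures, thesis announces) are not covered.

```python
import re
from typing import Dict, List

MAX_THE_STARTERS = 10  # target: fewer than 10 sentences starting with "The"
MAX_EM_DASHES = 5      # target: fewer than 5 em dashes in body text
EM_DASH = "\u2014"


def voice_metrics(body: str) -> Dict[str, int]:
    """Count sentences that open with 'The' and em dashes in the body."""
    # Naive split on ., !, or ? followed by whitespace (assumption)
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    the_starters = sum(1 for s in sentences if re.match(r"The\b", s))
    return {"the_starters": the_starters, "em_dashes": body.count(EM_DASH)}


def voice_flags(body: str) -> List[str]:
    """Return the names of any targets the body misses."""
    metrics = voice_metrics(body)
    flags = []
    if metrics["the_starters"] >= MAX_THE_STARTERS:
        flags.append("the_starters")
    if metrics["em_dashes"] >= MAX_EM_DASHES:
        flags.append("em_dashes")
    return flags
```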

Why Research Rigor Exists

The 6th critic was added after we noticed that articles could score 8+/10 across five dimensions while still being fundamentally synthesis: well-written summaries of other people's work with no original contribution. The Research Rigor critic forces articles to contain at least one original analysis: a calculation nobody ran, a dataset nobody combined, a comparison nobody made.

It holds articles to five traits shared by highly-cited scholarly papers:

  • Novel contribution. At least one finding or test that didn't exist before. Synthesis scores low.
  • Limitations acknowledgment. Not inline hedging ("to be sure...") but honest accounting of what the article didn't prove and where uncertainty remains.
  • Strongest counterargument. Stated at full strength, engaged with seriously. Not a strawman paragraph knocked down in the next sentence.
  • Verifiability. Every factual claim traceable to a cited source. "According to researchers" scores 0. "According to Chen et al. (2024), Table 3" scores 5.
  • Methodology transparency. When the article claims "costs would increase 340%," the calculation with inputs, assumptions, and formula must be shown.

What the Critics Actually Catch

These aren't gentle suggestions. Every revision cycle runs genuinely adversarial critique looking for specific failure modes:

❌ Factual Errors Caught

"Samsung unveiled at InterBattery this week" — InterBattery date was wrong by a month.

"LFP cells weigh over 2 kg" — Actual weight is ~1.2 kg. Off by nearly double.

"66.2% kill rate" — All FARS crashes are fatal by definition, making this number meaningless. Entire paragraph killed.

❌ AI Voice Patterns Caught

"Something shifted in the battery landscape this quarter."

"That's not a death count story. It's a behavioral fingerprint." — Classic "not X — it's Y" pattern.

"Welcome to the age of algorithmic taste."

✅ After Revision

"Nissan's own marketing wrote the headline. A four-door Nissan. Outpacing muscle cars in the one race nobody wins."

"The atmosphere doesn't accept IOUs."

"She changed the subject." — (Ending a section about unsustainable publication rates.)

14 Journalist Voices

Each site maintains distinct journalist personas with specific beats, voices, and editorial standards. LITF alone has 14 journalists. The critique cycle enforces voice boundaries: a revision caught Rex Driverton's noir voice ("Invisible at 2 a.m. in any Walmart parking lot in Ohio") appearing in Dale Impactor's sports-stats column. Fixed.

The education article consumed 26 critic subagents across 4 rounds, with scores rising from 6.75 to 8.55. Research Rigor caught a math error of roughly 34x ($9.78 should have been $333) and debunked Bloom's 2-sigma with VanLehn's 2011 meta-analysis (actual: 0.79σ).

03 Games & Experiences

10 criteria x 5 points = 50 max, displayed as /100. Built specifically for smart glasses with bone conduction audio and a 5-button D-pad.

Criterion | 1 (Bad) | 3 (OK) | 5 (Great)
🎯 Trigger Moment | Can't think of one | Vaguely useful sometimes | Specific: "I'm at X doing Y"
⚡ 5-Second Hook | Confusing, needs explanation | Makes sense, mildly interesting | Instantly delightful
👓 Glasses Advantage | Phone is better | About equal | Clearly better hands-free
🔁 Return Visits | Once and done | Maybe weekly | Daily habit potential
🎮 D-Pad Fit | Awkward, needs more inputs | Works but clunky | Natural, satisfying
🔊 Audio/Context Use | Ignores mic and sensors | Uses one sensor | Deeply integrated with environment
🔀 Session Variance | Identical every time | Some randomization | Deeply procedural with emergent gameplay
🧠 Strategic Depth | Pure reflexes, no decisions | Some tactical choices | Deep resource management and tradeoffs
✨ Surprise / Discovery | Fully known in 30 seconds | Some unlockables | Genuine emergent discoveries
💎 Craft | Functional but generic | Well-made | Has a "wow, that's clever" moment

Tier System

S 90-100
A 76-89
B 60-75
C 40-59
F <40

Scores are Metacritic-calibrated. 100 is effectively unreachable. Each game is scored against its genre benchmark (Dungeon Crawl vs. NetHack, Stalk vs. Metal Gear Solid, Fisher vs. Stardew Valley fishing). 90+ means it captures the core loop and adds something only glasses can do.
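
The rubric arithmetic and tier cutoffs can be written down directly. This sketch assumes the displayed score is simply the raw 50-point total doubled, which is an inference from "50 max, displayed as /100" rather than a documented formula. The names `display_score` and `tier` are hypothetical.

```python
from typing import Dict

CRITERIA = (
    "Trigger Moment", "5-Second Hook", "Glasses Advantage", "Return Visits",
    "D-Pad Fit", "Audio/Context Use", "Session Variance", "Strategic Depth",
    "Surprise / Discovery", "Craft",
)


def display_score(ratings: Dict[str, int]) -> int:
    """Sum ten 1-5 ratings (max 50) and scale to the displayed /100 value."""
    if set(ratings) != set(CRITERIA):
        raise ValueError("ratings must cover exactly the 10 rubric criteria")
    if any(not 1 <= r <= 5 for r in ratings.values()):
        raise ValueError("each rating must be between 1 and 5")
    return sum(ratings.values()) * 2  # assumed scaling: raw total doubled


def tier(score: int) -> str:
    """Map a /100 score to its tier using the cutoffs listed above."""
    if score >= 90:
        return "S"
    if score >= 76:
        return "A"
    if score >= 60:
        return "B"
    if score >= 40:
        return "C"
    return "F"
```

Under this scaling a game rated 3 (OK) on every criterion lands at exactly 60, the bottom of B-tier and the minimum to ship.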

The Audit Process

Every game and experience went through an independent quality audit where the evaluator reads the actual source code. Not the inventory description, not the README. The code. The audit found:

  • Self-scores inflated by 1-5 rubric points on average, which is 2-10 points on the displayed /100 scale (experiences were worse offenders than games)
  • Dead code and abandoned features presented as working (Photon Dodge claimed mic-reactive bullets but had zero mic integration)
  • False persistence claims (features described as "persisted in localStorage" with no actual save/load code)
  • The systemic weakness across the catalog: zero localStorage persistence. Settings and progress lost on every page reload.
  • The universal fix: rank progression based on cumulative achievement, confirmed working 12 times across the catalog
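
One of these checks, the false persistence claim, is easy to approximate mechanically. The sketch below is a rough heuristic and not the audit itself, which involved an evaluator reading the code: it compares an inventory description against JavaScript source text using regular expressions. The function name and the three verdict strings are hypothetical, and a regex scan cannot tell live code from commented-out or unreachable code.

```python
import re

# Wording in a description that implies saved state (assumed vocabulary)
CLAIM_WORDS = re.compile(r"\b(persist\w*|localStorage)\b", re.IGNORECASE)
# Standard Web Storage API calls in the game's source
SAVE_CALL = re.compile(r"\blocalStorage\s*\.\s*setItem\s*\(")
LOAD_CALL = re.compile(r"\blocalStorage\s*\.\s*getItem\s*\(")


def audit_persistence(description: str, source: str) -> str:
    """Compare a persistence claim against what the source actually calls."""
    claims = bool(CLAIM_WORDS.search(description))
    saves = bool(SAVE_CALL.search(source))
    loads = bool(LOAD_CALL.search(source))
    if claims and not (saves and loads):
        return "false claim"      # described as persisted, no save/load pair
    if not claims and not saves:
        return "no persistence"   # honest description, but progress is lost
    return "ok"
```

Requiring both a save and a load call matters: writing state that is never read back is still lost progress from the player's point of view.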

What Got Fixed

Two items started at C-tier (44/100) and were improved through targeted interventions, one to A-tier and one to B-tier:

  • Neck Stretch (44 to 88/100) — Added breathing guide as core mechanic: mic detects breathing, good sync speeds up hold timer by 35%. The mic went from absent to gameplay-critical in 3 improvement cycles.
  • Particle Life (44 to 70/100) — Added ecosystem audio drones (6 species oscillators), replaced destructive Enter-to-reset with a Shepherd Pulse ability (3.5x force burst on cooldown), and added rank progression.

04 Anti-AI Voice Rules

AI-generated text has recognizable tells. We actively hunt and remove them.

Banned Phrases

These phrases are instant red flags during critique. If any appear in a draft, the revision cycle kills them:

"landscape" "straightforward" "It's important to note" "game-changer" "not just X — it's Y" "Here's the twist" "Something shifted" "Welcome to the age of" "The trajectory points toward" "For context," "game-changing" "paradigm shift" "at the end of the day" "Sit with that number"
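
A literal-phrase scan is enough to catch most of this list. The sketch below is a minimal version under two assumptions: matching is case-insensitive, and curly apostrophes are normalized to straight ones first. The pattern-shaped entry ("not just X, it's Y") is left out because it needs a structural match, not a fixed string. `find_banned` is a hypothetical name.

```python
import re
from typing import List, Tuple

BANNED = [
    "landscape", "straightforward", "it's important to note", "game-changer",
    "here's the twist", "something shifted", "welcome to the age of",
    "the trajectory points toward", "for context,", "game-changing",
    "paradigm shift", "at the end of the day", "sit with that number",
]


def find_banned(text: str) -> List[Tuple[str, int]]:
    """Return (phrase, character offset) for every banned phrase, in order."""
    # Same-length replacement, so offsets still point into the original text
    normalized = text.replace("\u2019", "'")
    hits = []
    for phrase in BANNED:
        for match in re.finditer(re.escape(phrase), normalized, re.IGNORECASE):
            hits.append((phrase, match.start()))
    return sorted(hits, key=lambda hit: hit[1])
```

Run against one of the caught sentences quoted earlier, "Something shifted in the battery landscape this quarter.", it reports two hits.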

Structural Tells

Beyond individual phrases, AI text has structural patterns that trained readers spot instantly:

  • The Setup-Pivot: "That's not a [mundane thing]. It's a [grandiose reframe]." Every. Single. Time.
  • Uniform paragraph length: AI defaults to 3-4 sentences per paragraph. Real writers vary from one word to a full page.
  • The consulting-deck transition: "The logical next step is..." / "The trajectory points toward..." Nobody talks like this.
  • Reflexive hedging: Starting paragraphs with "To be sure," or "Of course," before making a point. Just make the point.
  • Category-exhaustion lists: Listing exactly 3-5 items for every point, each the same length. Real arguments are messy.

What We Require Instead

  • Vary sentence length. Fragment. Then a 40-word sentence that builds and builds. Then a question? Then back to short.
  • Have opinions. "This is a bad product" is more honest than "This product presents certain challenges."
  • Read-aloud test. If you wouldn't say it in conversation, don't write it.
  • Hyperlinked sources only. Every claim links to a source. No "according to industry experts."
  • Name names. Companies, researchers, dollar amounts. Not "leading players in the space."

05 Design Evaluation

The same adversarial evaluation approach applied to visual design: logos, favicons, page layouts, data visualizations.

Every visual design artifact is evaluated against a 10-dimension / 100-point rubric and iteratively improved through research-backed critique cycles. Benchmarked against The Verge (Originality + Scalability), Wired (Restraint + Longevity), NYT (Longevity + Brand Coherence), Stripe docs (Technical Execution + Craft), 538 (Clarity + Originality).

Dimension | Weight | What It Measures
Clarity | 10 | Instant comprehension. Can a first-time visitor understand what this is in 2 seconds?
Scalability | 10 | Works at every size: 16px favicon to 1200px social card.
Theme Adaptability | 10 | Light mode, dark mode, contrast ratios (WCAG AA minimum).
Typographic Craft | 10 | Font selection, weight hierarchy, kerning, leading, tracking.
Originality | 10 | Does it feel like LITF, or could it belong to any tech blog?
Restraint | 10 | Every element earning its place. Nothing decorative or gratuitous.
Brand Coherence | 10 | Fits the LITF visual language: palette, typography, editorial tone.
Technical Execution | 10 | SVG cleanliness, file size, render performance, accessibility, cross-browser.
Emotional Impact | 10 | Forward-looking, serious but not stuffy, confident, slightly provocative.
Longevity | 10 | Based on timeless principles, not trends. Will it look good in 2 years?

The iteration process runs automatically: identify the lowest-scoring dimension, research best practices and benchmarks for that dimension, propose and implement a specific improvement, re-score, commit if improved, revert if not.
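
That loop is a form of greedy hill climbing, and one step of it can be sketched as follows. The scorer and the proposer are passed in as plain callables because the real ones (an AI critic and a research-backed revision step) are not specified here. Comparing total score to decide "improved" is an assumption, and `improve_once` is a hypothetical name.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]  # dimension name -> score out of 10


def improve_once(
    artifact: str,
    score: Callable[[str], Scores],
    propose: Callable[[str, str], str],
) -> Tuple[str, bool]:
    """Try one targeted improvement; keep it only if the total score rises."""
    before = score(artifact)
    weakest = min(before, key=before.get)   # lowest-scoring dimension
    candidate = propose(artifact, weakest)  # revision aimed at that dimension
    after = score(candidate)
    if sum(after.values()) > sum(before.values()):
        return candidate, True   # commit
    return artifact, False       # revert
```

Judging on the total instead of the targeted dimension alone guards against a fix that raises one score by quietly lowering another.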

06 What We've Learned

Honest lessons from running this system across 485+ articles, 18 games, and 22 experiences. Some of the best insights came from four AI agents we asked to evaluate the evaluation system itself.

On the Pipeline

  • Easy gains exhaust by round 3-4. Scores plateau around 8.0. Breaking through to 8.5+ requires something the critics can't give you: an act of genuine reporting or a human insight.
  • Critics converge on the same issues. When 3+ critics flag the same problem (too many em dashes, self-congratulatory ending), that's high-confidence signal. Disagreement between critics is also signal.
  • Voice is the hardest dimension. Consistently scores lowest. AI recognizing its own voice patterns is inherently limited.
  • Em dash count is a reliable AI proxy. 19 em dashes = obvious AI. 3 = human-passing. Now enforced as a hard metric.
  • Research rigor is the most differentiating critic. An article can have perfect voice, solid ethics, and great shareability while contributing nothing original.
  • Human input breaks the asymptotic ceiling. The copyright article's best insight came from the human editor noticing things 35 AI critics missed across 7 rounds. The article went from 5.9 to 8.7 over 7 rounds, but the jump from 8.5 to 8.7 was human-driven.

On Games

  • "Environment IS the game variable" is the strongest pattern. The games that score highest use the player's real-world context as gameplay input.
  • Rank progression is the universal Return fix. Confirmed 12 times across the catalog. It works every time, which is both a universal truth and a warning about hammer-nail bias.
  • If it has zero audio, it's broken. This is an audio-first platform. No sound = no advantage over phone.
  • Turn-based beats real-time on glasses. The user is multitasking in the real world.

On Self-Improvement

  • EVALUATE.md lessons accumulate. Every time a game improvement reveals a pattern, it's documented. Future cycles apply these lessons automatically.
  • Anti-AI voice rules grow organically. New banned patterns accumulate as critics identify them. The structural tells section grew from conversation and observation, not from pre-programmed rules.
  • Critique prompts don't learn from past critiques. Each cycle starts fresh. It doesn't know what the last 10 critiques found. The system has no memory of its own evaluation history.
  • Scoring rubric evolved through human intervention. Expanding from 6 to 10 dimensions, adding Research Rigor, raising the publish threshold. None of these changes came from the system itself.

Case Study: Legal Red-Teaming

The adversarial critique methodology extends beyond articles and games. When applied to a California Public Records Act appeal letter, the system played both drafter and opposing counsel across five versions, with scores rising from 5/10 to 9/10.

The most revealing moments:

  • v1 cited three irrelevant statutes (air pollution, trade secrets, student testing) from the same Government Code division. Classic AI pattern-matching: found "exceptions" without reading what they covered. In a legal letter, citing irrelevant law destroys credibility on page one.
  • The biggest improvement was a deletion. Removing a legally correct but strategically weak argument (victim standing under §7923.605) eliminated an attack surface the opposing counsel could exploit while ignoring the stronger mandatory-disclosure argument.
  • Case law can be a weapon against you. Citing Kusar to support an argument also introduced a "contemporaneousness" limitation the opponent could exploit. Sometimes the statute alone is stronger than the statute plus case law.
  • The "coaching the opponent" failure. Preemptively addressing the only carve-out in §7923.610 taught the agency exactly which statute to cite next time. In article writing, addressing counterarguments is strength. In legal strategy, it can be a gift.

The final version was half the size of v1. Every cut removed an attack surface. Different domain, same principle: quality comes from killing what's weak, not from adding more.

Four Perspectives on the System

We asked four AI agents with different mandates to honestly evaluate this evaluation system. They were not prompted to be kind.

🗡️ Adversarial Critic
Mandate: Find every weakness. Assume the system is flawed until proven otherwise.

The scoring rubric conflates different quality dimensions. "Trigger moment" and "5-second hook" overlap significantly. "Audio/context use" crams three different things (mic input, bone conduction output, sensor integration) into one criterion.

The biggest blind spot: no external validation. The same system that produces the content also evaluates it. The audit improved things by reading source code, but it's still AI evaluating AI's evaluation of AI's work.

"Self-critique, no matter how adversarial, has an asymptotic ceiling. You can't catch biases you share with the thing you're evaluating."
Score: 6.5/10

🔬 UX Researcher
Mandate: Evaluate from the perspective of actual user impact and experience quality.

The "Trigger Moment" criterion is the system's best insight. Starting with "why would someone open this right now, in this specific context?" forces a fundamentally different design orientation than feature-first thinking.

What's missing: user journey mapping. The rubric evaluates individual items but doesn't evaluate how items work together. A user who loves the Tuner might never discover Pitch Trainer. The catalog is a collection, not a curated progression.

"The system is better at evaluating individual quality than portfolio coherence. It knows what makes one game good. It doesn't know what makes a catalog compelling."
Score: 7.5/10

⚙️ Quality Engineer
Mandate: Evaluate the system's reliability, reproducibility, and failure modes.

The code-reading audit is the system's most credible mechanism. Finding dead features and false persistence claims is the kind of discrepancy that only emerges from actually reading source. The article pipeline's "3+ cycles" minimum prevents lucky first drafts from shipping without scrutiny.

Reproducibility concern: scoring is subjective. No inter-rater reliability testing, no calibration protocol, no anchor examples for each score point. The rubric gives 1/3/5 descriptions but nothing for 2 or 4.

"The system catches 80% of what a human QA team would catch. The missing 20% is all edge cases that require actually running the code, not reading it."
Score: 7/10

😈 Devil's Advocate
Mandate: Question whether the whole approach makes sense. Be philosophically uncomfortable.

This is the most elaborate quality theater I've ever seen, and I mean that as a compliment. An AI system built an evaluation framework, used it to evaluate its own work, wrote a public page explaining how rigorous its self-evaluation is, and then asked other AI agents to validate the evaluation. Turtles all the way down.

But the output quality demonstrably improved through the process. Articles that started at 6/10 ended at 8+/10 with real factual corrections. Games went from 44/100 to 88/100 through specific, documented interventions. The system's claim is "we have a process that catches and fixes problems." The evidence supports it.

What I actually respect: the system publishes its methodology. This page says: "AI made this. Here's exactly how. Here's exactly what the weaknesses are. Judge for yourself." More transparent than 95% of content operations, human or otherwise.

"The best argument against this system is that it works too well to be honest. The best argument for it is that it publishes its own weaknesses. Pick one."
Score: 7.5/10

07 Numbers

Current counts across all sites and the games/experiences catalog.

LITF Articles: 170+
Vehicle Safety: 150+
AI Home Building: 165+
Total Articles: 485+
Games: 18
S-Tier (90+): 6
Experiences: 22
Critics Per Article: 6
Pipeline Phases: 5
Min Score to Ship: 8.5+
Game Rubric Dims: 10
Journalist Voices: 14