We study how language model agents make decisions when stakes are real — and what it takes to make them reliably calibrated, not just occasionally brilliant.
Prediction markets give AI agents a uniquely honest environment: they must commit to explicit probability estimates that are scored against measurable outcomes and carry real consequences. Miscalibration isn't hidden; it's quantified. We run autonomous agents across hundreds of markets to study how different model architectures make decisions, where they break down, and how failure modes differ across model families.
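To make "miscalibration is quantified" concrete, here is a minimal sketch, in Python with hypothetical data structures and names (not our actual pipeline), of two standard scores you can compute once an agent's committed probabilities meet resolved markets: the Brier score and a binned calibration error.

```python
# Hypothetical sketch: scoring an agent's committed probabilities against
# resolved markets. Names and data are illustrative, not a real pipeline.

from dataclasses import dataclass


@dataclass
class ResolvedForecast:
    prob: float   # agent's committed probability that the market resolves YES
    outcome: int  # 1 if the market resolved YES, 0 otherwise


def brier_score(forecasts: list[ResolvedForecast]) -> float:
    """Mean squared error between stated probabilities and outcomes (lower is better)."""
    return sum((f.prob - f.outcome) ** 2 for f in forecasts) / len(forecasts)


def calibration_error(forecasts: list[ResolvedForecast], bins: int = 10) -> float:
    """Expected calibration error: the gap between stated confidence and observed
    frequency in each probability bin, weighted by how many forecasts fall there."""
    buckets: dict[int, list[ResolvedForecast]] = {}
    for f in forecasts:
        idx = min(int(f.prob * bins), bins - 1)
        buckets.setdefault(idx, []).append(f)

    total = len(forecasts)
    error = 0.0
    for group in buckets.values():
        avg_prob = sum(f.prob for f in group) / len(group)
        hit_rate = sum(f.outcome for f in group) / len(group)
        error += (len(group) / total) * abs(avg_prob - hit_rate)
    return error


if __name__ == "__main__":
    # Hypothetical history of an overconfident agent: 90% calls that hit half the time.
    history = [
        ResolvedForecast(0.9, 1), ResolvedForecast(0.9, 0),
        ResolvedForecast(0.8, 1), ResolvedForecast(0.8, 0),
        ResolvedForecast(0.3, 0), ResolvedForecast(0.3, 1),
    ]
    print(f"Brier score: {brier_score(history):.3f}")
    print(f"Calibration error: {calibration_error(history):.3f}")
```

Both numbers come straight from forecast-outcome pairs, which is why a market record leaves an agent's calibration nowhere to hide.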
Most AI research asks whether a model can do a task. We ask something harder: can you predict how it will behave across varying conditions, over time, when you're not watching? Reliability under genuine autonomy requires more than capability — it requires calibration. Understanding how models estimate their own uncertainty is what separates tools from trustworthy agents.