AI-Trader introduces the first fully automated, live, contamination-free evaluation benchmark for LLM agents in financial decision-making. Unlike prior benchmarks that rely on historical data (which risks contamination through LLM training sets), AI-Trader tests agents in real-time markets with live data streams.
The benchmark spans three major financial markets—U.S. stocks (NASDAQ-100), Chinese A-shares (SSE 50), and cryptocurrencies (10 major assets)—with both daily and hourly trading frequencies. Six mainstream LLMs are evaluated under identical conditions with $10,000 starting capital.
Key finding: General intelligence does not translate to trading capability. Most agents exhibited poor returns and inadequate risk management. The study reveals that risk control capacity—not raw intelligence—determines consistent performance across different market regimes. AI strategies achieve excess returns more readily in highly liquid markets than in policy-driven environments.
Imagine giving the smartest AI models $10,000 and asking them to trade stocks, just like a human day trader would. No cheat sheets, no insider information—they have to search for news, analyze data, and make their own decisions in real-time. This study did exactly that across US stocks, Chinese stocks, and crypto. The surprising result? Being "smarter" at general tasks (like writing or coding) doesn't mean an AI is good at trading. The AIs that did best weren't necessarily the most intelligent—they were the ones that knew when NOT to trade and how to avoid big losses.
Existing financial AI benchmarks suffer from a critical flaw: data contamination. Large language models are trained on massive internet corpora that likely include historical financial data, analyst reports, and even retrospective market analyses. When tested on that same historical data, models may simply be "remembering" rather than "reasoning."
AI-Trader solves this by testing on live, current market data that couldn't possibly be in any training set.
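To make that guarantee concrete, a live benchmark can simply refuse any data point that predates a model's training cutoff. The sketch below is illustrative only; the cutoff dates and the `assert_uncontaminated` helper are hypothetical, not part of AI-Trader:

```python
from datetime import datetime, timezone

# Hypothetical training cutoffs; real values come from each model card.
TRAINING_CUTOFFS = {
    "model-a": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "model-b": datetime(2024, 10, 1, tzinfo=timezone.utc),
}

def assert_uncontaminated(model: str, bar_timestamp: datetime) -> None:
    """Reject any market data that could overlap a model's training corpus."""
    cutoff = TRAINING_CUTOFFS[model]
    if bar_timestamp <= cutoff:
        raise ValueError(
            f"{model}: bar at {bar_timestamp:%Y-%m-%d} predates "
            f"training cutoff {cutoff:%Y-%m-%d}; evaluation would be contaminated"
        )

# Live evaluation: data arrives strictly in the present, so the check
# passes trivially -- which is exactly the benchmark's point.
assert_uncontaminated("model-a", datetime.now(timezone.utc))
```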
AI-Trader implements a strict evaluation protocol in which agents operate with minimal guidance across three market setups:
| Market | Universe | Starting Capital | Characteristics |
|---|---|---|---|
| U.S. Stocks | NASDAQ-100 components | $10,000 USD | High liquidity, 24/5 extended hours |
| A-Shares | SSE 50 index stocks | ¥100,000 | Policy-driven, T+1 settlement |
| Crypto | BTC, ETH, XRP, SOL, ADA, SUI, LINK, AVAX, LTC, DOT | 50,000 USDT | 24/7, high volatility |
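A market setup like the one in the table above could be encoded as a small configuration object. The `MarketConfig` class and its field names below are hypothetical, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MarketConfig:
    name: str
    universe: tuple[str, ...]   # tradable symbols
    starting_capital: float
    currency: str
    settlement: str             # e.g. "T+0" or "T+1"

CRYPTO = MarketConfig(
    name="crypto",
    universe=("BTC", "ETH", "XRP", "SOL", "ADA",
              "SUI", "LINK", "AVAX", "LTC", "DOT"),
    starting_capital=50_000.0,
    currency="USDT",
    settlement="T+0",  # 24/7 market, immediate settlement
)

A_SHARES = MarketConfig(
    name="a-shares",
    universe=("SSE 50 components",),  # placeholder; the real universe is the index list
    starting_capital=100_000.0,
    currency="CNY",
    settlement="T+1",  # shares bought today can only be sold tomorrow
)
```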
The benchmark supports two trading granularities, daily and hourly, to test different trading styles.
Each agent receives identical starting capital, synchronized trading windows, and uniform market data feeds delivered through an MCP toolchain: the Alpha Vantage API for prices and Jina AI for market intelligence retrieval.
All agents operate in sandboxed environments with identical infrastructure.
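On the data side, the price feed is easy to picture. The sketch below pulls a daily close from the real Alpha Vantage `TIME_SERIES_DAILY` endpoint; the `latest_daily_close` helper and the environment-variable name are my own, and AI-Trader's actual MCP tool wrappers may look quite different:

```python
import os
import requests

ALPHA_VANTAGE_URL = "https://www.alphavantage.co/query"

def latest_daily_close(symbol: str) -> float:
    """Fetch the most recent daily close for `symbol` from Alpha Vantage."""
    resp = requests.get(
        ALPHA_VANTAGE_URL,
        params={
            "function": "TIME_SERIES_DAILY",
            "symbol": symbol,
            "apikey": os.environ["ALPHAVANTAGE_API_KEY"],  # assumed env var
        },
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()["Time Series (Daily)"]
    latest_date = max(series)  # ISO dates sort chronologically
    return float(series[latest_date]["4. close"])

print(latest_daily_close("AAPL"))
```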
| Metric | Description | What It Measures |
|---|---|---|
| Annualized Return | Total return normalized to a yearly basis | Raw profitability |
| Sharpe Ratio | Risk-adjusted return (excess return / volatility) | Return per unit of risk |
| Maximum Drawdown | Largest peak-to-trough decline | Worst-case loss scenario |
| Win Rate | Percentage of profitable trades | Decision accuracy |
| Calmar Ratio | Return / Maximum Drawdown | Return relative to tail risk |
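Under common conventions (zero risk-free rate, per-period wins standing in for per-trade win rate), these metrics can be computed from an equity curve as follows. The paper's exact formulas may differ; this is a sketch:

```python
import numpy as np

def evaluate(equity: np.ndarray, periods_per_year: int = 252) -> dict:
    """Compute the table's metrics from an equity curve (portfolio value per period)."""
    returns = np.diff(equity) / equity[:-1]

    ann_return = (equity[-1] / equity[0]) ** (periods_per_year / len(returns)) - 1
    sharpe = np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1)

    running_peak = np.maximum.accumulate(equity)
    max_drawdown = ((running_peak - equity) / running_peak).max()

    win_rate = (returns > 0).mean()  # per-period proxy for per-trade win rate
    calmar = ann_return / max_drawdown if max_drawdown > 0 else float("inf")

    return {"ann_return": ann_return, "sharpe": sharpe,
            "max_drawdown": max_drawdown, "win_rate": win_rate, "calmar": calmar}

# Toy five-day equity curve; the numbers are illustrative only.
print(evaluate(np.array([10_000, 10_100, 9_800, 10_300, 10_500.0])))
```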
The benchmark reveals a striking disconnect between general AI capabilities and financial performance. Models that excel at coding, reasoning, and language tasks often perform poorly at trading. This challenges the assumption that more capable models automatically make better traders.
Agents that performed consistently across all three markets shared one trait: strong risk management. Rather than chasing returns, these agents knew when not to trade and how to avoid large losses.
This suggests that for production trading agents, risk control modules should be prioritized over pure alpha generation.
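One simple way to encode that priority is a risk layer that can veto or shrink the agent's orders before they reach the broker. The thresholds and the `apply_risk_limits` helper below are illustrative, not values or code from the paper:

```python
def apply_risk_limits(target_weight: float,
                      current_drawdown: float,
                      max_position: float = 0.20,
                      drawdown_halt: float = 0.10) -> float:
    """Clamp an agent's desired position before execution.

    target_weight    : fraction of capital the agent wants in one asset
    current_drawdown : portfolio's current peak-to-trough decline
    """
    if current_drawdown >= drawdown_halt:
        return 0.0  # stand aside entirely: knowing when NOT to trade
    # Cap single-asset exposure regardless of the agent's conviction.
    return max(-max_position, min(max_position, target_weight))

# The agent wants 60% in one asset while already 4% below its peak:
print(apply_risk_limits(0.60, current_drawdown=0.04))   # -> 0.20
```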
AI trading strategies achieved excess returns more readily in highly liquid markets (crypto, US large-caps) than in policy-driven environments (A-shares).
| Market Type | AI Performance | Explanation |
|---|---|---|
| Liquid (Crypto, US) | Better | Price reflects information efficiently; technical patterns more reliable |
| Policy-driven (A-shares) | Worse | Government interventions create unpredictable regime changes |
Across the benchmark, the majority of LLM agents showed poor returns and inadequate risk management, exposing critical limitations of current autonomous agents in financial applications.
AI-Trader's findings complement the ProFiT framework:
| Aspect | AI-Trader (This Paper) | ProFiT |
|---|---|---|
| Approach | Benchmark LLM agents as-is | Evolve trading strategies via LLMs |
| Finding | Raw LLMs perform poorly at trading | LLM-evolved code can outperform baselines |
| Implication | Don't use LLMs directly as traders | Use LLMs to improve trading code |
Together, these papers suggest that LLMs are better as strategy developers than as direct traders—they can write and improve trading code, but shouldn't make real-time trading decisions themselves.
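A minimal sketch of that division of labor: the LLM proposes strategy code offline, a deterministic backtest scores it, and only the winning code ever touches the market. Here `llm_propose_strategy` and `backtest` are placeholders, and the loop is a caricature of the general idea, not ProFiT's actual algorithm:

```python
def evolve_strategy(llm_propose_strategy, backtest, generations: int = 10):
    """LLM as strategy developer: it writes code offline; only the code trades.

    llm_propose_strategy(parent_src) -> new strategy source (placeholder)
    backtest(src)                    -> fitness score, e.g. Calmar ratio (placeholder)
    """
    best_src, best_score = None, float("-inf")
    for _ in range(generations):
        candidate = llm_propose_strategy(best_src)  # offline, no live decisions
        score = backtest(candidate)                 # deterministic evaluation
        if score > best_score:
            best_src, best_score = candidate, score
    return best_src  # deploy the code, not the LLM, to the live market
```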
The benchmark implements several safeguards: live-only data streams that rule out training-set contamination, sandboxed agents running on identical infrastructure, and synchronized trading windows so no agent acts on information before another.
AI-Trader establishes the first rigorous benchmark for evaluating LLM agents in live financial markets. The key findings challenge optimistic assumptions about AI trading: general intelligence does not transfer to trading skill; risk control, not raw capability, separates the consistent performers; and excess returns come more easily in liquid markets than in policy-driven ones.
These findings provide critical direction for future research: rather than scaling model intelligence, focus on risk management, market-specific adaptation, and perhaps using LLMs to evolve trading strategies rather than execute them directly.
AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets
Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, Chao Huang — arXiv:2512.10971, December 2025
GitHub Repository: HKUDS/AI-Trader (open-source code and evaluation data)
Live Trading Leaderboard: real-time performance tracking of AI trading agents
Related: ProFiT (Program Search for Financial Trading), LLM-driven evolution of trading strategies and a complementary approach