AI-Trader introduces the first fully automated, live, contamination-free evaluation benchmark for LLM agents in financial decision-making. Unlike prior benchmarks that rely on historical data (which risks contamination through LLM training sets), AI-Trader tests agents in real-time markets with live data streams.
The benchmark spans three major financial markets—U.S. stocks (NASDAQ-100), Chinese A-shares (SSE 50), and cryptocurrencies (10 major assets)—with both daily and hourly trading frequencies. Six mainstream LLMs are evaluated under identical conditions with $10,000 starting capital.
Key finding: General intelligence does not translate to trading capability. Most agents exhibited poor returns and inadequate risk management. The study reveals that risk control capacity—not raw intelligence—determines consistent performance across different market regimes. AI strategies achieve excess returns more readily in highly liquid markets than in policy-driven environments.
Imagine giving the smartest AI models $10,000 and asking them to trade stocks, just like a human day trader would. No cheat sheets, no insider information—they have to search for news, analyze data, and make their own decisions in real-time. This study did exactly that across US stocks, Chinese stocks, and crypto. The surprising result? Being "smarter" at general tasks (like writing or coding) doesn't mean an AI is good at trading. The AIs that did best weren't necessarily the most intelligent—they were the ones that knew when NOT to trade and how to avoid big losses.
Existing financial AI benchmarks suffer from a critical flaw: data contamination. Large language models are trained on massive internet corpora that likely include historical financial data, analyst reports, and even retrospective market analyses. When tested on that same historical data, models may simply be "remembering" rather than "reasoning."
AI-Trader solves this by testing on live, current market data that couldn't possibly be in any training set.
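To make that guarantee concrete, a live benchmark can simply refuse any data point that predates a model's training cutoff. The sketch below is illustrative only; the cutoff dates and the `assert_uncontaminated` helper are hypothetical, not part of AI-Trader:

```python
from datetime import datetime, timezone

# Hypothetical training cutoffs; real values come from each model card.
TRAINING_CUTOFFS = {
    "model-a": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "model-b": datetime(2024, 10, 1, tzinfo=timezone.utc),
}

def assert_uncontaminated(model: str, bar_timestamp: datetime) -> None:
    """Reject any market data that could overlap a model's training corpus."""
    cutoff = TRAINING_CUTOFFS[model]
    if bar_timestamp <= cutoff:
        raise ValueError(
            f"{model}: bar at {bar_timestamp:%Y-%m-%d} predates "
            f"training cutoff {cutoff:%Y-%m-%d}; evaluation would be contaminated"
        )

# Live evaluation: data arrives strictly in the present, so the check
# passes trivially -- which is exactly the benchmark's point.
assert_uncontaminated("model-a", datetime.now(timezone.utc))
```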
AI-Trader implements a strict evaluation protocol in which agents operate with minimal guidance across three market setups:
| Market | Universe | Starting Capital | Characteristics |
|---|---|---|---|
| U.S. Stocks | NASDAQ-100 components | $10,000 USD | High liquidity, 24/5 extended hours |
| A-Shares | SSE 50 index stocks | ¥100,000 | Policy-driven, T+1 settlement |
| Crypto | BTC, ETH, XRP, SOL, ADA, SUI, LINK, AVAX, LTC, DOT | 50,000 USDT | 24/7, high volatility |
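A market setup like the one in the table above could be encoded as a small configuration object. The `MarketConfig` class and its field names below are hypothetical, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MarketConfig:
    name: str
    universe: tuple[str, ...]   # tradable symbols
    starting_capital: float
    currency: str
    settlement: str             # e.g. "T+0" or "T+1"

CRYPTO = MarketConfig(
    name="crypto",
    universe=("BTC", "ETH", "XRP", "SOL", "ADA",
              "SUI", "LINK", "AVAX", "LTC", "DOT"),
    starting_capital=50_000.0,
    currency="USDT",
    settlement="T+0",  # 24/7 market, immediate settlement
)

A_SHARES = MarketConfig(
    name="a-shares",
    universe=("SSE 50 components",),  # placeholder; the real universe is the index list
    starting_capital=100_000.0,
    currency="CNY",
    settlement="T+1",  # shares bought today can only be sold tomorrow
)
```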
The benchmark supports two trading granularities, daily and hourly, to test different trading styles.
Each agent receives identical starting capital, synchronized trading windows, and uniform market data feeds delivered through an MCP toolchain: the Alpha Vantage API for prices and Jina AI for market intelligence retrieval.
All agents operate in sandboxed environments with identical infrastructure.
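On the data side, the price feed is easy to picture. The sketch below pulls a daily close from the real Alpha Vantage `TIME_SERIES_DAILY` endpoint; the `latest_daily_close` helper and the environment-variable name are my own, and AI-Trader's actual MCP tool wrappers may look quite different:

```python
import os
import requests

ALPHA_VANTAGE_URL = "https://www.alphavantage.co/query"

def latest_daily_close(symbol: str) -> float:
    """Fetch the most recent daily close for `symbol` from Alpha Vantage."""
    resp = requests.get(
        ALPHA_VANTAGE_URL,
        params={
            "function": "TIME_SERIES_DAILY",
            "symbol": symbol,
            "apikey": os.environ["ALPHAVANTAGE_API_KEY"],  # assumed env var
        },
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()["Time Series (Daily)"]
    latest_date = max(series)  # ISO dates sort chronologically
    return float(series[latest_date]["4. close"])

print(latest_daily_close("AAPL"))
```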
| Metric | Description | What It Measures |
|---|---|---|
| Annualized Return | Total return normalized to a yearly basis | Raw profitability |
| Sharpe Ratio | Risk-adjusted return (excess return / volatility) | Return per unit of risk |
| Maximum Drawdown | Largest peak-to-trough decline | Worst-case loss scenario |
| Win Rate | Percentage of profitable trades | Decision accuracy |
| Calmar Ratio | Return / Maximum Drawdown | Return relative to tail risk |
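Under common conventions (zero risk-free rate, per-period wins standing in for per-trade win rate), these metrics can be computed from an equity curve as follows. The paper's exact formulas may differ; this is a sketch:

```python
import numpy as np

def evaluate(equity: np.ndarray, periods_per_year: int = 252) -> dict:
    """Compute the table's metrics from an equity curve (portfolio value per period)."""
    returns = np.diff(equity) / equity[:-1]

    ann_return = (equity[-1] / equity[0]) ** (periods_per_year / len(returns)) - 1
    sharpe = np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1)

    running_peak = np.maximum.accumulate(equity)
    max_drawdown = ((running_peak - equity) / running_peak).max()

    win_rate = (returns > 0).mean()  # per-period proxy for per-trade win rate
    calmar = ann_return / max_drawdown if max_drawdown > 0 else float("inf")

    return {"ann_return": ann_return, "sharpe": sharpe,
            "max_drawdown": max_drawdown, "win_rate": win_rate, "calmar": calmar}

# Toy five-day equity curve; the numbers are illustrative only.
print(evaluate(np.array([10_000, 10_100, 9_800, 10_300, 10_500.0])))
```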
The benchmark reveals a striking disconnect between general AI capabilities and financial performance. Models that excel at coding, reasoning, and language tasks often perform poorly at trading. This challenges the assumption that more capable models automatically make better traders.
Agents that performed consistently across all three markets shared one trait: strong risk management. Rather than chasing returns, these agents knew when not to trade and how to avoid large losses.
This suggests that for production trading agents, risk control modules should be prioritized over pure alpha generation.
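One simple way to encode that priority is a risk layer that can veto or shrink the agent's orders before they reach the broker. The thresholds and the `apply_risk_limits` helper below are illustrative, not values or code from the paper:

```python
def apply_risk_limits(target_weight: float,
                      current_drawdown: float,
                      max_position: float = 0.20,
                      drawdown_halt: float = 0.10) -> float:
    """Clamp an agent's desired position before execution.

    target_weight    : fraction of capital the agent wants in one asset
    current_drawdown : portfolio's current peak-to-trough decline
    """
    if current_drawdown >= drawdown_halt:
        return 0.0  # stand aside entirely: knowing when NOT to trade
    # Cap single-asset exposure regardless of the agent's conviction.
    return max(-max_position, min(max_position, target_weight))

# The agent wants 60% in one asset while already 4% below its peak:
print(apply_risk_limits(0.60, current_drawdown=0.04))   # -> 0.20
```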
AI trading strategies achieved excess returns more readily in highly liquid markets (crypto, US large-caps) than in policy-driven environments (A-shares).
| Market Type | AI Performance | Explanation |
|---|---|---|
| Liquid (Crypto, US) | Better | Price reflects information efficiently; technical patterns more reliable |
| Policy-driven (A-shares) | Worse | Government interventions create unpredictable regime changes |
Across the benchmark, the majority of LLM agents showed poor returns and inadequate risk management, exposing critical limitations of current autonomous agents in financial applications.
AI-Trader's findings complement the ProFiT framework:
| Aspect | AI-Trader (This Paper) | ProFiT |
|---|---|---|
| Approach | Benchmark LLM agents as-is | Evolve trading strategies via LLMs |
| Finding | Raw LLMs perform poorly at trading | LLM-evolved code can outperform baselines |
| Implication | Don't use LLMs directly as traders | Use LLMs to improve trading code |
Together, these papers suggest that LLMs are better as strategy developers than as direct traders—they can write and improve trading code, but shouldn't make real-time trading decisions themselves.
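A minimal sketch of that division of labor: the LLM proposes strategy code offline, a deterministic backtest scores it, and only the winning code ever touches the market. Here `llm_propose_strategy` and `backtest` are placeholders, and the loop is a caricature of the general idea, not ProFiT's actual algorithm:

```python
def evolve_strategy(llm_propose_strategy, backtest, generations: int = 10):
    """LLM as strategy developer: it writes code offline; only the code trades.

    llm_propose_strategy(parent_src) -> new strategy source (placeholder)
    backtest(src)                    -> fitness score, e.g. Calmar ratio (placeholder)
    """
    best_src, best_score = None, float("-inf")
    for _ in range(generations):
        candidate = llm_propose_strategy(best_src)  # offline, no live decisions
        score = backtest(candidate)                 # deterministic evaluation
        if score > best_score:
            best_src, best_score = candidate, score
    return best_src  # deploy the code, not the LLM, to the live market
```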
The benchmark implements several safeguards: live-only data streams that rule out training-set contamination, sandboxed agents running on identical infrastructure, and synchronized trading windows so no agent acts on information before another.
AI-Trader establishes the first rigorous benchmark for evaluating LLM agents in live financial markets. The key findings challenge optimistic assumptions about AI trading: general intelligence does not transfer to trading skill; risk control, not raw capability, separates the consistent performers; and excess returns come more easily in liquid markets than in policy-driven ones.
These findings provide critical direction for future research: rather than scaling model intelligence, focus on risk management, market-specific adaptation, and perhaps using LLMs to evolve trading strategies rather than execute them directly.
AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets
Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, Chao Huang — arXiv:2512.10971, December 2025
GitHub Repository: HKUDS/AI-Trader (open-source code and evaluation data)
Live Trading Leaderboard: real-time performance tracking of AI trading agents
Related: ProFiT (Program Search for Financial Trading), LLM-driven evolution of trading strategies and a complementary approach