AI-Trader: Benchmarking Autonomous Agents
in Real-Time Financial Markets

Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, Chao Huang
The University of Hong Kong
December 2025

Executive Summary

AI-Trader introduces the first fully automated, live, and data-uncontaminated evaluation benchmark for LLM agents in financial decision-making. Unlike prior benchmarks that rely on historical data (which risks contamination from LLM training sets), AI-Trader tests agents in real-time markets with live data streams.

The benchmark spans three major financial markets (U.S. stocks from the NASDAQ-100, Chinese A-shares from the SSE 50, and ten major cryptocurrencies) at both daily and hourly trading frequencies. Six mainstream LLMs are evaluated under identical conditions, with identical starting capital within each market ($10,000 USD, ¥100,000, or 50,000 USDT, as listed under Market Coverage).

Key finding: General intelligence does not translate to trading capability. Most agents exhibited poor returns and inadequate risk management. The study reveals that risk control capacity—not raw intelligence—determines consistent performance across different market regimes. AI strategies achieve excess returns more readily in highly liquid markets than in policy-driven environments.

ELI5: Can ChatGPT Beat the Stock Market?

Imagine giving the smartest AI models $10,000 and asking them to trade stocks, just like a human day trader would. No cheat sheets, no insider information—they have to search for news, analyze data, and make their own decisions in real-time. This study did exactly that across US stocks, Chinese stocks, and crypto. The surprising result? Being "smarter" at general tasks (like writing or coding) doesn't mean an AI is good at trading. The AIs that did best weren't necessarily the most intelligent—they were the ones that knew when NOT to trade and how to avoid big losses.

Part 1: Why We Need Live Trading Benchmarks

Existing financial AI benchmarks suffer from a critical flaw: data contamination. Large language models are trained on massive internet corpora that likely include historical financial data, analyst reports, and even retrospective market analyses. When tested on that same historical data, models may simply be "remembering" rather than "reasoning."

The Data Contamination Problem

AI-Trader solves this by testing on live, current market data that couldn't possibly be in any training set.

Part 2: The AI-Trader Framework

Fully Autonomous Minimal Information Paradigm

AI-Trader implements a strict evaluation protocol: agents receive no human guidance or curated signals, and must autonomously retrieve news, analyze market data, and make every trading decision themselves.

Market Coverage

Market | Universe | Starting Capital | Characteristics
U.S. Stocks | NASDAQ-100 components | $10,000 USD | High liquidity, 24/5 extended hours
A-Shares | SSE 50 index stocks | ¥100,000 | Policy-driven, T+1 settlement
Crypto | BTC, ETH, XRP, SOL, ADA, SUI, LINK, AVAX, LTC, DOT | 50,000 USDT | 24/7 trading, high volatility

Trading Frequencies

The benchmark supports both daily and hourly granularities to test different trading styles.

AI-Trader Architecture

Each agent receives identical starting capital, synchronized trading windows, and uniform market data feeds via an MCP toolchain, with prices from the Alpha Vantage API and market intelligence retrieved through Jina AI.

All agents operate in sandboxed environments with identical infrastructure.

Figure 1: AI-Trader benchmark architecture ensuring fair comparison across LLM agents.
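The fairness constraints in Figure 1 can be sketched as a shared per-agent configuration: every model in a cohort gets the same capital, market, and trading window. The names below are illustrative, not the repository's actual API.

```python
# Hypothetical sketch of the per-agent setup described in Figure 1.
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    model: str               # LLM backing the agent
    market: str              # "us_stocks" | "a_shares" | "crypto"
    starting_capital: float  # identical for every agent in a market
    frequency: str           # "daily" or "hourly"


def make_cohort(models: list[str], market: str, capital: float,
                frequency: str = "daily") -> list[AgentConfig]:
    """Every model receives the same capital, market, and trading window."""
    return [AgentConfig(m, market, capital, frequency) for m in models]
```

Holding everything but the model fixed is what lets differences in outcomes be attributed to the LLM rather than to the environment.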

Evaluation Metrics

Metric | Description | What It Measures
Annualized Return | Total return normalized to a yearly basis | Raw profitability
Sharpe Ratio | Risk-adjusted return (mean return / volatility) | Return per unit of risk
Maximum Drawdown | Largest peak-to-trough decline | Worst-case loss scenario
Win Rate | Percentage of profitable trades | Decision accuracy
Calmar Ratio | Annualized return / maximum drawdown | Return relative to tail risk
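The metrics in the table above can all be computed from an agent's equity curve. A minimal sketch, assuming 252 trading periods per year and a zero risk-free rate (both illustrative choices, not values from the paper):

```python
import math


def evaluate(equity: list[float], periods_per_year: int = 252) -> dict:
    """Compute the benchmark-style metrics from a portfolio value series."""
    returns = [equity[i] / equity[i - 1] - 1 for i in range(1, len(equity))]

    # Annualized return: total growth scaled to a yearly basis.
    years = len(returns) / periods_per_year
    ann_return = (equity[-1] / equity[0]) ** (1 / years) - 1

    # Sharpe ratio: mean period return over its volatility, annualized.
    mean_r = sum(returns) / len(returns)
    var = sum((r - mean_r) ** 2 for r in returns) / len(returns)
    sharpe = (mean_r / math.sqrt(var)) * math.sqrt(periods_per_year) if var > 0 else 0.0

    # Maximum drawdown: largest peak-to-trough decline of the equity curve.
    peak, max_dd = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        max_dd = max(max_dd, (peak - v) / peak)

    # Win rate: fraction of periods with a positive return.
    win_rate = sum(1 for r in returns if r > 0) / len(returns)

    # Calmar ratio: annualized return relative to tail risk.
    calmar = ann_return / max_dd if max_dd > 0 else float("inf")

    return {"ann_return": ann_return, "sharpe": sharpe,
            "max_drawdown": max_dd, "win_rate": win_rate, "calmar": calmar}
```

Note that the Calmar ratio is undefined when an agent never draws down, which is why the sketch returns infinity in that case.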

Part 3: Key Findings

Critical Finding: Intelligence ≠ Trading Ability

The benchmark reveals a striking disconnect between general AI capabilities and financial performance. Models that excel at coding, reasoning, and language tasks often perform poorly at trading. This challenges the assumption that more capable models automatically make better traders.

Finding 1: Risk Control Determines Cross-Market Robustness

Agents that performed consistently across all three markets shared one trait: strong risk management. Rather than chasing returns, these agents contained drawdowns and preserved capital across market regimes.

This suggests that for production trading agents, risk control modules should be prioritized over pure alpha generation.

Finding 2: Market Structure Matters

AI trading strategies achieved excess returns more readily in highly liquid markets (crypto, US large-caps) than in policy-driven environments (A-shares).

Market Type | AI Performance | Explanation
Liquid (Crypto, US) | Better | Prices reflect information efficiently; technical patterns are more reliable
Policy-driven (A-shares) | Worse | Government interventions create unpredictable regime changes

Finding 3: Most Agents Exhibit Poor Returns

Across the benchmark, the majority of LLM agents showed poor returns and inadequate risk management.

This exposes critical limitations in current autonomous agents for financial applications.

Part 4: Implications for AI Trading Systems

Lessons for Building Production Trading Agents

  1. Prioritize risk over returns: Build robust stop-loss and position sizing before optimizing for alpha
  2. Match market to method: AI strategies work better in liquid, technically driven markets
  3. Test on live data: Historical backtests alone are insufficient due to contamination risk
  4. Don't assume intelligence transfers: A model's general capability score doesn't predict trading performance
  5. Monitor for regime changes: Strategies that work in one market may fail catastrophically in another
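Lesson 1 above can be sketched as two guardrails applied before any signal logic. The thresholds (a 10% position cap and a 5% stop-loss) are illustrative assumptions, not values from the paper:

```python
def position_size(equity: float, price: float, max_fraction: float = 0.10) -> int:
    """Cap any single position at a fixed fraction of portfolio equity."""
    return int((equity * max_fraction) // price)


def should_stop_out(entry_price: float, current_price: float,
                    stop_loss_pct: float = 0.05) -> bool:
    """Exit once a position has lost more than stop_loss_pct from entry."""
    return (entry_price - current_price) / entry_price > stop_loss_pct
```

The point of ordering it this way is that sizing and stop-loss rules bound the worst case regardless of how good (or bad) the alpha signal turns out to be.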

Connection to ProFiT and Strategy Evolution

AI-Trader's findings complement the ProFiT framework in an interesting way:

Aspect | AI-Trader (This Paper) | ProFiT
Approach | Benchmark LLM agents as-is | Evolve trading strategies via LLMs
Finding | Raw LLM agents perform poorly at trading | LLM-evolved code can outperform baselines
Implication | Don't use LLMs directly as traders | Use LLMs to improve trading code

Together, these papers suggest that LLMs are better as strategy developers than as direct traders—they can write and improve trading code, but shouldn't make real-time trading decisions themselves.

Part 5: Technical Infrastructure

Data Sources & Integration
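As noted in Figure 1, prices come from the Alpha Vantage API and market intelligence from Jina AI, both served to agents through an MCP toolchain. As a sketch of the price side: Alpha Vantage's public REST endpoint takes `function`, `symbol`, and `apikey` query parameters. The helper below only constructs the request URL (no network call) and is illustrative, not the benchmark's actual integration code.

```python
from urllib.parse import urlencode

ALPHA_VANTAGE = "https://www.alphavantage.co/query"


def daily_prices_url(symbol: str, api_key: str) -> str:
    """Build the request URL for daily OHLCV prices of one ticker."""
    params = {"function": "TIME_SERIES_DAILY",
              "symbol": symbol,
              "apikey": api_key}
    return f"{ALPHA_VANTAGE}?{urlencode(params)}"
```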

Preventing Look-Ahead Bias

The benchmark implements safeguards against look-ahead bias, the most fundamental being live evaluation itself: agents trade only on data generated after every model's training cutoff, so no future information can leak in from pretraining.
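One way to enforce look-ahead prevention in code is a timestamp-gated feed: an agent querying at decision time t can only see records stamped at or before t. A minimal sketch, illustrative rather than the benchmark's actual implementation:

```python
from datetime import datetime


class GatedFeed:
    """Data feed that hides any record newer than the query time."""

    def __init__(self, records: list[tuple[datetime, dict]]):
        # records: (timestamp, payload) pairs, assumed sorted by timestamp.
        self.records = records

    def visible(self, now: datetime) -> list[dict]:
        """Return only the data available at decision time `now`."""
        return [payload for ts, payload in self.records if ts <= now]
```

In a live setting the gate is enforced by reality itself (future data does not exist yet); the same interface keeps any replay or audit of the run honest.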

Conclusion

AI-Trader establishes the first rigorous benchmark for evaluating LLM agents in live financial markets. Its key findings challenge optimistic assumptions about AI trading: general intelligence does not transfer to trading ability, risk control rather than raw capability drives cross-market robustness, and market structure determines where AI strategies can earn excess returns.

These findings provide critical direction for future research: rather than scaling model intelligence, focus on risk management, market-specific adaptation, and perhaps using LLMs to evolve trading strategies rather than execute them directly.

Primary Sources

AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets
Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, Chao Huang — arXiv:2512.10971, December 2025

GitHub Repository: HKUDS/AI-Trader
Open-source code and evaluation data

Live Trading Leaderboard
Real-time performance tracking of AI trading agents

Related: ProFiT (Program Search for Financial Trading)
LLM-driven evolution of trading strategies—a complementary approach