StockBench: Can LLM Agents Trade Stocks Profitably?
A Contamination-Free Multi-Month Benchmark

Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li
Tsinghua University, Beijing University of Posts and Telecommunications
October 2025

Executive Summary

StockBench introduces the first contamination-free benchmark for evaluating LLM-based trading agents in realistic multi-month market environments. Unlike static financial knowledge tests, this benchmark tests whether AI agents can actually make profitable sequential trading decisions across 82 trading days using real prices, fundamentals, and news for the top 20 DJIA stocks.

The critical finding: strong performance on financial knowledge tasks does not translate to trading success. Most LLM agents struggle to beat a simple buy-and-hold baseline. The top-ranked agent by composite score, Kimi-K2, returned +1.9% with a -11.8% max drawdown and a Sortino ratio of 0.0420, while the passive baseline returned only +0.4% with a -15.2% drawdown. Market regime sensitivity remains a problem, however: most agents failed to outperform the baseline during the downturn (Jan-Apr 2025) but beat it during the upturn (May-Aug 2025).

The benchmark reveals a substantial gap between LLMs' reasoning abilities and effective decision-making in dynamic, noisy financial markets—challenging assumptions that general intelligence translates to domain-specific performance.

🎯 ELI5: Testing AI Stock Traders

Imagine giving someone a financial trivia test versus actually letting them manage real money for four months. Passing the test doesn't mean they'll make good trades! StockBench does the "real money test" for AI—giving LLMs daily market information and seeing if they can actually grow a $100,000 portfolio. The result? Most AIs that ace financial quizzes still lose to the simple strategy of "just buy everything and hold it." It's like finding out a chess genius can't actually win at poker.

StockBench Overview
Figure 1: StockBench architecture showing the back-trading environment workflow—from daily market signals through the four-stage agent pipeline to trade execution and portfolio evaluation.

Part 1: Benchmark Design Philosophy

StockBench addresses critical limitations in existing financial AI evaluation. Traditional benchmarks like FinBen test static knowledge through question-answering, but knowing what a P/E ratio is differs fundamentally from making sequential decisions under uncertainty. The benchmark embodies three key design principles:

Core Design Principles

Investment Universe

The benchmark focuses on the top 20 DJIA stocks by index weight, representing blue-chip companies across diverse sectors. This provides sufficient diversification while maintaining tractable complexity for evaluating agent capabilities.

Stock Industry Distribution
Figure 2: Industry distribution of selected stocks, showing sectoral diversity in the investment universe.

Four-Stage Agent Workflow

Each trading day, agents execute a structured decision pipeline:

  1. Portfolio Overview: Agent scans all available stocks and current positions
  2. In-Depth Analysis: Selected stocks examined with fundamental data, price history, and news
  3. Decision Generation: Explicit increase/decrease/hold decisions with reasoning
  4. Execution & Validation: Convert decisions to share quantities; flag liquidity issues or errors

This pipeline mirrors institutional trading workflows while enabling reproducible agent evaluation.
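As a rough illustration of stage 4, the sketch below converts decisions into whole-share trades and flags orders that cannot be filled. It is not the benchmark's implementation: the `Decision` class, its `target_value` field (a desired dollar value per position), the sell-before-buy ordering, and the tickers, prices, and cash balance are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import floor


@dataclass
class Decision:
    symbol: str
    action: str          # "increase", "decrease", or "hold"
    target_value: float  # hypothetical field: desired dollar value of the position


def execute(decisions, prices, holdings, cash):
    """Turn target dollar values into whole-share trades and flag what cannot be filled."""
    trades, issues = {}, []
    # Assumption: process sells first so their proceeds can fund same-day buys.
    ordered = sorted(decisions, key=lambda d: d.action != "decrease")
    for d in ordered:
        if d.action == "hold":
            continue
        if d.action not in ("increase", "decrease"):
            issues.append(f"{d.symbol}: unknown action {d.action!r}")
            continue
        price = prices[d.symbol]
        delta = floor(d.target_value / price) - holdings.get(d.symbol, 0)
        wrong_direction = (d.action == "increase" and delta <= 0) or (
            d.action == "decrease" and delta >= 0
        )
        if wrong_direction:
            issues.append(f"{d.symbol}: target does not match action {d.action!r}")
            continue
        cost = delta * price  # negative for sells, which add cash
        if cost > cash:
            issues.append(f"{d.symbol}: insufficient cash for {delta} shares")
            continue
        cash -= cost
        trades[d.symbol] = delta
    return trades, cash, issues


# Hypothetical inputs, not benchmark data.
decisions = [
    Decision("AAPL", "increase", 5000.0),
    Decision("MSFT", "decrease", 1000.0),
    Decision("JPM", "increase", 50000.0),
]
prices = {"AAPL": 200.0, "MSFT": 400.0, "JPM": 250.0}
trades, cash_left, issues = execute(decisions, prices, {"MSFT": 10}, 3000.0)
print(trades, cash_left, issues)
```

In this toy run the MSFT sale funds the AAPL purchase, while the JPM order is flagged rather than executed because it exceeds the available cash.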

Part 2: Evaluation Metrics

StockBench employs a multi-dimensional evaluation framework capturing both returns and risk management—critical for realistic trading assessment:

Metric         | Definition                                 | What It Measures
---------------|--------------------------------------------|---------------------------------------------
Final Return   | (V_T - V_0) / V_0 × 100%                   | Raw profitability over the evaluation period
Max Drawdown   | Largest peak-to-trough decline             | Worst-case loss exposure / risk control
Sortino Ratio  | Risk-adjusted return (downside only)       | Return per unit of harmful volatility
Composite Rank | [z(Return) - z(Drawdown) + z(Sortino)] / 3 | Balanced performance across all dimensions

Composite Score = (z_return - z_drawdown + z_sortino) / 3
where z_x is the standardized score (z-score) for metric x
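The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code. Two assumptions are made here: z-scores use the population standard deviation across the compared agents, and drawdown enters the composite as a positive magnitude, so that subtracting its z-score penalizes deeper drawdowns. The portfolio values and the three agents are made up.

```python
from statistics import mean, pstdev


def final_return(values):
    """(V_T - V_0) / V_0 * 100 for a series of portfolio values."""
    return (values[-1] - values[0]) / values[0] * 100.0


def max_drawdown(values):
    """Largest peak-to-trough decline, as a non-positive percentage."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = min(worst, (v - peak) / peak)
    return worst * 100.0


def zscores(xs):
    """Standardize a list of metric values (population standard deviation)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def composite_scores(returns, drawdown_magnitudes, sortinos):
    """(z_return - z_drawdown + z_sortino) / 3 for each agent."""
    zr = zscores(returns)
    zd = zscores(drawdown_magnitudes)  # assumption: positive magnitudes
    zs = zscores(sortinos)
    return [(r - d + s) / 3 for r, d, s in zip(zr, zd, zs)]


# Hypothetical portfolio path, not benchmark data.
values = [100_000, 104_000, 98_800, 101_000]
ret = final_return(values)
mdd = max_drawdown(values)

# Three hypothetical agents: returns (%), drawdown magnitudes (%), Sortino ratios.
scores = composite_scores([2.0, 1.0, 0.0], [10.0, 12.0, 14.0], [0.03, 0.02, 0.01])
print(ret, mdd, scores)
```

Because each z-score depends on the whole set of agents being compared, a composite score computed over a subset of models will generally not reproduce the ranks from a full leaderboard.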

Why Sortino Over Sharpe?

Traditional Sharpe ratio penalizes all volatility equally, but upside volatility (prices rising faster than expected) is actually desirable! The Sortino ratio only penalizes downside deviation, providing a more accurate risk-adjusted return metric for asymmetric investment outcomes.
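A small numerical sketch makes the difference concrete. The formulas below are common per-period textbook forms with a target return of zero and no annualization. The exact variant used by the benchmark is not stated here, so treat the function definitions and the two return series as assumptions.

```python
import math
from statistics import mean, pstdev


def sharpe(returns, target=0.0):
    """Mean excess return divided by the standard deviation of all excess returns."""
    excess = [r - target for r in returns]
    return mean(excess) / pstdev(excess)


def sortino(returns, target=0.0):
    """Mean excess return divided by downside deviation (only shortfalls count)."""
    excess = [r - target for r in returns]
    downside = math.sqrt(mean([min(e, 0.0) ** 2 for e in excess]))
    return mean(excess) / downside


# Two hypothetical series with the same mean and identical losing days.
# The second concentrates its gains in a single large up day.
steady = [0.02, 0.02, -0.01, -0.01]
spiky = [0.04, 0.00, -0.01, -0.01]

print(sharpe(steady), sharpe(spiky))    # Sharpe is lower for the spiky series
print(sortino(steady), sortino(spiky))  # Sortino is the same for both
```

The spiky series is penalized by Sharpe purely for its large up day, while Sortino treats the two series alike because their losses are identical.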

Part 3: Benchmark Results

Testing across state-of-the-art proprietary and open-weight models reveals substantial variation in trading capability:

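Rank | Model            | Return (%) | Max Drawdown (%) | Sortino Ratio
-----|------------------|------------|------------------|--------------
1    | Kimi-K2          | +1.9%      | -11.8%           | 0.0420
2    | Qwen3-235B-Ins   | +2.4%      | -11.2%           | 0.0299
3    | GLM-4.5          | +2.3%      | -13.7%           | 0.0295
9    | GPT-5            | +0.3%      | -13.1%           | 0.0132
12   | Passive Baseline | +0.4%      | -15.2%           | 0.0155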

Critical Finding: Intelligence ≠ Trading Ability

GPT-5, despite being a frontier model, ranked 9th out of 12. Its return was slightly below the passive baseline (+0.3% vs. +0.4%), although its drawdown was smaller (-13.1% vs. -15.2%). This demonstrates that general reasoning capability does not automatically translate to profitable trading decisions. The gap between static knowledge and dynamic execution remains fundamental.

Part 4: Market Regime Sensitivity

Perhaps the most striking finding: agent performance varies dramatically across market conditions.

Market Regime Performance
Figure 3: Cumulative returns across different market regimes—downturn (Jan-Apr 2025) versus upturn (May-Aug 2025), showing dramatic performance inversions.

Market Condition | Period       | Agent Performance vs Baseline
-----------------|--------------|--------------------------------
Downturn         | Jan-Apr 2025 | Most agents failed to outperform
Upturn           | May-Aug 2025 | Most agents succeeded

The Bull Market Illusion

During rising markets, any strategy that buys stocks tends to look good. The true test of trading skill is bear market navigation—where agents must recognize when to reduce exposure or exit positions. Current LLM agents show minimal defensive capability, suggesting they may be pattern-matching "buy signals" without understanding market regime transitions.
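One way to make the regime comparison concrete is to measure an agent's return in excess of the baseline separately for each window. The sketch below is illustrative only. The function names, the index-based window boundaries, and the portfolio values are assumptions for this example, not benchmark data.

```python
def window_return(values):
    """Percentage return from the first to the last value of a window."""
    return (values[-1] - values[0]) / values[0] * 100.0


def excess_by_regime(agent_values, baseline_values, regimes):
    """Agent return minus baseline return per regime.

    `regimes` maps a label to a (start, end) index pair, end inclusive.
    """
    result = {}
    for label, (start, end) in regimes.items():
        agent = window_return(agent_values[start:end + 1])
        baseline = window_return(baseline_values[start:end + 1])
        result[label] = agent - baseline
    return result


# Hypothetical value paths: both fall first, then recover.
agent_values = [100.0, 95.0, 88.0, 92.0, 99.0, 106.0]
baseline_values = [100.0, 96.0, 90.0, 94.0, 100.0, 105.0]
regimes = {"downturn": (0, 2), "upturn": (2, 5)}

excess = excess_by_regime(agent_values, baseline_values, regimes)
print(excess)
```

In this toy case the agent trails the baseline while prices fall and leads it while they rise, which is the pattern the table describes. A single full-period return would hide that split.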

Part 5: Error Patterns & Scalability

Common Agent Failures

The benchmark identified two major error categories:

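Error Type        | Description                                    | Model Impact
------------------|------------------------------------------------|----------------------------------------------------------
Arithmetic Errors | Incorrect position sizing, return calculations | Reasoning models (O3) better; Instruction models struggle
Schema Errors     | Malformed output, missing fields               | Reasoning models worse (more format deviations)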
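The two categories can be told apart mechanically. The following is a minimal sketch of such a check, assuming the agent emits one JSON object per decision. The field names (`symbol`, `action`, `shares`, `price`, `total`), the tolerance, and the sample outputs are hypothetical and not taken from the benchmark's actual output schema.

```python
import json

REQUIRED_FIELDS = ("symbol", "action", "shares", "price", "total")  # hypothetical schema
ALLOWED_ACTIONS = ("increase", "decrease", "hold")


def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_decision(raw, tolerance=0.01):
    """Return a list of error strings; an empty list means the output passed."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return ["schema: output is not valid JSON"]
    if not isinstance(decision, dict):
        return ["schema: expected a JSON object"]

    errors = [f"schema: missing field '{name}'"
              for name in REQUIRED_FIELDS if name not in decision]
    if errors:
        return errors
    if decision["action"] not in ALLOWED_ACTIONS:
        errors.append("schema: unknown action")
    if not all(is_number(decision[k]) for k in ("shares", "price", "total")):
        errors.append("schema: shares, price and total must be numbers")
        return errors

    # Arithmetic check: the stated total must equal shares * price.
    expected = decision["shares"] * decision["price"]
    if abs(expected - decision["total"]) > tolerance:
        errors.append(f"arithmetic: total {decision['total']} != shares * price = {expected}")
    return errors


ok = check_decision('{"symbol": "AAPL", "action": "increase", "shares": 10, "price": 200.0, "total": 2000.0}')
bad_math = check_decision('{"symbol": "AAPL", "action": "increase", "shares": 10, "price": 200.0, "total": 2100.0}')
bad_schema = check_decision('{"symbol": "AAPL", "action": "increase"}')
not_json = check_decision("I would buy AAPL")
print(ok, bad_math, bad_schema, not_json)
```

Schema checks run first because an arithmetic check is only meaningful once the required numeric fields are present and well-typed.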

Portfolio Scalability Degradation

"All evaluated models exhibit performance degradation as portfolio size increases, characterized by declining mean returns and rising return volatility."

This suggests current LLMs struggle with the combinatorial complexity of managing multiple positions simultaneously—a critical limitation for real-world asset management applications.

Information Source Importance

Ablation studies revealed that removing data sources progressively degraded performance:

Conclusion

StockBench reveals a substantial gap between LLMs' static financial knowledge and dynamic trading execution. While top-performing agents demonstrate potential for profitability and risk mitigation, the domain requires significant advancement in:

The benchmark positions LLMs as potential trading assistants rather than autonomous traders—complementing the AI-Trader and AlphaAgents findings that LLMs work better as strategy developers than direct market participants.

Primary Sources

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Chen et al., Tsinghua University, October 2025

StockBench Project Page
Official benchmark website with leaderboard

GitHub Repository
Open-source benchmark implementation