StockBench: Can LLM Agents Trade Stocks Profitably?

Executive Summary

StockBench introduces the first contamination-free benchmark for evaluating LLM-based trading agents in realistic multi-month market environments. Unlike static financial knowledge tests, this benchmark tests whether AI agents can actually make profitable sequential trading decisions across 82 trading days using real prices, fundamentals, and news for the top 20 DJIA stocks.

The critical finding: strong performance on financial knowledge tasks does not translate to trading success. Most LLM agents struggle against a simple buy-and-hold baseline. The top-ranked agent by composite score, Kimi-K2, achieved a +1.9% return with a -11.8% max drawdown and a Sortino ratio of 0.0420, while the passive baseline returned only +0.4% with a -15.2% drawdown. However, market regime sensitivity remains problematic: most agents failed to outperform the baseline during the downturn window (Jan-Apr 2025) but did outperform it during the upturn window (May-Aug 2025).

The benchmark reveals a substantial gap between LLMs' reasoning abilities and effective decision-making in dynamic, noisy financial markets—challenging assumptions that general intelligence translates to domain-specific performance.

🎯 ELI5: Testing AI Stock Traders

Imagine giving someone a financial trivia test versus actually letting them manage real money for four months. Passing the test doesn't mean they'll make good trades! StockBench does the "real money test" for AI—giving LLMs daily market information and seeing if they can actually grow a $100,000 portfolio. The result? Most AIs that ace financial quizzes still lose to the simple strategy of "just buy everything and hold it." It's like finding out a chess genius can't actually win at poker.

Part 1: Benchmark Design Philosophy

StockBench addresses critical limitations in existing financial AI evaluation. Traditional benchmarks like FinBen test static knowledge through question-answering, but knowing "what is a P/E ratio?" differs fundamentally from making sequential decisions under uncertainty. The benchmark embodies three key design principles:

Core Design Principles

Realistic: Uses actual market prices, fundamental metrics (P/E ratio, market cap, dividend yield), and the top-5 news articles from the previous 48 hours
Continuous: 82 trading days of sequential decisions rather than isolated predictions, replicating how real traders operate
Contamination-Free: The data period (March-June 2025) post-dates the training cutoffs of the evaluated LLMs, with a commitment to continuous dataset updates
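
To make the "top-5 news from the previous 48 hours" rule concrete, here is a minimal sketch of a look-ahead-safe news filter. The `NewsItem` class, its `relevance` score, and the `select_news` function are hypothetical names introduced for illustration; the source does not say how articles are ranked, so the ranking key is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class NewsItem:
    published: datetime
    headline: str
    relevance: float  # hypothetical score; the source does not define "top"


def select_news(items, decision_time, window_hours=48, top_k=5):
    """Return up to top_k items published in the window before decision_time."""
    window_start = decision_time - timedelta(hours=window_hours)
    # Strictly earlier than decision_time, so the agent never sees future news.
    eligible = [n for n in items if window_start <= n.published < decision_time]
    eligible.sort(key=lambda n: n.relevance, reverse=True)
    return eligible[:top_k]
```

The strict upper bound on `published` is the important design choice: an evaluation of sequential decisions is only meaningful if each day's inputs exclude anything published after the decision point.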

Investment Universe

The benchmark focuses on the top 20 DJIA stocks by index weight, representing blue-chip companies across diverse sectors. This provides sufficient diversification while maintaining tractable complexity for evaluating agent capabilities.

Four-Stage Agent Workflow

Each trading day, agents execute a structured decision pipeline:

Portfolio Overview: Agent scans all available stocks and current positions
In-Depth Analysis: Selected stocks examined with fundamental data, price history, and news
Decision Generation: Explicit increase/decrease/hold decisions with reasoning
Execution & Validation: Convert decisions to share quantities; flag liquidity issues or errors

This pipeline mirrors institutional trading workflows while enabling reproducible agent evaluation.
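
The final stage of the workflow, converting decisions into share quantities and flagging problems, can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the decision format (a target dollar value per ticker), whole-share rounding, and the function name `execute_decisions` are all hypothetical.

```python
import math


def execute_decisions(targets, prices, holdings, cash):
    """Turn target dollar values into share trades and flag invalid orders.

    targets:  ticker -> desired dollar value of the position (assumed format)
    prices:   ticker -> current price per share
    holdings: ticker -> shares currently held
    Returns (new_holdings, remaining_cash, issues).
    """
    new_holdings = dict(holdings)
    issues = []
    orders = []
    for ticker, target_value in targets.items():
        price = prices.get(ticker)
        if price is None or price <= 0:
            issues.append(f"{ticker}: no valid price")
            continue
        target_shares = math.floor(target_value / price)  # whole shares only
        delta = target_shares - new_holdings.get(ticker, 0)
        orders.append((delta, ticker, price))

    # Process sells (negative deltas) first so freed cash can fund buys.
    orders.sort(key=lambda order: order[0])
    for delta, ticker, price in orders:
        cost = delta * price
        if cost > cash:
            issues.append(f"{ticker}: insufficient cash for {delta} shares")
            continue
        cash -= cost
        new_holdings[ticker] = new_holdings.get(ticker, 0) + delta
    return new_holdings, cash, issues
```

Rejected orders are reported rather than silently resized, which keeps the agent's stated decision and the executed trade easy to compare afterwards.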

Part 2: Evaluation Metrics

StockBench employs a multi-dimensional evaluation framework capturing both returns and risk management—critical for realistic trading assessment:

Metric	Definition	What It Measures
Final Return	(V_T - V_0) / V_0 × 100%	Raw profitability over evaluation period
Max Drawdown	Largest peak-to-trough decline	Worst-case loss exposure / risk control
Sortino Ratio	Risk-adjusted return (downside only)	Return per unit of harmful volatility
Composite Rank	[z(Return) - z(Drawdown) + z(Sortino)] / 3, where z(·) is the z-score across models	Balanced performance across all dimensions
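
The metrics in the table can be computed from a series of daily portfolio values, as in the rough sketch below. Two assumptions are worth flagging: drawdown is passed to the composite as a positive magnitude (so that subtracting its z-score rewards smaller drawdowns), and z-scores use the population standard deviation across models. The function names are illustrative only.

```python
import statistics


def final_return_pct(values):
    """(V_T - V_0) / V_0 * 100 for a list of daily portfolio values."""
    return (values[-1] - values[0]) / values[0] * 100.0


def max_drawdown_pct(values):
    """Largest peak-to-trough decline, as a non-positive percentage."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = min(worst, (v - peak) / peak * 100.0)
    return worst


def zscores(xs):
    """Standardize a metric across models (population std, an assumption)."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in xs]


def composite_scores(returns, drawdown_magnitudes, sortinos):
    """[z(Return) - z(Drawdown) + z(Sortino)] / 3, one score per model.

    drawdown_magnitudes are assumed positive (e.g. 11.8 for -11.8%).
    """
    zr = zscores(returns)
    zd = zscores(drawdown_magnitudes)
    zs = zscores(sortinos)
    return [(r - d + s) / 3 for r, d, s in zip(zr, zd, zs)]
```

For example, a portfolio that moves 100 → 120 → 90 → 110 finishes with a 10% return but a -25% max drawdown, which is why the benchmark reports both.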

Why Sortino Over Sharpe?

Traditional Sharpe ratio penalizes all volatility equally, but upside volatility (prices rising faster than expected) is actually desirable! The Sortino ratio only penalizes downside deviation, providing a more accurate risk-adjusted return metric for asymmetric investment outcomes.
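
The difference is easiest to see side by side. The sketch below uses common textbook definitions with a target return of zero and no annualization; the exact parameters StockBench uses are not stated here, so treat these as assumptions.

```python
import math
import statistics


def sharpe_ratio(returns, target=0.0):
    """Mean excess return divided by the std of ALL excess returns."""
    excess = [r - target for r in returns]
    sd = statistics.pstdev(excess)
    return statistics.mean(excess) / sd if sd > 0 else float("nan")


def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation only."""
    excess = [r - target for r in returns]
    # Only returns below the target contribute to the denominator.
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    return statistics.mean(excess) / downside if downside > 0 else float("nan")


# Same mean and same losses; "spiky" concentrates its gains in one big up day.
steady = [0.02, -0.01, 0.02, -0.01]
spiky = [0.00, -0.01, 0.04, -0.01]
```

With these two hypothetical series, `sortino_ratio` gives the same value for both because their losses are identical, while `sharpe_ratio` scores `spiky` lower purely because of its larger upside swing.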

Part 3: Benchmark Results

Testing across state-of-the-art proprietary and open-weight models reveals substantial variation in trading capability. Selected rows are shown, ordered by composite rank:

Rank	Model	Return (%)	Max Drawdown (%)	Sortino Ratio
1	Kimi-K2	+1.9	-11.8	0.0420
2	Qwen3-235B-Ins	+2.4	-11.2	0.0299
3	GLM-4.5	+2.3	-13.7	0.0295
9	GPT-5	+0.3	-13.1	0.0132
12	Passive Baseline	+0.4	-15.2	0.0155

Critical Finding: Intelligence ≠ Trading Ability

GPT-5, despite being a frontier model, ranked 9th out of 12. Its return (+0.3%) and Sortino ratio (0.0132) were slightly below those of the passive baseline (+0.4%, 0.0155), and its drawdown was only modestly smaller (-13.1% vs -15.2%). This suggests that general reasoning capability does not automatically translate to profitable trading decisions. The gap between static knowledge and dynamic execution remains fundamental.

Part 4: Market Regime Sensitivity

Perhaps the most striking finding: agent performance varies dramatically across market conditions.

Market Condition	Period	Agent Performance vs Baseline
Downturn	Jan-Apr 2025	Most agents failed to outperform
Upturn	May-Aug 2025	Most agents succeeded

The Bull Market Illusion

During rising markets, any strategy that buys stocks tends to look good. The true test of trading skill is bear market navigation—where agents must recognize when to reduce exposure or exit positions. Current LLM agents show minimal defensive capability, suggesting they may be pattern-matching "buy signals" without understanding market regime transitions.
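
A regime comparison of this kind amounts to computing agent return minus baseline return separately for each window. The sketch below shows one way to do that; the function names and the `(date, value)` series format are hypothetical, and any numbers fed to it would be illustrative rather than benchmark results.

```python
from datetime import date


def window_return_pct(series, start, end):
    """Percent return between the first and last observation in [start, end].

    series: list of (date, portfolio_value) pairs sorted by date.
    """
    vals = [v for d, v in series if start <= d <= end]
    if len(vals) < 2:
        raise ValueError("need at least two observations in the window")
    return (vals[-1] - vals[0]) / vals[0] * 100.0


def excess_by_window(agent, baseline, windows):
    """Agent return minus baseline return (percentage points) per window."""
    return {
        name: window_return_pct(agent, start, end)
        - window_return_pct(baseline, start, end)
        for name, (start, end) in windows.items()
    }


# Window boundaries mirror the table above; exact dates are an assumption.
WINDOWS = {
    "downturn": (date(2025, 1, 1), date(2025, 4, 30)),
    "upturn": (date(2025, 5, 1), date(2025, 8, 31)),
}
```

A negative excess in the downturn window paired with a positive one in the upturn window is the signature of an agent that mostly rides the market rather than managing exposure.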

Part 5: Error Patterns & Scalability

Common Agent Failures

Error Type	Description	Model Impact
Arithmetic Errors	Incorrect position sizing, return calculations	Reasoning models (O3) better; Instruction models struggle
Schema Errors	Malformed output, missing fields	Reasoning models worse (more format deviations)
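
Schema errors of the kind listed above are typically caught by validating each agent response before execution. The following is a minimal sketch, assuming the agent replies with a JSON object; the field names `ticker`, `action`, and `reasoning` are hypothetical, while the three action values come from the decision stage described earlier.

```python
import json

# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {"ticker": str, "action": str, "reasoning": str}
ALLOWED_ACTIONS = {"increase", "decrease", "hold"}


def validate_decision(raw):
    """Parse one agent response; return (decision_or_None, list_of_errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"malformed JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for field: {field}")

    action = data.get("action")
    if isinstance(action, str) and action not in ALLOWED_ACTIONS:
        errors.append(f"unknown action: {action}")
    return (data if not errors else None), errors
```

Counting the errors returned per model over an evaluation run would give a schema-error rate comparable to the pattern summarized in the table.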

Portfolio Scalability Degradation

"All evaluated models exhibit performance degradation as portfolio size increases, characterized by declining mean returns and rising return volatility."

This suggests current LLMs struggle with the combinatorial complexity of managing multiple positions simultaneously—a critical limitation for real-world asset management applications.

Information Source Importance

Ablation studies revealed that progressively removing input data sources degraded performance.

Conclusion

StockBench reveals a substantial gap between LLMs' static financial knowledge and dynamic trading execution. While top-performing agents demonstrate potential for profitability and risk mitigation, the domain requires significant advancement in:

Market Regime Adaptation: Current agents fail to recognize and respond to bear market conditions
Portfolio Scalability: Performance degrades with position count—limiting practical utility
Schema Compliance: Reasoning models produce more format errors despite better calculations
Defensive Strategies: No agent demonstrated consistent loss-limiting behavior

The benchmark positions LLMs as potential trading assistants rather than autonomous traders—complementing the AI-Trader and AlphaAgents findings that LLMs work better as strategy developers than direct market participants.

Primary Sources

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Chen et al., Tsinghua University, October 2025

StockBench Project Page
Official benchmark website with leaderboard

GitHub Repository
Open-source benchmark implementation
