StockBench introduces the first contamination-free benchmark for evaluating LLM-based trading agents in realistic multi-month market environments. Unlike static financial knowledge tests, this benchmark tests whether AI agents can actually make profitable sequential trading decisions across 82 trading days using real prices, fundamentals, and news for the top 20 DJIA stocks.
The critical finding: strong performance on financial knowledge tasks does not translate to trading success. Most LLM agents struggle against a simple buy-and-hold baseline. The best performer, Kimi-K2, achieved a +1.9% return with superior risk management (-11.8% max drawdown), while the passive baseline returned only +0.4% with a -15.2% drawdown. However, market regime sensitivity remains a problem: most agents failed to beat the baseline during the downturn (Jan-Apr 2025), while most beat it during the upturn (May-Aug 2025).
The benchmark reveals a substantial gap between LLMs' reasoning abilities and effective decision-making in dynamic, noisy financial markets—challenging assumptions that general intelligence translates to domain-specific performance.
Imagine giving someone a financial trivia test versus actually letting them manage real money for four months. Passing the test doesn't mean they'll make good trades! StockBench does the "real money test" for AI—giving LLMs daily market information and seeing if they can actually grow a $100,000 portfolio. The result? Most AIs that ace financial quizzes still lose to the simple strategy of "just buy everything and hold it." It's like finding out a chess genius can't actually win at poker.
StockBench addresses critical limitations in existing financial AI evaluation. Traditional benchmarks like FinBen test static knowledge through question-answering, but knowing "what is P/E ratio?" differs fundamentally from dynamic decision-making under uncertainty. The benchmark embodies three key design principles:
The benchmark focuses on the top 20 DJIA stocks by index weight, representing blue-chip companies across diverse sectors. This provides sufficient diversification while maintaining tractable complexity for evaluating agent capabilities.
Each trading day, agents execute a structured decision pipeline:
This pipeline mirrors institutional trading workflows while enabling reproducible agent evaluation.
StockBench employs a multi-dimensional evaluation framework capturing both returns and risk management—critical for realistic trading assessment:
| Metric | Definition | What It Measures |
|---|---|---|
| Final Return | (V_T - V_0) / V_0 × 100% | Raw profitability over evaluation period |
| Max Drawdown | Largest peak-to-trough decline | Worst-case loss exposure / risk control |
| Sortino Ratio | Risk-adjusted return (downside only) | Return per unit of harmful volatility |
| Composite Rank | [z(Return) - z(Drawdown) + z(Sortino)] / 3 | Balanced performance across all dimensions |
The traditional Sharpe ratio penalizes all volatility equally, but upside volatility (prices rising faster than expected) is actually desirable! The Sortino ratio penalizes only downside deviation, which makes it a better risk-adjusted return metric when gains and losses should not be treated symmetrically.
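To make the three base metrics concrete, here is a minimal sketch that computes them from a series of daily portfolio values. The function names, the zero target return, and the use of un-annualized daily returns for the Sortino ratio are assumptions made for illustration; the benchmark's own implementation may differ in these details.

```python
import math

def final_return_pct(values):
    """(V_T - V_0) / V_0 * 100 for a list of daily portfolio values."""
    return (values[-1] - values[0]) / values[0] * 100.0

def max_drawdown_pct(values):
    """Largest peak-to-trough decline, reported as a negative percentage."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                       # running high-water mark
        worst = min(worst, (v - peak) / peak * 100.0)
    return worst

def sortino_ratio(values, target=0.0):
    """Mean excess daily return divided by downside deviation (assumed form)."""
    rets = [b / a - 1.0 for a, b in zip(values, values[1:])]
    excess = [r - target for r in rets]
    # Only returns below the target contribute to the denominator.
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        return math.inf                           # no return fell below target
    return (sum(excess) / len(excess)) / downside

# Hypothetical five-day value path starting from $100,000.
values = [100_000, 102_000, 96_900, 99_000, 101_000]
print(final_return_pct(values), max_drawdown_pct(values), sortino_ratio(values))
```

In this made-up path the portfolio ends up slightly ahead, yet the drop from 102,000 to 96,900 still counts as the drawdown. That is why the benchmark reports return and drawdown separately.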
Testing across state-of-the-art proprietary and open-weight models reveals substantial variation in trading capability:
| Composite Rank | Model | Return | Max Drawdown | Sortino Ratio |
|---|---|---|---|---|
| 1 | Kimi-K2 | +1.9% | -11.8% | 0.0420 |
| 2 | Qwen3-235B-Ins | +2.4% | -11.2% | 0.0299 |
| 3 | GLM-4.5 | +2.3% | -13.7% | 0.0295 |
| 9 | GPT-5 | +0.3% | -13.1% | 0.0132 |
| 12 | Passive Baseline | +0.4% | -15.2% | 0.0155 |
GPT-5, despite being a frontier model, ranked 9th out of 12. Its +0.3% return fell just short of the passive baseline's +0.4%, although its drawdown was smaller (-13.1% versus -15.2%). This demonstrates that general reasoning capability does not automatically translate to profitable trading decisions. The gap between static knowledge and dynamic execution remains fundamental.
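The composite score defined in the metrics table can be sketched as follows, using only the five rows listed above. Two assumptions are worth flagging: drawdown is treated as a positive magnitude (so that subtracting its z-score rewards smaller drawdowns), and z-scores use the population standard deviation. Because z-scores depend on the full set of evaluated models, the scores from this subset will not equal the paper's values.

```python
from statistics import mean, pstdev

def zscores(xs):
    """Standardize a list of numbers (population standard deviation)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

# (model, return %, drawdown magnitude %, Sortino) copied from the table above.
rows = [
    ("Kimi-K2", 1.9, 11.8, 0.0420),
    ("Qwen3-235B-Ins", 2.4, 11.2, 0.0299),
    ("GLM-4.5", 2.3, 13.7, 0.0295),
    ("GPT-5", 0.3, 13.1, 0.0132),
    ("Passive Baseline", 0.4, 15.2, 0.0155),
]

z_ret = zscores([r[1] for r in rows])
z_dd = zscores([r[2] for r in rows])
z_sor = zscores([r[3] for r in rows])

# Composite = [z(Return) - z(Drawdown) + z(Sortino)] / 3
composite = {
    row[0]: (zr - zd + zs) / 3
    for row, zr, zd, zs in zip(rows, z_ret, z_dd, z_sor)
}
ranking = sorted(composite, key=composite.get, reverse=True)
print(ranking)
```

Even on this subset, Kimi-K2 comes out ahead of Qwen3-235B-Ins despite its lower raw return, because its Sortino ratio is well above the rest.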
Perhaps the most striking finding: agent performance varies dramatically across market conditions.
| Market Condition | Period | Agent Performance vs Baseline |
|---|---|---|
| Downturn | Jan-Apr 2025 | Most agents failed to outperform |
| Upturn | May-Aug 2025 | Most agents succeeded |
During rising markets, any strategy that buys stocks tends to look good. The true test of trading skill is bear market navigation—where agents must recognize when to reduce exposure or exit positions. Current LLM agents show minimal defensive capability, suggesting they may be pattern-matching "buy signals" without understanding market regime transitions.
The benchmark identified two major error categories:
| Error Type | Description | Model Impact |
|---|---|---|
| Arithmetic Errors | Incorrect position sizing, return calculations | Reasoning models (O3) better; Instruction models struggle |
| Schema Errors | Malformed output, missing fields | Reasoning models worse (more format deviations) |
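Schema errors are the easier of the two categories to catch mechanically. The sketch below shows one way a harness might validate an agent's daily output before executing it. The JSON layout, the `action` and `amount` field names, and the allowed action set are hypothetical, chosen only for illustration, and are not taken from the StockBench implementation.

```python
import json

ALLOWED_ACTIONS = {"buy", "sell", "hold"}  # hypothetical action set

def validate_decision(raw, tickers):
    """Return (decisions, errors) for one day's raw model output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"malformed JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for ticker in tickers:
        entry = data.get(ticker)
        if not isinstance(entry, dict):
            errors.append(f"{ticker}: missing entry")
            continue
        action = entry.get("action")
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            errors.append(f"{ticker}: invalid or missing 'action'")
        amount = entry.get("amount")
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or amount < 0:
            errors.append(f"{ticker}: invalid or missing 'amount'")
    return (data if not errors else None), errors

ok, errs = validate_decision('{"AAPL": {"action": "buy", "amount": 5000}}', ["AAPL"])
print(ok, errs)
```

A harness built this way could send the error list back to the model for a retry rather than silently dropping the trade, though whether the benchmark does so is not stated here.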
"All evaluated models exhibit performance degradation as portfolio size increases, characterized by declining mean returns and rising return volatility."
This suggests current LLMs struggle with the combinatorial complexity of managing multiple positions simultaneously—a critical limitation for real-world asset management applications.
Ablation studies revealed that removing data sources progressively degraded performance:
StockBench reveals a substantial gap between LLMs' static financial knowledge and dynamic trading execution. While top-performing agents demonstrate potential for profitability and risk mitigation, the domain requires significant advancement in:
The benchmark positions LLMs as potential trading assistants rather than autonomous traders—complementing the AI-Trader and AlphaAgents findings that LLMs work better as strategy developers than direct market participants.
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Chen et al., Tsinghua University, October 2025
StockBench Project Page
Official benchmark website with leaderboard
GitHub Repository
Open-source benchmark implementation