Research Papers

Curated collection of AI agent engineering research and analysis
24 papers
Vending-Bench & Project Vend: Long-Term Coherence of Autonomous Agents
Finance Benchmarks Agents
Synthesizes Vending-Bench 2 (Andon Labs) and Project Vend (Anthropic), both testing long-horizon agent coherence. Vending-Bench 2 leaderboard: Gemini 3 Pro ($5,478) and Claude Opus 4.5 ($4,967) from a $500 start; the theoretical optimum of ~$63K leaves roughly 10× headroom. Project Vend deployed Claude ("Claudius") in Anthropic's SF office: Phase 1 failures included pricing below cost, hallucinated payments, and an identity crisis. Phase 2 introduced a multi-agent hierarchy with a CEO agent, "Seymour Cash", applying profit pressure, which turned the money-losing shop into a profitable venture. Key insight: models trained for helpfulness struggle with hard-nosed business decisions, operating "like a friend who just wants to be nice." Procedural checks (forcing price verification) were the most effective intervention.
The Shadow Value of "Public" Information: AI vs Human Fund Managers
Finance Benchmarks
Stanford GSB research demonstrating that an AI analyst outperformed 93% of mutual fund managers over 30 years (1990-2020) using only publicly available data. AI-adjusted portfolios generated $17.1M in quarterly alpha versus human managers' $2.8M, roughly a six-fold improvement. Tested on 3,300 diversified U.S. equity funds using 170 public variables (Treasury rates, credit ratings, earnings call sentiment, firm size, trading volume). Counterintuitively, the AI primarily relied on simple variables but deployed sophisticated machine learning to extract maximum predictive value. Introduces the "shadow price" concept: the hidden processing cost of extracting value from free data.
StockBench: Can LLM Agents Trade Stocks Profitably in Real-world Markets?
Finance Benchmarks Agents
First contamination-free benchmark evaluating whether LLM agents can profitably execute sequential trading decisions across 82 trading days using real DJIA stock prices, fundamentals, and news. Tests state-of-the-art models (GPT-5, Kimi-K2, Qwen3-235B, Claude-4-Sonnet) against buy-and-hold baseline. Critical finding: general intelligence does not translate to trading ability—GPT-5 ranked 9th of 12, barely matching passive strategy. Best performer Kimi-K2 achieved +1.9% return with -11.8% max drawdown vs baseline's +0.4% with -15.2% drawdown.
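The return and drawdown figures quoted above can be reproduced for any equity curve. A minimal sketch (function names are illustrative, not from the paper):

```python
def total_return(equity):
    """Fractional return over the whole equity curve."""
    return equity[-1] / equity[0] - 1.0

def max_drawdown(equity):
    """Largest peak-to-trough decline, as a negative fraction
    (the convention used above, e.g. -11.8%)."""
    peak, worst = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        worst = min(worst, (v - peak) / peak)
    return worst

curve = [100, 104, 98, 101, 95, 102]  # toy daily portfolio values
print(round(total_return(curve), 3))  # 0.02
print(round(max_drawdown(curve), 3))  # -0.087
```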
AlphaAgents: LLM Multi-Agent System for Equity Portfolio Construction
Finance Multi-Agent Agents
BlackRock research introducing role-based multi-agent framework for systematic stock selection using three specialized LLM agents: Fundamental (10-K/10-Q analysis), Sentiment (news and analyst ratings), and Valuation (price and volume metrics). Built on Microsoft AutoGen, agents engage in structured Round Robin debate when analyses diverge, producing consensus recommendations with transparent reasoning trails. Framework mirrors institutional investment committee reasoning, providing audit-ready discussion logs for regulatory compliance.
AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets
Finance Benchmarks Agents
First fully-automated, live, data-uncontaminated benchmark for LLM trading agents, testing six mainstream models across three markets: U.S. stocks (NASDAQ-100), Chinese A-shares (SSE 50), and cryptocurrencies (10 major assets). Implements "fully autonomous minimal information paradigm" where agents independently search, verify, and synthesize live market data without human assistance. Key finding: general intelligence does not translate to trading ability. Provides live leaderboard at ai4trade.ai.
ProFiT: Program Search for Financial Trading
Finance Training Agents
LLM-driven evolutionary framework for autonomous discovery and improvement of algorithmic trading strategies. Unlike traditional approaches that tune parameters within fixed architectures, ProFiT evolves executable Python source code of trading strategies. Achieves +44.21% mean improvement in annualized return over seed strategies, +0.57 Sharpe ratio improvement, with 77%+ of evolved strategies beating Buy-and-Hold across seven liquid futures assets.
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Agents Training Architecture
Landmark framework enabling AI systems to autonomously modify their own code for improved problem-solving. Replaces theoretical Gödel Machine's formal proofs with empirical benchmark validation. Achieves +150% improvement on SWE-bench (20%→50%) and +116% on Polyglot (14.2%→30.7%) through iterative self-modification cycles. Combines self-referential improvement with open-ended exploration maintaining an archive of all viable agents as stepping stones. Open-sourced at github.com/jennyzzt/dgm.
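The archive-plus-empirical-validation loop can be caricatured in a few lines. A toy sketch in which a single number stands in for agent code and a fixed objective stands in for SWE-bench (all names and acceptance rules are illustrative, not the paper's):

```python
import random

def benchmark(agent):
    """Stand-in for empirical validation (the paper uses SWE-bench and
    Polyglot); here the 'agent' is a number and the optimum is 42."""
    return -abs(agent - 42)

def self_modify(agent):
    """Stand-in for the LLM rewriting its own agent code."""
    return agent + random.choice([-3, -1, 1, 3])

random.seed(0)
archive = [0]  # every validated agent is kept as a stepping stone
for _ in range(300):
    parent = random.choice(archive)   # open-ended: any ancestor may branch
    child = self_modify(parent)
    if benchmark(child) > benchmark(parent):  # keep empirically validated gains
        archive.append(child)

best = max(archive, key=benchmark)
```

The key design point mirrored here is that the archive is never pruned to the single best agent: temporarily weaker lineages remain available as parents for later branches.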
🚇 The Evolution of LLM Reasoning: A Metro Map
Reasoning Learning
Interactive visual timeline tracing the evolution of reasoning in large language models from Chain-of-Thought (2022) through Tree of Thoughts, ReAct, and modern Large Reasoning Models (2025). Organized as a "metro map" with four lines representing different paradigms: Chain-of-Thought family, Action/Agentic reasoning, Tree/Search methods, and Program-based approaches. Features ARC Prize 2025 analysis showing current state: winner at 24%, Gemini+refinement at 54%, humans at 85%.
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Reasoning Architecture
Landmark theoretical paper proving why chain-of-thought prompting works: CoT enables transformers to perform serial computation otherwise impossible in parallel architectures. Establishes that constant-depth transformers without CoT solve only AC⁰ problems, but with T CoT steps can compute any problem solvable by size-T circuits. Key insight: CoT length should match problem's "serial depth"—explaining why step-by-step reasoning helps arithmetic and planning but not pattern matching.
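Parity makes the paper's point concrete: PARITY is the textbook problem outside AC⁰, yet it falls to one serial step per bit. A minimal illustration of "CoT as serial compute," where the list of partial results plays the role of emitted thought tokens:

```python
def parity_with_cot(bits):
    """PARITY is not in AC0 (no constant-depth, poly-size circuit family
    computes it), but n serial steps, one XOR each, solve it; the running
    partial results act as chain-of-thought tokens."""
    chain = []
    acc = 0
    for b in bits:
        acc ^= b          # one serial "thought" step
        chain.append(acc)
    return acc, chain

result, chain = parity_with_cot([1, 0, 1, 1])
print(result)  # 1
print(chain)   # [1, 1, 0, 1]
```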
Towards a Science of Scaling Agent Systems: Quantitative Principles for Multi-Agent Coordination
Multi-Agent Agents Benchmarks
Landmark empirical study establishing quantitative scaling principles for multi-agent systems through controlled evaluation of 180 configurations across three LLM families (OpenAI, Google, Anthropic) and four benchmarks. Reveals highly heterogeneous MAS performance (from +81% improvement to -70% degradation) determined by task structure, not agent count. Introduces a predictive mixed-effects model achieving cross-validated R² = 0.513 and 87% accuracy in predicting the optimal architecture.
Measuring Agents in Production: Empirical Study of 306 Practitioners
Agents Reliability Benchmarks
Landmark empirical study of 306 surveyed practitioners and 20 in-depth interviews across 26 domains, revealing how AI agents actually work in production. Key findings: 68% execute ≤10 steps before human intervention, 70% use off-the-shelf models without fine-tuning, 85% build custom implementations rather than use frameworks, and 74% rely on human evaluation rather than benchmarks. Demonstrates that successful teams deliberately trade capability for controllability.
Continuous Thought Machines: Neural Synchronization for Emergent Intelligence
Architecture Reasoning
Groundbreaking architecture from Sakana AI treating temporal dynamics as fundamental computation. Introduces Neuron-Level Models (NLMs) giving each neuron private temporal processing, and synchronization matrices capturing neural correlation patterns as core representations. Achieves 6× generalization beyond training on mazes (39×39 → 99×99), near-perfect accuracy on cumulative parity where LSTMs fail, and better-than-human calibration on image classification.
Ilya's Favorite Papers: A Curated Learning Path for AI
Learning Architecture
Comprehensive collection of ~40 foundational papers curated by Ilya Sutskever (former Chief Scientist at OpenAI) as the definitive learning path for understanding modern AI. Covers progression from CNNs and RNNs to Transformers, explores attention mechanisms, memory-augmented networks, and theoretical foundations in information theory and complexity science.
When Will AI Become Reliable? Half-Life Analysis & Long Task Completion
Reliability Agents
Synthesis of Toby Ord's half-life framework with METR's exponential-growth analysis. AI agents fail at a roughly constant rate per minute (the half-life model), while the task lengths they can complete double every 7 months. Projects specific reliability thresholds: reaching 90% reliability requires cutting task duration to roughly 1/7 of the 50%-success duration, and current models complete ~50-minute tasks at 50% success. Predicts month-long task automation by 2030.
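The 1/7 figure follows directly from the half-life model. A quick check, assuming the ~50-minute half-life quoted above:

```python
import math

def max_duration(reliability, half_life):
    """Longest task duration achievable at a target success rate under a
    constant-hazard model: p(t) = 0.5 ** (t / half_life)."""
    return half_life * math.log(reliability) / math.log(0.5)

half_life = 50.0                  # minutes at 50% success
t90 = max_duration(0.90, half_life)
print(round(t90, 1))              # 7.6 minutes
print(round(half_life / t90, 1))  # 6.6, i.e. roughly a 1/7 duration cut
```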
Where LLM Agents Fail and How They Can Learn: Systematic Error Analysis & Remediation
Reliability Agents Memory
Groundbreaking framework for understanding agent failures through cascading error analysis. Introduces AgentErrorTaxonomy classifying failures across memory, reflection, planning, action, and system operations. AgentDebug framework achieves 24% higher accuracy and 220% increase in error recovery by identifying root causes and delivering corrective feedback.
DreamGym: Scaling Agent Learning via Experience Synthesis
Training Agents
Breakthrough framework for training AI agents through synthetic experience synthesis. Introduces reasoning-based experience model that simulates environment dynamics, enabling scalable reinforcement learning without costly real-world interactions. Achieves 30%+ improvement on non-RL-ready tasks like WebArena using zero real environment interactions.
DeepSeek-OCR: Contexts Optical Compression
Architecture Memory
Revolutionary approach treating vision as a compression medium for text processing. Introduces DeepEncoder, achieving 7-20× text compression, with 97% accuracy at a 10× ratio, via a serial window+global attention architecture. Demonstrates 200k+ pages/day throughput while outperforming models that use 30× more tokens.
AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Agents Training
Novel trainable agentic framework coordinating four specialized modules (planner, executor, verifier, generator) through evolving memory. Introduces Flow-GRPO training method enabling direct optimization within live multi-turn interactions. Demonstrates 7B models surpassing GPT-4o with 14.9% gains on search tasks, 14.0% on agentic tasks.
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Memory Agents
Revolutionary approach treating contexts as "evolving playbooks" that accumulate detailed strategies through generation, reflection, and curation cycles. Introduces incremental delta updates achieving 82.3% reduction in adaptation latency versus GEPA and 83.6% token cost reduction versus Dynamic Cheatsheet. Matches GPT-4.1 production agent performance using smaller open-source models.
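The incremental-delta idea, editing a few playbook entries instead of regenerating the whole context, can be sketched as follows (the data structure and names are illustrative, not the paper's API):

```python
def apply_delta(playbook: dict, delta: dict) -> dict:
    """Apply a small edit (add/revise/remove entries) to an itemized
    context 'playbook' instead of rewriting it wholesale; untouched
    entries keep their tokens, which is where the latency and token-cost
    savings come from."""
    for key in delta.get("remove", []):
        playbook.pop(key, None)
    playbook.update(delta.get("add", {}))
    return playbook

playbook = {"s1": "Verify tool output before acting.",
            "s2": "Prefer cached results for repeated queries."}
delta = {"remove": ["s2"],
         "add": {"s3": "Re-plan after two consecutive tool failures."}}
apply_delta(playbook, delta)
print(sorted(playbook))  # ['s1', 's3']
```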
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Memory Reasoning Agents
Novel memory framework enabling AI agents to learn from both successful and failed experiences by distilling generalizable reasoning strategies. Introduces Memory-aware Test-Time Scaling (MaTTS) that creates synergy between memory quality and computational scaling. Demonstrates up to 34.2% relative improvement across web browsing and software engineering tasks.
Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once
Reliability Benchmarks
Comprehensive analysis revealing fundamental limitations in LLMs' ability to follow multiple simultaneous instructions. Introduces the ManyIFEval benchmark, showing exponential performance decay as instruction count grows across GPT-4o, Claude-3.5, and other models. Covers self-refinement as a mitigation strategy and discusses production implications.
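The exponential decay has a simple back-of-envelope form: if each instruction is satisfied independently with probability p, all k hold with probability p^k. A sketch (the independence assumption is ours; the benchmark measures the real joint rate):

```python
def joint_compliance(p_single: float, k: int) -> float:
    """Probability that all k instructions are followed, assuming each is
    satisfied independently with probability p_single."""
    return p_single ** k

for k in (1, 5, 10, 20):
    print(k, round(joint_compliance(0.95, k), 3))
# even 95% per-instruction compliance falls to ~60% at k=10, ~36% at k=20
```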
AI Benchmark Critique: Evidence of Invalid 2026 Predictions
Benchmarks
Critical analysis of METR and GDPval benchmarks, revealing statistical flaws, baseline inflation errors, and invalid extrapolation methods.
Recursive Self-Aggregation: Deep Thinking and Test-Time Scaling for LLM Reasoning
Reasoning Training
Groundbreaking test-time scaling method enabling smaller models to match larger reasoning models through iterative aggregation of reasoning chains.
The OaK Architecture: A Paradigm Shift in Artificial General Intelligence
Architecture Learning
Rich Sutton's vision for experience-based superintelligence through continual learning, hierarchical abstraction, and reward maximization.