Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once
Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Executive Summary
This research reveals a fundamental limitation in Large Language Models (LLMs): their ability to follow instructions deteriorates significantly as the number of simultaneous instructions increases. Through the introduction of ManyIFEval, a comprehensive benchmark dataset with up to 10 objectively verifiable instructions, researchers demonstrate that state-of-the-art models including GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, and Llama3.1 all exhibit declining performance with increased instruction complexity. This phenomenon, termed the "curse of instructions," follows a mathematical pattern where overall success rates decline exponentially with the number of instructions.
Research Context & Motivation
As Large Language Models become increasingly integrated into production systems requiring complex, multi-step task execution, understanding their limitations with multiple simultaneous instructions becomes critical for system reliability and user experience. Previous benchmarks focused on single-instruction scenarios, leaving a gap in understanding how LLMs handle realistic use cases involving multiple constraints and requirements.
Key Contributions
- ManyIFEval Benchmark: A large-scale dataset comprising task prompts with up to 10 objectively verifiable instructions per prompt
- Systematic Evaluation: Comprehensive testing of major LLMs (GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, Llama3.1) across varying instruction counts
- Mathematical Framework: Identification of the "curse of instructions" phenomenon with formal characterization
- Mitigation Strategy: Inference-time self-refinement approach to improve instruction-following performance
Methodology
Benchmark Design: ManyIFEval
- Scale: Large-scale dataset with systematic variation of instruction count
- Verification: Objectively verifiable instructions enabling automated evaluation
- Range: 1-10 simultaneous instructions per prompt
- Diversity: Multiple instruction types and task domains
Tested Models
- GPT-4o - OpenAI's latest multimodal model
- Claude-3.5 Sonnet - Anthropic's advanced reasoning model
- Gemini-1.5 - Google's latest generation model
- Gemma2 - Google's open-source model family
- Llama3.1 - Meta's latest open-source model
Evaluation Approach
- Baseline performance measurement across all models
- Systematic variation of instruction count (1-10)
- Individual instruction success rate tracking
- Overall task completion rate measurement
- Self-refinement intervention testing
Key Findings
The Curse of Instructions Phenomenon
As instruction count increases, models' ability to follow individual instructions deteriorates. The overall success rate follows a mathematical relationship:
P(all instructions followed) = P(individual instruction)^n
where n is the total number of instructions. This exponential decay means even small decreases in individual instruction following rates lead to dramatic drops in overall task success.
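The compounding effect can be sketched in a few lines; this is an illustrative calculation under the paper's independence assumption, not code from the paper:

```python
# Illustrative sketch: overall success decays as p**n when each of the
# n instructions is followed independently with probability p.

def overall_success(p: float, n: int) -> float:
    """Probability that all n instructions are followed,
    assuming each is followed independently with probability p."""
    return p ** n

# Even a 90% per-instruction rate collapses quickly as n grows:
for n in (1, 3, 5, 10):
    print(f"n={n:2d}: {overall_success(0.90, n):.3f}")
# n= 1: 0.900
# n= 3: 0.729
# n= 5: 0.590
# n=10: 0.349
```

At ten instructions, a model that follows each instruction 90% of the time satisfies all of them barely a third of the time, which is why even small per-instruction gains matter.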
Performance Degradation Across All Models
All tested models, including state-of-the-art commercial systems, exhibited significant performance decline with increased instruction complexity. No model maintained consistent performance across the full 1-10 instruction range.
Performance Results
Baseline Performance (Without Self-Refinement)
| Model | Single Instruction | Multiple Instructions (Avg) | Performance Trend |
|---|---|---|---|
| GPT-4o | ~85% | 15% | Significant decline |
| Claude 3.5 Sonnet | ~90% | 44% | Moderate decline |
| Gemini-1.5 | ~82% | Variable | Significant decline |
| Gemma2 | ~75% | Variable | Severe decline |
| Llama3.1 | ~78% | Variable | Severe decline |
Self-Refinement Improvement Results
| Model | Baseline Success Rate | With Self-Refinement | Improvement |
|---|---|---|---|
| GPT-4o | 15% | 31% | +107% relative (+16 pp) |
| Claude 3.5 Sonnet | 44% | 58% | +32% relative (+14 pp) |
Self-Refinement Mitigation Strategy
The researchers proposed an inference-time iterative self-refinement technique to improve instruction-following performance:
Self-Refinement Process
- Initial Generation: Model generates response to multi-instruction prompt
- Self-Evaluation: Model evaluates which instructions were successfully followed
- Targeted Refinement: Model regenerates response focusing on unfollowed instructions
- Iteration: Process repeats until convergence or maximum iterations
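The four steps above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: `generate` stands in for an LLM call, and `checkers` stand in for ManyIFEval's objective per-instruction verifiers.

```python
# Minimal sketch of an iterative self-refinement loop.
# `generate` and `checkers` are hypothetical stand-ins for a real
# LLM call and per-instruction verifiers, respectively.

from typing import Callable, List

def self_refine(
    prompt: str,
    instructions: List[str],
    checkers: List[Callable[[str], bool]],
    generate: Callable[[str], str],
    max_iters: int = 3,
) -> str:
    # Step 1: initial generation.
    response = generate(prompt)
    for _ in range(max_iters):
        # Step 2: evaluate which instructions the response failed.
        failed = [ins for ins, ok in zip(instructions, checkers)
                  if not ok(response)]
        if not failed:
            break  # Step 4: converged -- every instruction verified.
        # Step 3: regenerate, explicitly pointing at unmet instructions.
        feedback = ("Revise your answer. Unmet instructions:\n"
                    + "\n".join(failed))
        response = generate(prompt + "\n\n" + feedback)
    return response
```

Note that this loop relies on having verifiers at inference time; in ManyIFEval the instructions are objectively checkable, but in open-ended tasks the evaluation step would itself be an LLM call and therefore fallible.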
Self-Refinement Effectiveness
- GPT-4o: 107% relative improvement (15% → 31%)
- Claude 3.5 Sonnet: 32% relative improvement (44% → 58%)
- Limitation: While significant, self-refinement doesn't fully solve the curse of instructions
- Cost Trade-off: Requires multiple inference passes, increasing computational cost
Implications for Production Systems
System Design Considerations
- Instruction Decomposition: Break complex multi-instruction prompts into sequential single-instruction calls
- Verification Layers: Implement automated verification of instruction compliance
- Fallback Strategies: Design systems with graceful degradation for incomplete instruction following
- User Interface Design: Limit simultaneous instruction complexity in user-facing applications
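The decomposition idea in the first bullet can be sketched as a chain of single-constraint revision calls. This is a hypothetical pattern, not from the paper; `generate` stands in for an LLM API call.

```python
# Sketch of instruction decomposition: rather than one prompt carrying
# n constraints, apply each constraint in its own revision pass.
# `generate` is a hypothetical stand-in for an LLM call.

from typing import Callable, List

def apply_sequentially(task: str, instructions: List[str],
                       generate: Callable[[str], str]) -> str:
    draft = generate(task)  # unconstrained first draft
    for ins in instructions:
        # One constraint per call keeps each step in the regime
        # where per-instruction success rates are highest.
        draft = generate(
            "Rewrite the text below so that it satisfies this single "
            f"constraint: {ins}\n\n{draft}"
        )
    return draft
```

The trade-off is cost and latency: n constraints now cost n+1 calls, and later passes can undo constraints enforced earlier, so verification layers remain necessary.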
Performance Budgeting
For production systems requiring high reliability, the exponential decay formula provides a framework for performance budgeting:
- Target reliability: 95% → Maximum ~2 instructions (assuming 97.5% individual success rate)
- Target reliability: 90% → Maximum ~3 instructions (assuming 96.5% individual success rate)
- Target reliability: 80% → Maximum ~4-5 instructions (assuming 95% individual success rate)
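The budgets above follow from inverting the decay formula: the largest n with p^n >= target is floor(log(target) / log(p)). A small helper makes the calculation explicit (illustrative, under the same independence assumption):

```python
import math

def max_instructions(target: float, per_instruction: float) -> int:
    """Largest n such that per_instruction**n >= target, under the
    independence assumption Success_rate(n) = Success_rate(1)**n."""
    return math.floor(math.log(target) / math.log(per_instruction))

print(max_instructions(0.95, 0.975))  # -> 2
print(max_instructions(0.80, 0.95))   # -> 4
```

Because the floor is strict, borderline cases round down: a 96.5% per-instruction rate yields 0.965³ ≈ 0.899, just under a 90% target, which is why that row is stated as "~3".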
Comparison with Related Work
Multi-Task Inference Research
Parallel research on multi-task inference (MTI Bench) reports a contrasting finding: LLMs can actually improve when handling certain types of multiple simultaneous tasks:
| Aspect | Curse of Instructions | Multi-Task Inference |
|---|---|---|
| Task type | Multiple constraints on a single output | Multiple independent tasks |
| Performance trend | Declining with instruction count | Improving (7-12%) |
| Speed impact | Neutral to negative | 1.46x faster |
| Key insight | Constraint satisfaction difficulty | Context-sharing benefits |
Key Distinction: The curse of instructions applies specifically to constraints that must all be satisfied in a single output, while multi-task inference benefits apply to independent tasks that can leverage shared context.
Technical Analysis
Root Causes of Performance Degradation
- Attention Mechanism Limitations: Difficulty maintaining simultaneous focus on multiple constraints
- Working Memory Constraints: Implicit limitations in maintaining multiple active requirements
- Priority Ambiguity: Lack of explicit prioritization mechanisms for conflicting instructions
- Training Distribution: Under-representation of high-complexity multi-instruction examples in training data
Mathematical Characterization
Success_rate(n) ≈ Success_rate(1)^n
Where:
- n = number of instructions
- Success_rate(1) = success rate on individual instructions
- Success_rate(n) = probability of following all n instructions
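The relationship can also be run in reverse to back out the per-instruction rate implied by an observed overall rate. This is an illustrative calculation, and the 15%-at-n=10 pairing below is an assumption for the example (the reported 15% is a multi-instruction average, not necessarily the n=10 point):

```python
def implied_per_instruction(overall: float, n: int) -> float:
    """Invert Success_rate(n) = Success_rate(1)**n to estimate the
    per-instruction rate implied by an observed overall rate."""
    return overall ** (1.0 / n)

# A 15% overall rate at n=10 would imply roughly an 83% per-instruction
# rate -- a modest-looking drop that compounds into near-total failure.
print(round(implied_per_instruction(0.15, 10), 2))  # -> 0.83
```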
Future Directions
Research Opportunities
- Architectural Improvements: Explicit instruction-tracking mechanisms in model architectures
- Training Methodologies: Curriculum learning with progressive instruction complexity
- Prompt Engineering: Optimal instruction formatting and ordering strategies
- Hybrid Approaches: Combining LLMs with symbolic constraint solvers
Benchmarking Extensions
- Domain-specific instruction sets (code generation, creative writing, data analysis)
- Instruction conflict resolution scenarios
- Long-context multi-instruction following
- Multi-modal instruction following
Conclusions
This research establishes the "curse of instructions" as a fundamental limitation in current Large Language Models, with significant implications for production system design and reliability engineering. Key takeaways:
- Universal Limitation: All tested models, including state-of-the-art systems, exhibit performance degradation with increased instruction complexity
- Exponential Decay: Success rates follow a mathematical relationship that compounds individual instruction following rates
- Partial Mitigation: Self-refinement techniques provide meaningful improvements but don't eliminate the fundamental limitation
- System Design Impact: Production systems should be architected with explicit consideration of multi-instruction limitations
For developers building AI-powered applications, this research underscores the importance of:
- Decomposing complex requirements into sequential simple instructions
- Implementing robust verification mechanisms
- Designing for graceful degradation when instruction following is incomplete
- Understanding performance budgets based on instruction count
References
- Harada, K., Yamazaki, Y., Taniguchi, M., Kojima, T., Iwasawa, Y., & Matsuo, Y. "Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once." OpenReview.
- Related: Multi-Task Inference research (MTI Bench) on simultaneous task processing benefits
- ManyIFEval Benchmark Dataset (https://openreview.net/forum?id=R6q67CDBCH)
Report compiled for AI Agent Engineering Research Collection