Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once

Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Executive Summary

This research reveals a fundamental limitation of Large Language Models (LLMs): their ability to follow instructions deteriorates significantly as the number of simultaneous instructions increases. Using ManyIFEval, a benchmark whose task prompts carry up to 10 objectively verifiable instructions each, the authors show that state-of-the-art models including GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, and Llama3.1 all exhibit declining performance as the instruction count grows. This phenomenon, termed the "curse of instructions," follows a simple mathematical pattern: the overall success rate declines roughly exponentially with the number of instructions.

Research Context & Motivation

As Large Language Models become increasingly integrated into production systems requiring complex, multi-step task execution, understanding their limitations with multiple simultaneous instructions becomes critical for system reliability and user experience. Previous benchmarks focused on single-instruction scenarios, leaving a gap in understanding how LLMs handle realistic use cases involving multiple constraints and requirements.

Key Contributions

  1. ManyIFEval Benchmark: A large-scale dataset comprising task prompts with up to 10 objectively verifiable instructions per prompt
  2. Systematic Evaluation: Comprehensive testing of major LLMs (GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, Llama3.1) across varying instruction counts
  3. Mathematical Framework: Identification of the "curse of instructions" phenomenon with formal characterization
  4. Mitigation Strategy: Inference-time self-refinement approach to improve instruction-following performance

Methodology

Benchmark Design: ManyIFEval

ManyIFEval consists of task prompts with up to 10 objectively verifiable instructions attached to each prompt, so compliance with every individual instruction can be checked objectively rather than judged subjectively.

Tested Models

GPT-4o, Claude 3.5 Sonnet, Gemini-1.5, Gemma2, and Llama3.1.

Evaluation Approach

  1. Baseline performance measurement across all models
  2. Systematic variation of instruction count (1-10)
  3. Individual instruction success rate tracking
  4. Overall task completion rate measurement, i.e. whether every instruction in a prompt is followed (see the metric sketch after this list)
  5. Self-refinement intervention testing
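
The two metrics in steps 3 and 4 can be computed directly from per-instruction verifier outcomes. The sketch below is illustrative only; the data layout and function names are assumptions, not the paper's actual evaluation harness.

```python
# Illustrative metric computation: instruction-level vs. prompt-level success.
# Input format is an assumption: one dict per prompt, with a boolean verifier
# outcome for each instruction attached to that prompt.

def score(results):
    total_instructions = 0
    followed_instructions = 0
    fully_successful_prompts = 0

    for prompt in results:
        checks = prompt["instructions"]          # e.g. [True, False, True]
        total_instructions += len(checks)
        followed_instructions += sum(checks)
        if all(checks):                          # prompt counts only if ALL pass
            fully_successful_prompts += 1

    instruction_level = followed_instructions / total_instructions
    prompt_level = fully_successful_prompts / len(results)
    return instruction_level, prompt_level


demo = [
    {"instructions": [True, True, True]},    # all followed -> prompt succeeds
    {"instructions": [True, False, True]},   # one missed   -> prompt fails
]
print(score(demo))  # (0.833..., 0.5)
```

The gap between the two numbers is exactly what the curse of instructions describes: instruction-level accuracy can stay high while prompt-level accuracy collapses.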

Key Findings

The Curse of Instructions Phenomenon

As instruction count increases, models' ability to follow individual instructions deteriorates. The overall success rate follows a mathematical relationship:

P(all instructions followed) = P(individual instruction)^n

where n is the total number of instructions. This exponential decay means even small decreases in individual instruction following rates lead to dramatic drops in overall task success.
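
The practical impact is easy to see numerically. The sketch below simply tabulates p^n for a few illustrative per-instruction rates (the rates are examples, not the paper's measurements):

```python
# Illustrative only: overall success under the approximation P(all) ≈ p ** n.
for p in (0.99, 0.95, 0.90, 0.85):
    overall = [f"{p ** n:.2f}" for n in (1, 3, 5, 10)]
    print(f"p = {p:.2f}  ->  n = 1, 3, 5, 10:  {', '.join(overall)}")

# p = 0.90  ->  n = 1, 3, 5, 10:  0.90, 0.73, 0.59, 0.35
```

Even a seemingly strong 90% per-instruction rate leaves only about a 35% chance of satisfying all ten instructions at once.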

Performance Degradation Across All Models

All tested models, including state-of-the-art commercial systems, exhibited significant performance decline with increased instruction complexity. No model maintained consistent performance across the full 1-10 instruction range.

Performance Results

Baseline Performance (Without Self-Refinement)

| Model | Single Instruction | Multiple Instructions (Avg) | Performance Trend |
| --- | --- | --- | --- |
| GPT-4o | ~85% | 15% | Significant decline |
| Claude 3.5 Sonnet | ~90% | 44% | Moderate decline |
| Gemini-1.5 | ~82% | Variable | Significant decline |
| Gemma2 | ~75% | Variable | Severe decline |
| Llama3.1 | ~78% | Variable | Severe decline |

Self-Refinement Improvement Results

| Model | Baseline Success Rate | With Self-Refinement | Improvement |
| --- | --- | --- | --- |
| GPT-4o | 15% | 31% | +16 pp (+107% relative) |
| Claude 3.5 Sonnet | 44% | 58% | +14 pp (+32% relative) |

Self-Refinement Mitigation Strategy

The researchers proposed an inference-time iterative self-refinement technique to improve instruction-following performance:

Self-Refinement Process

  1. Initial Generation: Model generates response to multi-instruction prompt
  2. Self-Evaluation: Model evaluates which instructions were successfully followed
  3. Targeted Refinement: Model regenerates response focusing on unfollowed instructions
  4. Iteration: Process repeats until convergence or maximum iterations
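
A minimal sketch of this loop is shown below, assuming a generic `call_llm` wrapper and a per-instruction `verify` check (which may itself be a model self-evaluation). The prompt wording and stopping criteria here are assumptions, not the paper's exact implementation.

```python
# Sketch of inference-time self-refinement for multi-instruction prompts.
# `call_llm` and `verify` are assumed placeholders for a model API call and
# per-instruction checkers; they are not the paper's implementation.

def self_refine(task_prompt, instructions, call_llm, verify, max_iters=3):
    prompt = task_prompt + "\n" + "\n".join(f"- {i}" for i in instructions)
    response = call_llm(prompt)  # 1. initial generation

    for _ in range(max_iters):
        # 2. self-evaluation: which instructions does the response satisfy?
        unmet = [i for i in instructions if not verify(i, response)]
        if not unmet:
            return response  # 4. stop once every instruction is followed

        # 3. targeted refinement focused on the unmet instructions
        feedback = (
            "Your previous answer did not follow these instructions:\n"
            + "\n".join(f"- {i}" for i in unmet)
            + "\nRevise the answer so that ALL instructions are followed."
        )
        response = call_llm(
            prompt + "\n\nPrevious answer:\n" + response + "\n\n" + feedback
        )
    return response
```

In the paper's setting the evaluation step is performed by the model itself; rule-based verifiers are a natural substitute when the instructions are objectively checkable, as in ManyIFEval.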

Self-Refinement Effectiveness

As the improvement table above shows, self-refinement raised GPT-4o's all-instruction success rate from 15% to 31% and Claude 3.5 Sonnet's from 44% to 58%. The gains are substantial, but neither model returns to its single-instruction level of performance.

Implications for Production Systems

System Design Considerations

Performance Budgeting

For production systems requiring high reliability, the exponential decay relationship provides a framework for performance budgeting: given a measured per-instruction success rate and a target overall success rate, it bounds how many instructions a single prompt can reasonably carry.
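
A minimal sketch of that calculation under the p^n approximation (the function name and thresholds are illustrative, not from the paper):

```python
import math

# Illustrative performance budgeting under the decay model p ** n:
# given a measured per-instruction success rate p and a target overall
# success rate, bound how many instructions one prompt can carry.

def max_instructions(per_instruction_rate: float, target_overall: float) -> int:
    """Largest n such that per_instruction_rate ** n >= target_overall."""
    if not (0 < per_instruction_rate < 1):
        raise ValueError("per-instruction rate must be in (0, 1)")
    return math.floor(math.log(target_overall) / math.log(per_instruction_rate))

print(max_instructions(0.95, 0.80))  # -> 4  (0.95 ** 4 ≈ 0.81)
print(max_instructions(0.90, 0.80))  # -> 2  (0.90 ** 2 = 0.81)
```

If the budget comes out smaller than the number of requirements, that is the signal to split the work across multiple prompts or add a verification step.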

Comparison with Related Work

Multi-Task Inference Research

In contrast, parallel research on multi-task inference (MTI Bench) found that LLMs can actually improve performance on certain kinds of multiple simultaneous tasks:

| Aspect | Curse of Instructions | Multi-Task Inference |
| --- | --- | --- |
| Task type | Multiple constraints on a single output | Multiple independent tasks |
| Performance trend | Declining with count | Improving (7-12%) |
| Speed impact | Neutral to negative | 1.46x faster |
| Key insight | Constraint satisfaction difficulty | Context sharing benefits |

Key Distinction: The curse of instructions applies specifically to constraints that must all be satisfied in a single output, while multi-task inference benefits apply to independent tasks that can leverage shared context.

Technical Analysis

Root Causes of Performance Degradation

  1. Attention Mechanism Limitations: Difficulty maintaining simultaneous focus on multiple constraints
  2. Working Memory Constraints: Implicit limitations in maintaining multiple active requirements
  3. Priority Ambiguity: Lack of explicit prioritization mechanisms for conflicting instructions
  4. Training Distribution: Under-representation of high-complexity multi-instruction examples in training data

Mathematical Characterization

Success_rate(n) ≈ Success_rate(1)^n

where n is the number of instructions attached to a single prompt and Success_rate(1) is the model's success rate on a single instruction.

Future Directions

Research Opportunities

Benchmarking Extensions

Conclusions

This research establishes the "curse of instructions" as a fundamental limitation of current Large Language Models, with significant implications for production system design and reliability engineering. Key takeaways:

  1. Instruction-following performance declines as more instructions are attached to a single prompt, with the all-instruction success rate decaying roughly as the per-instruction rate raised to the power of the instruction count
  2. The effect appears in every model tested, including GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, and Llama3.1
  3. Inference-time self-refinement meaningfully improves, but does not eliminate, the degradation

For developers building AI-powered applications, this research underscores the importance of:

  1. Decomposing complex requirements into sequential simple instructions (see the sketch after this list)
  2. Implementing robust verification mechanisms
  3. Designing for graceful degradation when instruction following is incomplete
  4. Understanding performance budgets based on instruction count
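
As one concrete illustration of the first point, constraints can be applied over sequential calls instead of packing them into a single heavily loaded prompt. This is a sketch under an assumed `call_llm` placeholder, not a method from the paper.

```python
# Illustrative decomposition: apply constraints one at a time over sequential
# calls rather than attaching them all to a single prompt.
# `call_llm` is an assumed placeholder for a model API call.

def sequential_constraints(task_prompt, instructions, call_llm):
    response = call_llm(task_prompt)
    for instruction in instructions:
        response = call_llm(
            "Rewrite the text below so that it also satisfies this instruction: "
            f"{instruction}\n\nText:\n{response}"
        )
    return response
```

This trades latency for reliability, and a later rewrite can still break an earlier constraint, so verification (point 2) remains necessary.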

References

Report compiled for AI Agent Engineering Research Collection

For more resources, visit join.maxpool.dev