Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once
Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Executive Summary
This research reveals a fundamental limitation in Large Language Models (LLMs): their ability to follow instructions deteriorates significantly as the number of simultaneous instructions increases. Through the introduction of ManyIFEval, a comprehensive benchmark dataset with up to 10 objectively verifiable instructions, researchers demonstrate that state-of-the-art models including GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, and Llama3.1 all exhibit declining performance with increased instruction complexity. This phenomenon, termed the "curse of instructions," follows a mathematical pattern where overall success rates decline exponentially with the number of instructions.
Research Context & Motivation
As Large Language Models become increasingly integrated into production systems requiring complex, multi-step task execution, understanding their limitations with multiple simultaneous instructions becomes critical for system reliability and user experience. Previous benchmarks focused on single-instruction scenarios, leaving a gap in understanding how LLMs handle realistic use cases involving multiple constraints and requirements.
Key Contributions
- ManyIFEval Benchmark: A large-scale dataset comprising task prompts with up to 10 objectively verifiable instructions per prompt
- Systematic Evaluation: Comprehensive testing of major LLMs (GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, Llama3.1) across varying instruction counts
- Mathematical Framework: Identification of the "curse of instructions" phenomenon with formal characterization
- Mitigation Strategy: Inference-time self-refinement approach to improve instruction-following performance
Methodology
Benchmark Design: ManyIFEval
- Scale: Large-scale dataset with systematic variation of instruction count
- Verification: Objectively verifiable instructions enabling automated evaluation
- Range: 1-10 simultaneous instructions per prompt
- Diversity: Multiple instruction types and task domains
Tested Models
- GPT-4o - OpenAI's latest multimodal model
- Claude-3.5 Sonnet - Anthropic's advanced reasoning model
- Gemini-1.5 - Google's latest generation model
- Gemma2 - Google's open-source model family
- Llama3.1 - Meta's latest open-source model
Evaluation Approach
- Baseline performance measurement across all models
- Systematic variation of instruction count (1-10)
- Individual instruction success rate tracking
- Overall task completion rate measurement
- Self-refinement intervention testing
Key Findings
The Curse of Instructions Phenomenon
As instruction count increases, models' ability to follow individual instructions deteriorates. The overall success rate follows a mathematical relationship:
P(all instructions followed) = P(individual instruction)^n
where n is the total number of instructions. This exponential decay means even small decreases in individual instruction following rates lead to dramatic drops in overall task success.
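The compounding effect can be sketched in a few lines; this is an illustrative calculation under the paper's independence assumption, not code from the paper:

```python
# Illustrative sketch: overall success decays as p**n when each of the
# n instructions is followed independently with probability p.

def overall_success(p: float, n: int) -> float:
    """Probability that all n instructions are followed,
    assuming each is followed independently with probability p."""
    return p ** n

# Even a 90% per-instruction rate collapses quickly as n grows:
for n in (1, 3, 5, 10):
    print(f"n={n:2d}: {overall_success(0.90, n):.3f}")
# n= 1: 0.900
# n= 3: 0.729
# n= 5: 0.590
# n=10: 0.349
```

At ten instructions, a model that follows each instruction 90% of the time satisfies all of them barely a third of the time, which is why even small per-instruction gains matter.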
Performance Degradation Across All Models
All tested models, including state-of-the-art commercial systems, exhibited significant performance decline with increased instruction complexity. No model maintained consistent performance across the full 1-10 instruction range.
Performance Results
Baseline Performance (Without Self-Refinement)
| Model | Single Instruction | Multiple Instructions (Avg) | Performance Trend |
|---|---|---|---|
| GPT-4o | ~85% | 15% | Significant decline |
| Claude 3.5 Sonnet | ~90% | 44% | Moderate decline |
| Gemini-1.5 | ~82% | Variable | Significant decline |
| Gemma2 | ~75% | Variable | Severe decline |
| Llama3.1 | ~78% | Variable | Severe decline |
Self-Refinement Improvement Results
| Model | Baseline Success Rate | With Self-Refinement | Improvement |
|---|---|---|---|
| GPT-4o | 15% | 31% | +107% relative (+16 pp) |
| Claude 3.5 Sonnet | 44% | 58% | +32% relative (+14 pp) |
Self-Refinement Mitigation Strategy
The researchers proposed an inference-time iterative self-refinement technique to improve instruction-following performance:
Self-Refinement Process
- Initial Generation: Model generates response to multi-instruction prompt
- Self-Evaluation: Model evaluates which instructions were successfully followed
- Targeted Refinement: Model regenerates response focusing on unfollowed instructions
- Iteration: Process repeats until convergence or maximum iterations
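The four steps above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: `generate` stands in for an LLM call, and `checkers` stand in for ManyIFEval's objective per-instruction verifiers.

```python
# Minimal sketch of an iterative self-refinement loop.
# `generate` and `checkers` are hypothetical stand-ins for a real
# LLM call and per-instruction verifiers, respectively.

from typing import Callable, List

def self_refine(
    prompt: str,
    instructions: List[str],
    checkers: List[Callable[[str], bool]],
    generate: Callable[[str], str],
    max_iters: int = 3,
) -> str:
    # Step 1: initial generation.
    response = generate(prompt)
    for _ in range(max_iters):
        # Step 2: evaluate which instructions the response failed.
        failed = [ins for ins, ok in zip(instructions, checkers)
                  if not ok(response)]
        if not failed:
            break  # Step 4: converged -- every instruction verified.
        # Step 3: regenerate, explicitly pointing at unmet instructions.
        feedback = ("Revise your answer. Unmet instructions:\n"
                    + "\n".join(failed))
        response = generate(prompt + "\n\n" + feedback)
    return response
```

Note that this loop relies on having verifiers at inference time; in ManyIFEval the instructions are objectively checkable, but in open-ended tasks the evaluation step would itself be an LLM call and therefore fallible.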
Self-Refinement Effectiveness
- GPT-4o: 107% relative improvement (15% → 31%)
- Claude 3.5 Sonnet: 32% relative improvement (44% → 58%)
- Limitation: While significant, self-refinement doesn't fully solve the curse of instructions
- Cost Trade-off: Requires multiple inference passes, increasing computational cost
Implications for Production Systems
System Design Considerations
- Instruction Decomposition: Break complex multi-instruction prompts into sequential single-instruction calls
- Verification Layers: Implement automated verification of instruction compliance
- Fallback Strategies: Design systems with graceful degradation for incomplete instruction following
- User Interface Design: Limit simultaneous instruction complexity in user-facing applications
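The decomposition idea in the first bullet can be sketched as a chain of single-constraint revision calls. This is a hypothetical pattern, not from the paper; `generate` stands in for an LLM API call.

```python
# Sketch of instruction decomposition: rather than one prompt carrying
# n constraints, apply each constraint in its own revision pass.
# `generate` is a hypothetical stand-in for an LLM call.

from typing import Callable, List

def apply_sequentially(task: str, instructions: List[str],
                       generate: Callable[[str], str]) -> str:
    draft = generate(task)  # unconstrained first draft
    for ins in instructions:
        # One constraint per call keeps each step in the regime
        # where per-instruction success rates are highest.
        draft = generate(
            "Rewrite the text below so that it satisfies this single "
            f"constraint: {ins}\n\n{draft}"
        )
    return draft
```

The trade-off is cost and latency: n constraints now cost n+1 calls, and later passes can undo constraints enforced earlier, so verification layers remain necessary.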
Performance Budgeting
For production systems requiring high reliability, the exponential decay formula provides a framework for performance budgeting:
- Target reliability: 95% → Maximum ~2 instructions (assuming 97.5% individual success rate)
- Target reliability: 90% → Maximum ~3 instructions (assuming 96.5% individual success rate)
- Target reliability: 80% → Maximum ~4-5 instructions (assuming 95% individual success rate)
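The budgets above follow from inverting the decay formula: the largest n with p^n >= target is floor(log(target) / log(p)). A small helper makes the calculation explicit (illustrative, under the same independence assumption):

```python
import math

def max_instructions(target: float, per_instruction: float) -> int:
    """Largest n such that per_instruction**n >= target, under the
    independence assumption Success_rate(n) = Success_rate(1)**n."""
    return math.floor(math.log(target) / math.log(per_instruction))

print(max_instructions(0.95, 0.975))  # -> 2
print(max_instructions(0.80, 0.95))   # -> 4
```

Because the floor is strict, borderline cases round down: a 96.5% per-instruction rate yields 0.965³ ≈ 0.899, just under a 90% target, which is why that row is stated as "~3".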
Comparison with Related Work
Multi-Task Inference Research
Parallel research on multi-task inference (MTI Bench) reports a contrasting finding: LLMs can actually improve when handling certain types of multiple simultaneous tasks:
| Aspect | Curse of Instructions | Multi-Task Inference |
|---|---|---|
| Task type | Multiple constraints on a single output | Multiple independent tasks |
| Performance trend | Declining with instruction count | Improving (7-12%) |
| Speed impact | Neutral to negative | 1.46x faster |
| Key insight | Constraint satisfaction difficulty | Context-sharing benefits |
Key Distinction: The curse of instructions applies specifically to constraints that must all be satisfied in a single output, while multi-task inference benefits apply to independent tasks that can leverage shared context.
Technical Analysis
Root Causes of Performance Degradation
- Attention Mechanism Limitations: Difficulty maintaining simultaneous focus on multiple constraints
- Working Memory Constraints: Implicit limitations in maintaining multiple active requirements
- Priority Ambiguity: Lack of explicit prioritization mechanisms for conflicting instructions
- Training Distribution: Under-representation of high-complexity multi-instruction examples in training data
Mathematical Characterization
Success_rate(n) ≈ Success_rate(1)^n
Where:
- n = number of instructions
- Success_rate(1) = success rate on individual instructions
- Success_rate(n) = probability of following all n instructions
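The relationship can also be run in reverse to back out the per-instruction rate implied by an observed overall rate. This is an illustrative calculation, and the 15%-at-n=10 pairing below is an assumption for the example (the reported 15% is a multi-instruction average, not necessarily the n=10 point):

```python
def implied_per_instruction(overall: float, n: int) -> float:
    """Invert Success_rate(n) = Success_rate(1)**n to estimate the
    per-instruction rate implied by an observed overall rate."""
    return overall ** (1.0 / n)

# A 15% overall rate at n=10 would imply roughly an 83% per-instruction
# rate -- a modest-looking drop that compounds into near-total failure.
print(round(implied_per_instruction(0.15, 10), 2))  # -> 0.83
```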
Future Directions
Research Opportunities
- Architectural Improvements: Explicit instruction-tracking mechanisms in model architectures
- Training Methodologies: Curriculum learning with progressive instruction complexity
- Prompt Engineering: Optimal instruction formatting and ordering strategies
- Hybrid Approaches: Combining LLMs with symbolic constraint solvers
Benchmarking Extensions
- Domain-specific instruction sets (code generation, creative writing, data analysis)
- Instruction conflict resolution scenarios
- Long-context multi-instruction following
- Multi-modal instruction following
Conclusions
This research establishes the "curse of instructions" as a fundamental limitation in current Large Language Models, with significant implications for production system design and reliability engineering. Key takeaways:
- Universal Limitation: All tested models, including state-of-the-art systems, exhibit performance degradation with increased instruction complexity
- Exponential Decay: Success rates follow a mathematical relationship that compounds individual instruction following rates
- Partial Mitigation: Self-refinement techniques provide meaningful improvements but don't eliminate the fundamental limitation
- System Design Impact: Production systems should be architected with explicit consideration of multi-instruction limitations
For developers building AI-powered applications, this research underscores the importance of:
- Decomposing complex requirements into sequential simple instructions
- Implementing robust verification mechanisms
- Designing for graceful degradation when instruction following is incomplete
- Understanding performance budgets based on instruction count
References
- Harada, K., Yamazaki, Y., Taniguchi, M., Kojima, T., Iwasawa, Y., & Matsuo, Y. "Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once." OpenReview.
- Related: Multi-Task Inference research (MTI Bench) on simultaneous task processing benefits
- ManyIFEval Benchmark Dataset (https://openreview.net/forum?id=R6q67CDBCH)
Report compiled for AI Agent Engineering Research Collection