Ilya's Favorite Papers
A Curated Learning Path for AI
Curated by Ilya Sutskever (Former Chief Scientist, OpenAI)
Recommended to John Carmack in 2019
About This Collection
In 2019, Ilya Sutskever, then Chief Scientist at OpenAI and previously a research scientist at Google Brain, sent computer scientist John Carmack a carefully curated list of roughly 40 papers for learning AI. The collection amounts to a masterclass in AI education from one of the field's most influential researchers.
These papers span the foundations of deep learning, from convolutional networks and recurrent architectures to attention mechanisms and transformers. They also venture into information theory, complexity science, and philosophical foundations of artificial intelligence. Together, they form a comprehensive curriculum for understanding modern AI.
🎯 How to Use This Guide
This collection is organized into thematic sections, progressing from foundational concepts to advanced techniques and theoretical principles. Each paper includes a brief summary and key learnings to help you extract the most important insights. Whether you're a beginner or experienced practitioner, you can follow the path sequentially or jump to sections that interest you most.
Foundations: Building Blocks of Deep Learning
Start here to understand the core concepts that revolutionized computer vision and established deep learning as a dominant paradigm. These papers demonstrate how neural networks learn hierarchical representations and why depth matters.
CS231n: Convolutional Neural Networks for Visual Recognition
Stanford University Course
Course Website
Stanford's flagship deep learning course covering convolutional neural networks from first principles. Provides hands-on implementation experience with backpropagation, optimization, and modern architectures.
Key Learnings:
- How convolutions capture spatial hierarchies in images
- Backpropagation mechanics and computational graphs
- Practical training techniques: batch normalization, dropout, data augmentation
- Architecture patterns: VGG, ResNet, and their design principles
ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012)
Paper
AlexNet—the breakthrough that launched the deep learning revolution. Won ImageNet 2012 by a massive margin, demonstrating that deep CNNs could dramatically outperform traditional computer vision methods.
Key Learnings:
- GPU acceleration made training large networks practical (five to six days on two GTX 580 GPUs)
- ReLU activation enables faster training than sigmoid/tanh
- Dropout prevents overfitting in large networks
- Data augmentation (translations, reflections) improves generalization
- Depth matters: 8 layers achieved unprecedented accuracy
Understanding LSTM Networks
Christopher Olah (2015)
Blog Post
The definitive visual explanation of Long Short-Term Memory networks. Breaks down the architecture's gates and cell states with intuitive diagrams, making complex concepts accessible.
Key Learnings:
- The vanishing gradient problem limits standard RNNs to short-term dependencies
- LSTM's cell state acts as a "memory highway" preserving information across time
- Forget gate decides what information to discard from cell state
- Input gate controls what new information to store
- Output gate determines what to output based on cell state
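To make the gates concrete, here is a minimal NumPy sketch of a single LSTM step; the function name, weight layout, and shapes are illustrative choices, not code from the post.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4*H, D+H); the four row blocks are the gates."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0*H:1*H])      # forget gate: what to discard from the cell state
    i = sigmoid(z[1*H:2*H])      # input gate: what new information to store
    g = np.tanh(z[2*H:3*H])      # candidate cell contents
    o = sigmoid(z[3*H:4*H])      # output gate: what to expose from the cell state
    c = f * c_prev + i * g       # cell state: the "memory highway" across time
    h = o * np.tanh(c)           # hidden state emitted at this step
    return h, c

# Toy usage: D=3 input features, H=2 hidden units, random weights
rng = np.random.default_rng(0)
D, H = 3, 2
W, b = rng.normal(size=(4 * H, D + H)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```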
Core Architectures: Modern Network Design
These papers introduced architectural innovations that became standard practice. ResNets solved the degradation problem in very deep networks, while dilated convolutions enabled efficient receptive field expansion.
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015)
ArXiv
ResNets revolutionized deep learning by introducing skip connections, enabling networks with 100+ layers. Won ImageNet 2015 and became the most influential architecture of the 2010s.
Key Learnings:
- The degradation in very deep plain networks is an optimization problem, not overfitting: stacked layers struggle even to learn identity mappings
- Skip connections (residual connections) let gradients flow directly to earlier layers
- Learning residuals F(x) instead of H(x) is easier: H(x) = F(x) + x
- Enabled training networks 8× deeper (152 vs 19 layers) with lower error
- Bottleneck design (1×1, 3×3, 1×1) reduces parameters while maintaining performance
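A minimal sketch of the residual idea, with a plain function standing in for the convolutional branch; everything here is illustrative rather than the paper's actual architecture.

```python
import numpy as np

def residual_block(x, branch):
    """Compute H(x) = F(x) + x: the branch only has to learn the residual F(x),
    and the "+ x" skip connection gives gradients a direct path backwards."""
    return branch(x) + x

# With the branch initialized near zero, the block starts out as an identity mapping,
# which is exactly what very deep plain networks struggle to learn.
relu = lambda v: np.maximum(v, 0.0)
W = np.zeros((4, 4))                      # near-zero weights for illustration
branch = lambda v: relu(W @ v)            # stand-in for a conv -> BN -> ReLU stack
x = np.arange(4.0)
print(residual_block(x, branch))          # ~ x itself
```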
Identity Mappings in Deep Residual Networks
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2016)
ArXiv
Follow-up analysis revealing that "clean" identity paths (pre-activation ResNets) enable even deeper and more accurate networks by ensuring unimpeded gradient flow.
Key Learnings:
- Pre-activation (BN-ReLU-Conv) outperforms post-activation (Conv-BN-ReLU)
- Identity shortcuts must be "clean"—no activations or normalization on skip connections
- Enables training networks exceeding 1000 layers
- Propagation analysis shows that clean identity paths let signals flow unimpeded in both the forward and backward pass
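A schematic comparison of the two orderings, written with PyTorch modules purely for illustration (channel sizes are arbitrary); this is not the authors' code.

```python
import torch.nn as nn

# Post-activation branch from the original ResNet: Conv -> BN -> ReLU
post_act = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU())

# Pre-activation branch from this paper: BN -> ReLU -> Conv,
# leaving the identity skip path completely untouched
pre_act = nn.Sequential(
    nn.BatchNorm2d(64), nn.ReLU(), nn.Conv2d(64, 64, kernel_size=3, padding=1))
```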
Multi-Scale Context Aggregation by Dilated Convolutions
Fisher Yu, Vladlen Koltun (2015)
ArXiv
Introduced dilated (atrous) convolutions for dense prediction tasks. Enables exponential receptive field expansion without losing resolution or adding parameters.
Key Learnings:
- Dilated convolutions insert "holes" to increase receptive field without pooling
- Maintains spatial resolution for dense prediction (segmentation, detection)
- Multi-scale context aggregation improves boundary delineation
- Became standard in semantic segmentation architectures
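A quick back-of-the-envelope sketch of how stacked dilated 3×3 convolutions grow the receptive field; the helper below uses the standard receptive-field arithmetic, with dilation rates chosen for illustration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 convolutions with the given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d    # each layer adds (k - 1) * dilation pixels of context
    return rf

# Doubling the dilation at each layer grows the receptive field exponentially
# while the parameter count per layer stays constant and no resolution is lost.
print(receptive_field(3, [1, 2, 4, 8]))    # -> 31
```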
Attention & Transformers: The Revolution in Sequence Modeling
Attention mechanisms fundamentally changed how models process sequences. These papers trace the evolution from additive attention in NMT to the self-attention of Transformers—the architecture that powers modern LLMs.
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2014)
ArXiv
Introduced attention mechanisms for sequence-to-sequence models. Solved the bottleneck problem where encoders had to compress entire sentences into fixed-size vectors.
Key Learnings:
- Attention lets decoders focus on relevant encoder states at each step
- Alignment weights are learned, not hand-coded
- Dramatically improves translation quality on long sentences
- Attention weights provide interpretability—shows what model "looks at"
- Foundation for all subsequent attention mechanisms
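A minimal NumPy sketch of one additive-attention step in the spirit of the paper; the matrix names and sizes are placeholder choices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, enc_states, W_s, W_h, v):
    """Score each encoder state against the decoder state, then take a weighted sum."""
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h) for h in enc_states])
    alpha = softmax(scores)            # learned alignment weights (interpretable)
    context = alpha @ enc_states       # context vector fed to the decoder
    return context, alpha

# Toy shapes: 5 encoder states of size 4, decoder state of size 4, attention size 8
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 4))
context, alpha = additive_attention(rng.normal(size=4), enc,
                                    rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
                                    rng.normal(size=8))
```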
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. (2017)
ArXiv
The Transformer architecture—arguably the most important paper in modern AI. Replaced recurrence with self-attention, enabling parallel training and scaling to billions of parameters.
Key Learnings:
- Self-attention computes relationships between all positions in parallel
- Multi-head attention captures different types of relationships
- Positional encodings inject sequence order information
- Layer normalization and residual connections stabilize deep models
- Achieved SOTA on translation at a small fraction of the training cost of the previous best models
- Became foundation for BERT, GPT, and all modern LLMs
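A minimal single-head self-attention sketch in NumPy following the paper's scaled dot-product formula; the random projection matrices are placeholders, and multi-head attention, masking, and positional encodings are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, over all positions at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # every position attends to every other
    return weights @ V

# Toy example: a sequence of 6 tokens with model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
out = self_attention(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
```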
The Annotated Transformer
Sasha Rush, et al. (2018)
Blog
Code
Line-by-line implementation guide to the Transformer paper. Combines paper walkthrough with working PyTorch code, making the architecture accessible and reproducible.
Key Learnings:
- Attention is softmax(QK^T/√d_k) applied to V: scaled dot products, a softmax, then a weighted sum of the values
- Feed-forward networks apply the same transformation to each position independently
- Label smoothing and warmup learning rate schedules improve training
- Practical implementation details often omitted from papers
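As one example of those practical details, here is the warmup-then-decay learning-rate schedule from the original Transformer paper, which the annotated implementation also walks through; this sketch is mine, with the default values from the paper.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5):
    linear warmup for `warmup` steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)                      # avoid step = 0 at initialization
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(transformer_lr(100), transformer_lr(4000), transformer_lr(100000))
```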
Recurrent Networks: Sequential Processing and Memory
Before Transformers dominated, RNNs were the workhorse for sequential data. These papers explore their capabilities, limitations, and extensions with external memory.
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy (2015)
Blog
Code
Influential blog post demonstrating the surprising capabilities of character-level RNNs. Shows how simple models can learn complex structures like code syntax and LaTeX formatting.
Key Learnings:
- Character-level models can generate syntactically valid code and markup
- RNNs learn hierarchical structure without any explicit structural supervision
- Temperature parameter controls generation randomness vs coherence
- Visualizing activations reveals learned representations (quote detection, line length, etc.)
- A simple architecture plus lots of data often beats more complex approaches
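A minimal sketch of the temperature knob mentioned above, applied to a model's output logits; the numbers are toy values, not anything from the post.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Low temperature -> conservative, repetitive text; high temperature -> diverse but noisier."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.2, -1.0]   # toy next-character scores
print([sample_with_temperature(logits, t, rng) for t in (0.2, 1.0, 2.0)])
```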
Recurrent Neural Network Regularization
Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals (2014)
ArXiv
Code
Demonstrates that dropout in RNNs should only be applied to non-recurrent connections. Established best practices for regularizing recurrent models.
Key Learnings:
- Standard dropout on recurrent connections hurts performance
- Apply dropout only to input→hidden and hidden→output connections
- Enables training large LSTMs (1500 hidden units) without overfitting
- Achieved state-of-the-art on Penn Treebank language modeling
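A small PyTorch sketch of where the dropout goes under this scheme; the layer sizes are arbitrary, and PyTorch's built-in between-layer dropout is used as a convenient stand-in for the paper's non-recurrent dropout.

```python
import torch
import torch.nn as nn

class RegularizedLSTM(nn.Module):
    """Dropout on the non-recurrent connections only (input -> hidden, between layers,
    hidden -> output); the recurrent hidden-to-hidden path is left untouched."""
    def __init__(self, vocab_size, emb=128, hidden=256, p=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.drop = nn.Dropout(p)
        # nn.LSTM's `dropout` argument acts only between stacked layers,
        # never inside the recurrence, which matches the paper's prescription.
        self.lstm = nn.LSTM(emb, hidden, num_layers=2, dropout=p, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        x = self.drop(self.embed(tokens))      # dropout on the input connections
        out, _ = self.lstm(x)
        return self.head(self.drop(out))       # dropout before the output projection

logits = RegularizedLSTM(vocab_size=1000)(torch.randint(0, 1000, (4, 20)))
```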
Neural Turing Machines
Alex Graves, Greg Wayne, Ivo Danihelka (2014)
ArXiv
Extends neural networks with external memory that can be read from and written to via differentiable attention. Learns algorithmic tasks like sorting and copying through gradient descent.
Key Learnings:
- Neural networks can learn algorithm-like behaviors if given appropriate memory structures
- Content-based and location-based addressing enable flexible memory access
- Differentiable attention makes external memory trainable with backprop
- Successfully learns copy, sort, and associative recall tasks
- Inspired later memory-augmented architectures (Differentiable Neural Computer)
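A minimal sketch of content-based addressing, the core of an NTM read head: compare a key against every memory row and take a soft, differentiable read. The key-strength parameter β follows the paper's description; the code itself is illustrative.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def content_read(memory, key, beta=5.0):
    """Softmax over cosine similarities gives soft attention weights over memory rows;
    the read is a convex combination, so the whole operation is differentiable."""
    sims = np.array([cosine(row, key) for row in memory])
    w = np.exp(beta * sims)
    w /= w.sum()
    return w @ memory, w

memory = np.eye(4)                                   # toy 4-slot memory with 4-dim rows
read_vec, weights = content_read(memory, key=np.array([1.0, 0.0, 0.0, 0.0]))
```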
Advanced Architectures: Specialized Neural Modules
These papers introduce specialized architectures for structured data, relational reasoning, and complex input/output mappings.
Pointer Networks
Oriol Vinyals, Meire Fortunato, Navdeep Jaitly (2015)
ArXiv
Introduces an attention mechanism for variable-length output dictionaries: instead of predicting from a fixed vocabulary, the network "points" to input positions.
Key Learnings:
- Attention can serve as output mechanism, not just for reading encoder states
- Solves combinatorial problems: convex hull, Delaunay triangulation, TSP
- Output dictionary size adapts to input length
- Foundation for copy mechanisms in seq2seq models
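A minimal sketch of a single "pointing" step: the attention scores over input positions are themselves the output distribution, so the output vocabulary automatically has the same size as the input. Matrix names and sizes are placeholders.

```python
import numpy as np

def pointer_step(dec_state, enc_states, W1, W2, v):
    """Score each input position against the decoder state; the softmax over those
    scores is the output distribution (no fixed output vocabulary needed)."""
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state) for e in enc_states])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs     # index of the input element being pointed at

rng = np.random.default_rng(0)
enc = rng.normal(size=(7, 4))               # 7 input elements -> 7 possible outputs
idx, probs = pointer_step(rng.normal(size=4), enc,
                          rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
                          rng.normal(size=8))
```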
Neural Message Passing for Quantum Chemistry
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, George E. Dahl (2017)
ArXiv
Unifies various graph neural network architectures under a message passing framework. Demonstrates effectiveness on molecular property prediction.
Key Learnings:
- Graph convolutions generalize CNNs to irregular structures
- Message passing: nodes aggregate information from neighbors iteratively
- Unified framework helps understand GNN variants (GCN, GraphSAGE, etc.)
- Achieved competitive results on quantum chemistry benchmarks
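A minimal sketch of one message-passing round on a toy graph; the message and update functions are simplified linear/tanh placeholders rather than any of the paper's specific MPNN variants.

```python
import numpy as np

def message_passing_round(h, edges, W_msg, W_upd):
    """Each node sums messages from its in-neighbors, then updates its own state."""
    agg = [np.zeros_like(v) for v in h]
    for src, dst in edges:                    # directed edge src -> dst
        agg[dst] = agg[dst] + W_msg @ h[src]  # message function (here: a linear map)
    return [np.tanh(W_upd @ (h[i] + agg[i])) for i in range(len(h))]

# Toy triangle graph: 3 nodes with 4-dim states, edges in both directions
rng = np.random.default_rng(0)
h = [rng.normal(size=4) for _ in range(3)]
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 0), (0, 2)]
h = message_passing_round(h, edges, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
```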
A simple neural network module for relational reasoning
Adam Santoro, David Raposo, David G.T. Barrett, et al. (2017)
ArXiv
Relation Networks (RNs) explicitly compute relations between all pairs of objects. Achieves superhuman performance on CLEVR visual reasoning benchmark.
Key Learnings:
- Relational reasoning requires comparing all object pairs
- Simple architecture: g(o_i, o_j) applied to all pairs, then aggregated
- Dramatically outperforms CNNs on relational tasks
- Demonstrates importance of architectural inductive biases
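A minimal sketch of the RN computation, RN(O) = f(Σ over pairs of g(o_i, o_j)); the toy g and f below are linear/tanh stand-ins for the paper's small MLPs.

```python
import numpy as np
from itertools import permutations

def relation_network(objects, g, f):
    """Apply g to every ordered pair of objects, sum the results, then apply f."""
    pair_sum = sum(g(np.concatenate([oi, oj])) for oi, oj in permutations(objects, 2))
    return f(pair_sum)

rng = np.random.default_rng(0)
Wg, Wf = rng.normal(size=(8, 8)), rng.normal(size=(2, 8))
g = lambda pair: np.tanh(Wg @ pair)     # relation function over one object pair
f = lambda s: Wf @ s                    # aggregator over the summed relations
objects = [rng.normal(size=4) for _ in range(5)]
print(relation_network(objects, g, f))
```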
Relational recurrent neural networks
Adam Santoro, Ryan Faulkner, David Raposo, et al. (2018)
ArXiv
Extends RNNs with relational memory core that performs attention-like operations over memory slots. Improves performance on tasks requiring flexible memory access.
Key Learnings:
- Memory-as-attention: each memory slot attends to all others
- Enables flexible binding and retrieval of relational information
- Outperforms LSTMs on tasks requiring relational reasoning over time (Nth-farthest, program evaluation, Mini PacMan)
- Shows benefits of structured memory over single hidden state
Training & Scaling: Making Large Models Practical
As models grew larger, new training techniques became essential. These papers address parallelism, stability, and curriculum learning for complex tasks.
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, et al. (2019)
ArXiv
Efficient pipeline parallelism library enabling training of very large models across multiple accelerators. Achieved state-of-the-art on ImageNet and translation.
Key Learnings:
- Pipeline parallelism splits the model into stages across devices and streams micro-batches through them to keep all devices busy
- Re-computation during backward pass trades computation for memory
- Enabled training models 25× larger than memory-constrained single-device limit
- Trained 557M parameter AmoebaNet on ImageNet (84.3% top-1 accuracy)
- Nearly linear speedup with up to 8 accelerators
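A toy calculation of why micro-batches matter: with a naive GPipe-style forward schedule (micro-batch m reaches stage s at step s + m), more micro-batches per mini-batch shrink the idle "bubble". This ignores the backward pass and is only meant to illustrate the scaling.

```python
def pipeline_utilization(num_stages, num_microbatches):
    """Fraction of device-steps doing useful work in a simple forward-only pipeline."""
    steps = num_stages + num_microbatches - 1     # step at which the last micro-batch exits
    busy = num_stages * num_microbatches          # total useful stage executions
    return busy / (num_stages * steps)

for m in (1, 4, 32):
    print(f"{m:2d} micro-batches on 4 stages -> utilization {pipeline_utilization(4, m):.2f}")
```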
Order Matters: Sequence to sequence for sets
Oriol Vinyals, Samy Bengio, Manjunath Kudlur (2015)
ArXiv
Demonstrates that seq2seq models can learn to handle sets (unordered inputs) by training on multiple random permutations. Solves problems like sorting through learned attention patterns.
Key Learnings:
- Neural networks can learn permutation invariance through data augmentation
- Read-process-write framework with attention handles variable-size inputs/outputs
- Successfully learns sorting and TSP-like problems
- The order in which inputs are read and outputs are generated significantly affects how well the model learns
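A toy illustration of the data-side trick: present the same set-to-sequence example under several input orderings so the model is pushed toward order invariance. Entirely illustrative.

```python
import random

def permuted_copies(input_set, n_copies=4, seed=0):
    """Pair several random orderings of the same input set with one canonical target."""
    random.seed(seed)
    target = sorted(input_set)
    examples = []
    for _ in range(n_copies):
        xs = list(input_set)
        random.shuffle(xs)
        examples.append((xs, target))
    return examples

print(permuted_copies({7, 2, 9, 4}))
```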
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, et al. (2015)
ArXiv
Large-scale production speech recognition system using end-to-end deep learning. Demonstrates that same architecture works across languages with minimal changes.
Key Learnings:
- End-to-end learning (audio → text) outperforms pipeline approaches
- RNNs + CTC loss enable training without phoneme-level alignment
- Data scale matters: 11,940 hours of labeled speech
- Batch normalization critical for training stability
- Same architecture achieves SOTA in both English and Mandarin
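To see why no frame-level alignment is needed, here is the CTC collapsing rule in miniature: many per-frame label paths map to the same transcript, and training marginalizes over all of them. The example string is made up.

```python
def ctc_collapse(path, blank="_"):
    """Merge repeated symbols, then drop blanks: the CTC many-to-one mapping."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse("hh_e_ll_lloo"))   # -> "hello"
```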
Theory & Principles: Learning, Compression, and Generalization
These papers explore theoretical foundations: how to balance model complexity with data fit, the connection between compression and learning, and empirical laws governing neural network scaling.
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Geoffrey E. Hinton, Drew van Camp (1993)
Paper
Foundational work connecting information theory to neural network learning. Proposes Minimum Description Length (MDL) principle for regularization.
Key Learnings:
- Good models compress both model and data efficiently
- MDL principle: minimize description length of weights + description length of data given weights
- Provides principled foundation for regularization
- Connects to Bayesian inference and PAC learning
- Influenced modern compression-based approaches to generalization
Variational Lossy Autoencoder
Xi Chen, Diederik P. Kingma, Tim Salimans, et al. (2016)
ArXiv
Addresses "posterior collapse" problem in VAEs where decoder ignores latent code. Introduces learnable rate control and improved optimization.
Key Learnings:
- Posterior collapse occurs when decoder is too powerful—ignores latent variables
- Restricting the autoregressive decoder to a local receptive field forces global structure into the latent code
- Balancing reconstruction and regularization critical for meaningful representations
- Demonstrates importance of optimization details in generative models
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, et al. (2020)
ArXiv
Empirical study revealing smooth power-law relationships between model performance and scale (parameters, data, compute). Fundamentally changed how AI labs think about model development.
Key Learnings:
- Performance scales as power law with model size: L(N) ∝ N^(-α)
- Similar power laws hold for dataset size and compute
- Model size and dataset size should scale together; increasing one without the other is suboptimal
- Compute-efficient training uses very large models stopped significantly short of convergence
- Larger models are more sample-efficient, reaching the same loss with fewer training examples
- Provided blueprint for GPT-3 and subsequent large language models
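A tiny sketch of the fitted power law for model size, using the approximate constants reported in the paper (α_N ≈ 0.076, N_c ≈ 8.8×10^13 non-embedding parameters); treat the exact numbers as illustrative.

```python
def loss_vs_params(n_params, alpha_n=0.076, n_c=8.8e13):
    """L(N) = (N_c / N)^alpha_N: test loss as a power law in non-embedding parameters."""
    return (n_c / n_params) ** alpha_n

# Each 10x increase in parameters multiplies the predicted loss by roughly the same factor.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss ~ {loss_vs_params(n):.2f}")
```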
A Tutorial Introduction to the Minimum Description Length Principle
Peter Grünwald (2004)
ArXiv
Comprehensive introduction to MDL principle for model selection. Connects information theory, statistics, and machine learning through the lens of compression.
Key Learnings:
- MDL principle: best model minimizes total description length (model + data|model)
- Provides unified framework for comparing different model classes
- Naturally handles model complexity vs fit tradeoff
- Two-part code MDL vs sophisticated MDL approaches
- Connections to Kolmogorov complexity and algorithmic information theory
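In symbols, the two-part (crude) MDL criterion described above; the notation is standard rather than quoted verbatim from the tutorial.

```latex
% Choose the hypothesis H that minimizes the total description length of model plus data
\mathrm{MDL}(H) \;=\; \underbrace{L(H)}_{\text{bits to describe the model}}
\;+\; \underbrace{L(D \mid H)}_{\text{bits to describe the data given the model}}
```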
Information Theory & Complexity: Fundamental Limits and Philosophical Foundations
These works explore deep questions about information, complexity, and computation. They provide philosophical grounding and connect AI to broader scientific principles.
Kolmogorov Complexity and Algorithmic Randomness
A. Shen, V. A. Uspensky, N. Vereshchagin (2017)
Book
Comprehensive textbook on Kolmogorov complexity—the length of the shortest program that produces a given string. Provides formal foundation for concepts of information and randomness.
Key Learnings:
- Kolmogorov complexity K(x) = length of shortest program producing x
- Incompressible strings are algorithmically random
- K(x) is uncomputable but provides theoretical foundation
- Relates to information theory: K(x) ≈ entropy in many cases
- Provides precise definitions of "simplicity" and "pattern"
- Minimum description length is practical approximation of K(x)
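The definition in symbols, relative to a fixed universal machine U (standard notation, not quoted from the book):

```latex
K(x) \;=\; \min \{\, |p| \;:\; U(p) = x \,\}
```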
The First Law of Complexodynamics
Scott Aaronson (2011)
Blog
Playful yet profound exploration of how entropy and complexity evolve in physical systems. Discusses why complexity increases then decreases over time.
Key Learnings:
- "Complexity increases, then decreases"—systems evolve from simple to complex to simple
- Connects thermodynamics, computation, and cosmology
- Early universe: low entropy, low complexity
- Middle phase: structure formation increases complexity
- Heat death: maximum entropy, zero complexity
- Provides intuition for why interesting structure exists temporarily
Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
Scott Aaronson, Sean M. Carroll, Lauren Ouellette (2014)
ArXiv
Concrete model (coffee mixing in cream) demonstrating how complexity evolves in thermodynamic systems. Uses cellular automata to make abstract concepts precise.
Key Learnings:
- Complexity can be quantified using various measures (entropy, pattern length)
- Coffee automaton provides toy model of thermodynamic evolution
- Initial state: low entropy, low complexity (separated coffee and cream)
- Intermediate: high complexity (swirling patterns)
- Final: high entropy, low complexity (uniform mixture)
- Illustrates why structure formation is transient phenomenon
Machine Super Intelligence
Shane Legg (2008)
PhD Thesis
Shane Legg's PhD thesis exploring formal definitions of intelligence and paths to machine superintelligence. Legg would go on to co-found DeepMind with Demis Hassabis.
Key Learnings:
- Universal intelligence defined via performance across all computable environments
- AIXI: theoretical optimal agent (but incomputable)
- Intelligence requires balancing exploration vs exploitation
- Takeoff scenarios: slow vs fast paths to superintelligence
- Safety considerations for advanced AI systems
- Established theoretical framework that influenced AI safety research
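The Legg-Hutter universal intelligence measure at the heart of the thesis, stated in symbols (standard notation; E is the set of computable environments and V^π_μ is the agent's expected value in environment μ):

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```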
Why This Collection Matters
Ilya Sutskever's paper selection reveals his philosophy on learning AI: start with solid foundations (CNNs, RNNs), understand architectural innovations (ResNets, attention), master the transformative Transformer architecture, and ground everything in information theory and first principles.
Unlike many AI curricula that focus solely on techniques, this collection emphasizes:
- Theory: Information theory, complexity, and compression as learning foundations
- Fundamentals: Deep understanding of core architectures before chasing latest trends
- Progression: Building from basic CNNs/RNNs to Transformers to specialized modules
- Philosophy: Questions about intelligence, randomness, and computational limits
This isn't just a technical reading list—it's a window into how one of AI's leading researchers thinks about the field.