Ilya's Favorite Papers
A Curated Learning Path for AI

Curated by Ilya Sutskever (Former Chief Scientist, OpenAI)
Recommended to John Carmack in 2019

About This Collection

In 2019, Ilya Sutskever, co-founder and then Chief Scientist of OpenAI and previously a research scientist at Google Brain, sent computer scientist John Carmack a carefully curated list of roughly 40 papers and other resources for learning AI. This collection represents a masterclass in AI education from one of the field's most influential researchers.

These papers span the foundations of deep learning, from convolutional networks and recurrent architectures to attention mechanisms and transformers. They also venture into information theory, complexity science, and philosophical foundations of artificial intelligence. Together, they form a comprehensive curriculum for understanding modern AI.

🎯 How to Use This Guide

This collection is organized into thematic sections, progressing from foundational concepts to advanced techniques and theoretical principles. Each entry includes a brief summary to help you extract the most important insights. Whether you're a beginner or an experienced practitioner, you can follow the path sequentially or jump to the sections that interest you most.

Foundations: Building Blocks of Deep Learning

Start here to understand the core concepts that revolutionized computer vision and established deep learning as a dominant paradigm. These papers demonstrate how neural networks learn hierarchical representations and why depth matters.
CS231n: Convolutional Neural Networks for Visual Recognition
Stanford University Course
Course Website

Stanford's flagship deep learning course covering convolutional neural networks from first principles. Provides hands-on implementation experience with backpropagation, optimization, and modern architectures.
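
To give a flavor of those exercises, here is a small illustrative sketch (not course material) that checks an analytic gradient against a numerical estimate for a toy affine + ReLU layer, the kind of sanity check the assignments rely on:

    import numpy as np

    # Toy layer: loss = sum(relu(x @ W)). Compare the analytic gradient of the loss
    # with a central-difference numerical estimate, a standard backprop sanity check.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 3))
    W = rng.standard_normal((3, 5))

    def loss(W):
        return np.maximum(x @ W, 0.0).sum()

    # Analytic gradient: dL/dW = x^T @ 1[x @ W > 0]
    grad_analytic = x.T @ (x @ W > 0).astype(float)

    # Numerical gradient via central differences
    eps = 1e-5
    grad_numeric = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            grad_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

    print(np.abs(grad_analytic - grad_numeric).max())  # should be tiny, e.g. ~1e-9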

ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012)
Paper

AlexNet—the breakthrough that launched the deep learning revolution. Won ImageNet 2012 by a massive margin, demonstrating that deep CNNs could dramatically outperform traditional computer vision methods.

Understanding LSTM Networks
Christopher Olah (2015)
Blog Post

The definitive visual explanation of Long Short-Term Memory networks. Breaks down the architecture's gates and cell states with intuitive diagrams, making complex concepts accessible.
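
As a companion to the diagrams, here is a minimal sketch of a single LSTM step written out gate by gate, following the standard formulation rather than any particular library's internals:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM time step. W maps [x; h_prev] to the four gate pre-activations."""
        z = np.concatenate([x, h_prev]) @ W + b       # shape: (4 * hidden,)
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        g = np.tanh(g)                                # candidate cell update
        c = f * c_prev + i * g                        # cell state: gated memory
        h = o * np.tanh(c)                            # hidden state: gated read-out
        return h, c

    hidden, inp = 8, 4
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((inp + hidden, 4 * hidden))
    b = np.zeros(4 * hidden)
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in rng.standard_normal((5, inp)):           # run over a short input sequence
        h, c = lstm_step(x, h, c, W, b)
    print(h.shape, c.shape)                           # (8,) (8,)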

Core Architectures: Modern Network Design

These papers introduced architectural innovations that became standard practice. ResNets solved the degradation problem in very deep networks, while dilated convolutions enabled efficient receptive field expansion.
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015)
ArXiv

ResNets revolutionized deep learning by introducing skip connections, enabling networks with 100+ layers. Won ImageNet 2015 and became the most influential architecture of the 2010s.
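
The core idea fits in a few lines. Here is a minimal PyTorch sketch of a channel-preserving residual block, written for illustration rather than taken from the paper's reference implementation:

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        """y = relu(F(x) + x): the block learns a residual F(x) on top of an identity path."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return torch.relu(out + x)   # skip connection: gradients flow through the identity

    x = torch.randn(2, 64, 32, 32)
    print(ResidualBlock(64)(x).shape)    # torch.Size([2, 64, 32, 32])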

Identity Mappings in Deep Residual Networks
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2016)
ArXiv

Follow-up analysis revealing that "clean" identity paths (pre-activation ResNets) enable even deeper and more accurate networks by ensuring unimpeded gradient flow.
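
The change is essentially a reordering of the same pieces. A minimal sketch of the pre-activation variant, assuming the same channel-preserving block shape as the sketch above:

    import torch
    from torch import nn

    class PreActResidualBlock(nn.Module):
        """BatchNorm and ReLU come before each conv, so the skip path carries x through
        untouched and the addition at the end is a pure identity mapping."""
        def __init__(self, channels):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

        def forward(self, x):
            out = self.conv1(torch.relu(self.bn1(x)))
            out = self.conv2(torch.relu(self.bn2(out)))
            return out + x               # no activation after the addition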

Multi-Scale Context Aggregation by Dilated Convolutions
Fisher Yu, Vladlen Koltun (2015)
ArXiv

Introduced dilated (atrous) convolutions for dense prediction tasks. Enables exponential receptive field expansion without losing resolution or adding parameters.
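
A minimal sketch of the effect: stacking 3x3 convolutions with exponentially increasing dilation keeps resolution and parameter count fixed while the receptive field grows exponentially (layer sizes here are arbitrary):

    import torch
    from torch import nn

    # Each dilated 3x3 convolution adds 2 * dilation pixels to the receptive field,
    # so dilations 1, 2, 4, 8 give a 31-pixel receptive field from only four layers.
    layers, dilation, receptive_field = [], 1, 1
    for _ in range(4):
        layers.append(nn.Conv2d(8, 8, kernel_size=3, padding=dilation, dilation=dilation))
        receptive_field += 2 * dilation
        dilation *= 2
    net = nn.Sequential(*layers)

    x = torch.randn(1, 8, 64, 64)
    print(net(x).shape)        # torch.Size([1, 8, 64, 64]); spatial size is preserved
    print(receptive_field)     # 1 + 2 * (1 + 2 + 4 + 8) = 31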

Attention & Transformers: The Revolution in Sequence Modeling

Attention mechanisms fundamentally changed how models process sequences. These papers trace the evolution from additive attention in NMT to the self-attention of Transformers—the architecture that powers modern LLMs.
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2014)
ArXiv

Introduced attention mechanisms for sequence-to-sequence models. Solved the bottleneck problem where encoders had to compress entire sentences into fixed-size vectors.
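
A minimal sketch of additive attention for a single decoder step, with hypothetical tensor names, to make the removal of the fixed-size bottleneck concrete: the decoder gets a fresh context vector for every output word instead of one compressed summary.

    import torch
    from torch import nn

    # Score every encoder state against the current decoder state, softmax the scores
    # into alignment weights, and build a per-step context vector from them.
    enc_dim, dec_dim, attn_dim, T = 16, 16, 32, 7
    W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
    W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
    v = nn.Linear(attn_dim, 1, bias=False)

    enc_states = torch.randn(T, enc_dim)    # one encoder state per source token
    dec_state = torch.randn(dec_dim)        # current decoder hidden state

    scores = v(torch.tanh(W_enc(enc_states) + W_dec(dec_state))).squeeze(-1)  # (T,)
    weights = torch.softmax(scores, dim=0)                                    # alignment
    context = weights @ enc_states          # context vector for this output word
    print(weights.shape, context.shape)     # torch.Size([7]) torch.Size([16])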

Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. (2017)
ArXiv

The Transformer architecture—arguably the most important paper in modern AI. Replaced recurrence with self-attention, enabling parallel training and scaling to billions of parameters.
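
Its core operation, scaled dot-product self-attention, is compact enough to sketch directly (a single head, illustrative code rather than the paper's implementation):

    import math
    import torch

    def self_attention(x, W_q, W_k, W_v):
        """Single-head scaled dot-product self-attention over a sequence x of shape (T, d)."""
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / math.sqrt(K.shape[-1])   # every position attends to every position
        return torch.softmax(scores, dim=-1) @ V    # weighted mixture of value vectors

    T, d = 6, 8
    x = torch.randn(T, d)
    W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
    print(self_attention(x, W_q, W_k, W_v).shape)   # torch.Size([6, 8])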

The Annotated Transformer
Sasha Rush, et al. (2018)
Blog | Code

Line-by-line implementation guide to the Transformer paper. Combines paper walkthrough with working PyTorch code, making the architecture accessible and reproducible.

Recurrent Networks: Sequential Processing and Memory

Before Transformers dominated, RNNs were the workhorse for sequential data. These papers explore their capabilities, limitations, and extensions with external memory.
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy (2015)
Blog | Code

Influential blog post demonstrating the surprising capabilities of character-level RNNs. Shows how simple models can learn complex structures like code syntax and LaTeX formatting.

Recurrent Neural Network Regularization
Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals (2014)
ArXiv | Code

Demonstrates that dropout in RNNs should only be applied to non-recurrent connections. Established best practices for regularizing recurrent models.
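
The prescription is easy to state in code: apply dropout where activations move up the stack, never on the hidden-to-hidden recurrence. A minimal sketch with two stacked LSTM cells (illustrative; PyTorch's nn.LSTM dropout argument follows the same between-layers convention):

    import torch
    from torch import nn

    class TwoLayerLSTM(nn.Module):
        def __init__(self, inp, hidden, p=0.5):
            super().__init__()
            self.l1 = nn.LSTMCell(inp, hidden)
            self.l2 = nn.LSTMCell(hidden, hidden)
            self.drop = nn.Dropout(p)

        def forward(self, xs):                             # xs: (T, batch, inp)
            B, H = xs.shape[1], self.l1.hidden_size
            h1, c1, h2, c2 = (xs.new_zeros(B, H) for _ in range(4))
            outs = []
            for x in xs:
                h1, c1 = self.l1(x, (h1, c1))              # recurrent path: no dropout
                h2, c2 = self.l2(self.drop(h1), (h2, c2))  # layer-to-layer path: dropout
                outs.append(h2)
            return torch.stack(outs)

    print(TwoLayerLSTM(4, 8)(torch.randn(5, 2, 4)).shape)  # torch.Size([5, 2, 8])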

Neural Turing Machines
Alex Graves, Greg Wayne, Ivo Danihelka (2014)
ArXiv

Extends neural networks with external memory that can be read from and written to via differentiable attention. Learns algorithmic tasks like sorting and copying through gradient descent.
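
A minimal sketch of the content-based addressing behind a differentiable read (the full model adds location-based addressing, write heads, and a learned controller):

    import torch

    def content_read(memory, key, beta):
        """Match the key against every memory row by cosine similarity, sharpen with
        beta, softmax into read weights, and return the weighted read vector."""
        sim = torch.cosine_similarity(memory, key.unsqueeze(0), dim=-1)  # (N,)
        w = torch.softmax(beta * sim, dim=0)                             # read weights
        return w @ memory, w

    N, M = 16, 8                          # 16 memory slots, each of width 8
    memory = torch.randn(N, M)
    key = torch.randn(M)
    read_vec, weights = content_read(memory, key, beta=torch.tensor(5.0))
    print(read_vec.shape, float(weights.sum()))   # torch.Size([8]) 1.0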

Advanced Architectures: Specialized Neural Modules

These papers introduce specialized architectures for structured data, relational reasoning, and complex input/output mappings.
Pointer Networks
Oriol Vinyals, Meire Fortunato, Navdeep Jaitly (2015)
ArXiv

Introduces an attention mechanism for variable-length output dictionaries. Instead of predicting from a fixed vocabulary, the network "points" to positions in its input.
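
A minimal sketch of a single pointer decoding step, with hypothetical tensor names: the attention distribution over input positions is itself the output distribution.

    import torch
    from torch import nn

    # Attention scores over the encoder states are used directly as output logits,
    # so the "vocabulary" is the set of input positions and grows with the input.
    d, T = 16, 5
    W_enc, W_dec = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
    v = nn.Linear(d, 1, bias=False)

    enc = torch.randn(T, d)               # encoder states for T input elements
    dec = torch.randn(d)                  # current decoder state

    logits = v(torch.tanh(W_enc(enc) + W_dec(dec))).squeeze(-1)   # (T,)
    probs = torch.softmax(logits, dim=0)  # distribution over input positions
    print(probs, probs.argmax().item())   # index of the input element to emit next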

Neural Message Passing for Quantum Chemistry
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, George E. Dahl (2017)
ArXiv

Unifies various graph neural network architectures under a message passing framework. Demonstrates effectiveness on molecular property prediction.
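
A minimal sketch of one message-passing round on a tiny graph, using sum aggregation with an MLP message function and a GRU update as one concrete instance of the framework (not the paper's specific edge network):

    import torch
    from torch import nn

    num_nodes, d = 4, 8
    h = torch.randn(num_nodes, d)                             # node states
    edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]  # directed edge list

    msg_fn = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())    # message from (sender, receiver)
    upd_fn = nn.GRUCell(d, d)                                 # node state update

    # Message phase: every node sums the messages arriving from its neighbors.
    messages = []
    for node in range(num_nodes):
        incoming = [msg_fn(torch.cat([h[src], h[node]]))
                    for src, dst in edges if dst == node]
        messages.append(torch.stack(incoming).sum(dim=0))

    # Update phase: every node updates its state from the aggregated message.
    h_new = upd_fn(torch.stack(messages), h)
    print(h_new.shape)                                        # torch.Size([4, 8])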

A simple neural network module for relational reasoning
Adam Santoro, David Raposo, David G.T. Barrett, et al. (2017)
ArXiv

Relation Networks (RNs) explicitly compute relations between all pairs of objects. Achieves superhuman performance on CLEVR visual reasoning benchmark.
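
The module itself is compact, RN(O) = f(sum over all pairs (i, j) of g(o_i, o_j)), with g and f shared MLPs. A minimal sketch:

    import torch
    from torch import nn

    # Apply a shared MLP g to every ordered pair of objects, sum the results, and
    # pass the sum through a second MLP f. All relational reasoning happens in g.
    n_obj, d = 6, 10
    objects = torch.randn(n_obj, d)

    g = nn.Sequential(nn.Linear(2 * d, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
    f = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))

    pairs = torch.cat([
        objects.unsqueeze(1).expand(n_obj, n_obj, d),   # o_i repeated along dim 1
        objects.unsqueeze(0).expand(n_obj, n_obj, d),   # o_j repeated along dim 0
    ], dim=-1).reshape(n_obj * n_obj, 2 * d)            # all ordered pairs (o_i, o_j)

    output = f(g(pairs).sum(dim=0))                     # sum over pairs, then reason with f
    print(output.shape)                                 # torch.Size([2])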

Relational recurrent neural networks
Adam Santoro, Ryan Faulkner, David Raposo, et al. (2018)
ArXiv

Extends RNNs with relational memory core that performs attention-like operations over memory slots. Improves performance on tasks requiring flexible memory access.

Training & Scaling: Making Large Models Practical

As models grew larger, new training techniques became essential. These papers address parallelism, stability, and curriculum learning for complex tasks.
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, et al. (2019)
ArXiv

Efficient pipeline parallelism library enabling training of very large models across multiple accelerators. Achieved state-of-the-art on ImageNet and translation.
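
The key ingredient is easy to picture: split each mini-batch into micro-batches so that different pipeline stages can work on different micro-batches at the same time. Below is a toy, forward-only simulation of that schedule with two stages (purely illustrative; GPipe also schedules the backward pass and re-materializes activations to save memory):

    import torch
    from torch import nn

    # A model split into two stages and a mini-batch split into micro-batches.
    # In each loop iteration ("clock tick"), stage 1 finishes one micro-batch while
    # stage 0 starts the next, which is what keeps both devices busy in a real pipeline.
    stage0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())    # imagine this on device 0
    stage1 = nn.Linear(32, 4)                               # imagine this on device 1

    batch = torch.randn(8, 16)
    micro_batches = list(batch.chunk(4))                    # 4 micro-batches of 2 examples

    in_flight, outputs = None, []
    for mb in micro_batches + [None]:
        if in_flight is not None:
            outputs.append(stage1(in_flight))               # stage 1 finishes micro-batch k
        in_flight = stage0(mb) if mb is not None else None  # stage 0 starts micro-batch k+1
    print(torch.cat(outputs).shape)                         # torch.Size([8, 4])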

Order Matters: Sequence to sequence for sets
Oriol Vinyals, Samy Bengio, Manjunath Kudlur (2015)
ArXiv

Demonstrates that the order in which inputs and outputs are presented has a large effect on seq2seq performance, and proposes an attention-based Read-Process-Write architecture for handling sets (unordered inputs). Tackles problems like sorting through learned attention patterns.

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, et al. (2015)
ArXiv

Large-scale production speech recognition system using end-to-end deep learning. Demonstrates that the same architecture works across languages with minimal changes.

Theory & Principles: Learning, Compression, and Generalization

These papers explore theoretical foundations: how to balance model complexity with data fit, the connection between compression and learning, and empirical laws governing neural network scaling.
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Geoffrey E. Hinton, Drew van Camp (1993)
Paper

Foundational work connecting information theory to neural network learning. Proposes Minimum Description Length (MDL) principle for regularization.
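
The principle can be made concrete with a crude two-part code: the cost of a model is the bits needed to describe its weights plus the bits needed to describe the data's remaining errors given those weights. A toy sketch with an assumed 32 bits per weight and an idealized Gaussian code for the residuals (all constants illustrative):

    import numpy as np

    # Fit polynomials of increasing degree to noisy data and compare their total
    # description lengths. More capacity buys a better fit (fewer error bits) but
    # costs more weight bits; the minimum of the sum is the MDL choice.
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 200)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)   # synthetic data

    def residual_bits(residuals, sigma=0.1, precision=0.01):
        """Idealized bits to send the residuals to within +/- precision under a N(0, sigma^2) code."""
        nats = (0.5 * np.sum(residuals**2) / sigma**2
                + residuals.size * np.log(sigma * np.sqrt(2 * np.pi) / precision))
        return nats / np.log(2)

    for degree in (1, 3, 5, 9, 13):
        coeffs = np.polyfit(x, y, degree)
        weight_bits = 32 * coeffs.size
        error_bits = residual_bits(y - np.polyval(coeffs, x))
        print(f"degree {degree:2d}: {weight_bits:4d} weight bits "
              f"+ {error_bits:8.1f} error bits = {weight_bits + error_bits:8.1f} total")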

Variational Lossy Autoencoder
Xi Chen, Diederik P. Kingma, Tim Salimans, et al. (2016)
ArXiv

Addresses "posterior collapse" problem in VAEs where decoder ignores latent code. Introduces learnable rate control and improved optimization.

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, et al. (2020)
ArXiv

Empirical study revealing smooth power-law relationships between model performance and scale (parameters, data, compute). Fundamentally changed how AI labs think about model development.
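
The headline form is a power law, for example L(N) = (Nc / N)^alpha for parameter count N, which shows up as a straight line on a log-log plot. A small sketch that recovers the exponent from synthetic points (the constants here are hypothetical, not the paper's measured values):

    import numpy as np

    # If L(N) = (Nc / N)**alpha, then log L is linear in log N, so the exponent can
    # be read off as the (negated) slope of a log-log fit.
    def power_law(N, Nc=1e13, alpha=0.08):      # hypothetical constants
        return (Nc / N) ** alpha

    N = np.logspace(6, 11, 20)                  # model sizes: 1e6 .. 1e11 parameters
    noise = np.exp(0.01 * np.random.default_rng(0).standard_normal(N.size))
    L = power_law(N) * noise                    # synthetic, slightly noisy losses

    slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
    alpha_hat = -slope
    Nc_hat = np.exp(intercept / alpha_hat)
    print(f"recovered alpha ~ {alpha_hat:.3f}, Nc ~ {Nc_hat:.2e}")   # close to 0.08 and 1e13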

A Tutorial Introduction to the Minimum Description Length Principle
Peter Grünwald (2004)
ArXiv

Comprehensive introduction to MDL principle for model selection. Connects information theory, statistics, and machine learning through the lens of compression.

Information Theory & Complexity: Fundamental Limits and Philosophical Foundations

These works explore deep questions about information, complexity, and computation. They provide philosophical grounding and connect AI to broader scientific principles.
Kolmogorov Complexity and Algorithmic Randomness
A. Shen, V. A. Uspensky, N. Vereshchagin (2017)
Book

Comprehensive textbook on Kolmogorov complexity—the length of the shortest program that produces a given string. Provides formal foundation for concepts of information and randomness.
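
K(x) itself is uncomputable, but any off-the-shelf compressor gives an upper bound, which makes the idea easy to experiment with. A small sketch using zlib as a stand-in:

    import os
    import zlib

    # The compressed size of a string upper-bounds its Kolmogorov complexity (up to
    # the fixed cost of the decompressor): structured strings compress far below
    # their length, while random bytes barely compress at all.
    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    structured = b"ab" * 5000         # 10,000 bytes with an obvious short description
    random_ish = os.urandom(10000)    # 10,000 bytes with (almost surely) no short description

    print(len(structured), compressed_size(structured))   # 10000 vs. a few dozen bytes
    print(len(random_ish), compressed_size(random_ish))   # 10000 vs. roughly 10000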

The First Law of Complexodynamics
Scott Aaronson (2011)
Blog

Playful yet profound exploration of how entropy and complexity evolve in physical systems. Discusses why complexity increases then decreases over time.

Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
Scott Aaronson, Sean M. Carroll, Lauren Ouellette (2014)
ArXiv

Concrete model (coffee mixing in cream) demonstrating how complexity evolves in thermodynamic systems. Uses cellular automata to make abstract concepts precise.

Machine Super Intelligence
Shane Legg (2008)
PhD Thesis

Shane Legg's PhD thesis exploring formal definitions of intelligence and paths to machine superintelligence. Legg would go on to co-found DeepMind with Demis Hassabis.

Why This Collection Matters

Ilya Sutskever's selection reveals his philosophy on learning AI: start with solid foundations (CNNs, RNNs), understand the key architectural innovations (ResNets, attention), master the Transformer architecture, and ground everything in information theory and first principles.

Unlike many AI curricula that focus solely on techniques, this collection emphasizes first principles: information theory, compression, complexity, and the question of why learning works at all.

This isn't just a technical reading list—it's a window into how one of AI's leading researchers thinks about the field.