Artificial Intelligence

A paper-centric guide to artificial intelligence — from the first neuron model in 1943 to the large language models of today.

Modern AI was not invented in a single lab or a single year. It grew through decades of papers — each one answering a stubborn question, opening a new door, or slamming one shut. This guide is organized around those papers: the ideas that drove the field forward, the dead ends that redirected it, and the breakthroughs that changed everything.

How to use this guide

Each section below covers an era. The landmark papers are highlighted with the question they set out to answer. Click through to dedicated pages for in-depth explanations and code. See Fundamental Papers for the complete annotated reading list in chronological order.


Era 1 · Origins of Neural Computation (1943–1969)

Can a machine think? Can it learn?

The story begins with a deceptively simple question: can networks of simple units compute? Two decades of work establish both the promise and the hard limits of artificial neurons.

| Year | Paper | Key contribution |
| --- | --- | --- |
| 1943 | McCulloch & Pitts — A Logical Calculus… | First mathematical model of a neuron |
| 1958 | Rosenblatt — The Perceptron | First learning algorithm for neural networks |
| 1969 | Minsky & Papert — Perceptrons | Proof of fundamental limits → first AI Winter |
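The perceptron's learning rule is simple enough to fit in a few lines: update the weights only when the prediction is wrong, nudging them toward the correct answer. A minimal sketch, trained here on the AND function (linearly separable, so Rosenblatt's convergence theorem guarantees the loop finds a solution); the learning rate and epoch count are illustrative choices, not values from the paper:

```python
# Sketch of Rosenblatt's perceptron learning rule (1958), trained on AND.
def train_perceptron(samples, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - pred          # zero when the prediction is right
            w[0] += lr * err * x[0]      # weights change only on mistakes
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

Swap `AND` for XOR's truth table and no setting of the weights classifies all four points correctly: that is the kind of limit Minsky & Papert formalized in 1969.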



Era 2 · The Backpropagation Revolution (1986–1991)

How can a deep network learn?

After the first AI winter, a rediscovered algorithm reignites the field. Backpropagation makes it possible to train multi-layer networks — and the ideas from this era still underpin every neural network trained today.

| Year | Paper | Key contribution |
| --- | --- | --- |
| 1986 | Rumelhart, Hinton & Williams | Backpropagation — gradient-based training for deep nets |
| 1989 | LeCun et al. | Convolutional Neural Networks — deep learning applied to vision |
| 1989 | Cybenko | Universal Approximation Theorem — networks can approximate any continuous function |
| 1991 | Bottou | Stochastic Gradient Descent as the standard optimizer |
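Backpropagation is the chain rule applied layer by layer. A toy sketch with one weight per layer, validating the analytic gradient against a finite-difference estimate; all numeric values here are arbitrary, chosen only for illustration:

```python
# Backpropagation on a two-layer network with one unit per layer,
# checked against finite differences.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, t):
    h = sigmoid(w1 * x)          # hidden layer
    y = w2 * h                   # linear output layer
    return (y - t) ** 2          # squared error against target t

def backprop(w1, w2, x, t):
    h = sigmoid(w1 * x)
    y = w2 * h
    dL_dy = 2 * (y - t)              # gradient at the output
    dL_dw2 = dL_dy * h               # output-layer weight
    dL_dh = dL_dy * w2               # error propagated backwards
    dL_dw1 = dL_dh * h * (1 - h) * x # through the sigmoid's derivative
    return dL_dw1, dL_dw2

w1, w2, x, t = 0.5, -0.3, 1.2, 1.0
g1, g2 = backprop(w1, w2, x, t)

# finite-difference check: perturb each weight and measure the loss
eps = 1e-6
fd1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
fd2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
```

A stochastic gradient descent step is then just `w1 -= lr * g1`, computed on one randomly drawn training example at a time rather than the full dataset.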



Era 3 · Scaling & Representation (1997–2013)

Can deep networks handle language and massive images?

LSTMs tame long sequences. Word embeddings encode meaning as vectors. Then in 2012, AlexNet wins ImageNet by a landslide and the deep learning era truly begins.

| Year | Paper | Key contribution |
| --- | --- | --- |
| 1997 | Hochreiter & Schmidhuber | LSTM — memory cells for long sequences |
| 2003 | Bengio et al. | Word embeddings — distributed representations of language |
| 2012 | Krizhevsky, Sutskever & Hinton | AlexNet — GPU-trained CNNs dominate ImageNet |
| 2013 | Mikolov et al. | Word2Vec — semantic vector arithmetic |
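Word2Vec's famous result is that directions in embedding space carry meaning: king − man + woman lands near queen. A sketch of that vector arithmetic with toy hand-picked 3-d vectors (real embeddings are learned from corpora and have hundreds of dimensions; the dimension labels below are purely illustrative):

```python
# Word-vector arithmetic on toy embeddings. Toy dims: [royalty, male, female].
def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cos(a, b):
    # cosine similarity: dot product of the two vectors over their norms
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

vocab = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}
target = add(sub(vocab["king"], vocab["man"]), vocab["woman"])
nearest = max(vocab, key=lambda word: cos(vocab[word], target))
```

Real implementations additionally exclude the query words themselves from the nearest-neighbor search; the toy vectors above are chosen so that is not needed.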



Era 4 · Attention & the Transformer (2014–2020)

What if we dropped recurrence entirely?

A single architecture — the Transformer — rewrites NLP, then vision, then science. The attention mechanism becomes the universal building block.

| Year | Paper | Key contribution |
| --- | --- | --- |
| 2014 | Bahdanau et al. | Attention mechanism — alignment between encoder and decoder |
| 2015 | He et al. | ResNet — residual connections enable 100+ layer networks |
| 2017 | Vaswani et al. | Attention Is All You Need — the Transformer architecture |
| 2018 | Devlin et al. | BERT — bidirectional pre-training for language |
| 2020 | Brown et al. | GPT-3 — few-shot learning at scale |
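The Transformer's building block, scaled dot-product attention, can be sketched in a few lines of pure Python for a single query over two key/value pairs; the vectors below are toy values chosen to make the weighting visible:

```python
# Scaled dot-product attention for one query, as in the Transformer.
import math

def softmax(xs):
    m = max(xs)                     # subtract the max for numeric stability
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d_k = len(query)
    # dot-product similarity of the query with every key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)       # attention weights sum to 1
    # output: attention-weighted average of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys   = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([1.0, 0.0], keys, values)  # query aligns with the first key
```

Because the query matches the first key, most of the weight, and hence most of the output, comes from the first value vector. A full Transformer runs this for every position in parallel, with learned projections producing the queries, keys, and values.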



Era 5 · Foundation Models & Reasoning (2020–present)

Can scale alone produce intelligence?

Scaling laws, instruction tuning, RLHF, and chain-of-thought prompting transform large language models from text predictors into reasoning systems.

| Year | Paper | Key contribution |
| --- | --- | --- |
| 2020 | Kaplan et al. | Scaling Laws — loss as a power law of model size, data, and compute |
| 2022 | Ouyang et al. | InstructGPT / RLHF — aligning LLMs to human intent |
| 2023 | Touvron et al. | Llama — open-weight foundation models |
| 2025 | DeepSeek-AI | DeepSeek-R1 — reasoning via reinforcement learning |
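The scaling-law claim has a simple functional form: predicted loss falls as a power of compute, L(C) = (C_c / C)^α. A sketch of what that predicts; the constants `c_c` and `alpha` below are illustrative placeholders, not the fitted values from the paper:

```python
# Power-law scaling: loss as a function of training compute.
def predicted_loss(compute, c_c=1.0, alpha=0.05):
    # L(C) = (C_c / C) ** alpha  (placeholder constants, for illustration)
    return (c_c / compute) ** alpha

# evaluate at 10x, 100x, 1000x, 10000x a reference compute budget
losses = [predicted_loss(10 ** k) for k in range(1, 5)]
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]
# every 10x of compute multiplies the loss by the same factor, 10 ** -alpha:
# a straight line on a log-log plot, which is the signature measured empirically
```

The flip side of the power law is diminishing returns: each constant-factor loss reduction costs another 10x of compute, which is why the era's later papers turn to post-training (RLHF) and reasoning-time techniques rather than scale alone.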



Resources

Online Courses