Artificial Intelligence
A paper-centric guide to artificial intelligence — from the first neuron model in 1943 to the large language models of today.
Modern AI was not invented in a single lab or a single year. It grew through decades of papers — each one answering a stubborn question, opening a new door, or slamming one shut. This guide is organized around those papers: the ideas that drove the field forward, the dead ends that redirected it, and the breakthroughs that changed everything.
Each section below covers an era. The landmark papers are highlighted with the question they set out to answer. Click through to dedicated pages for in-depth explanations and code. Use Fundamental Papers for the complete annotated chronological reading list.
Era 1 · Origins of Neural Computation (1943–1969)
Can a machine think? Can it learn?
The story begins with a deceptively simple question: can networks of simple units compute? Two decades of work establish both the promise and the hard limits of artificial neurons.
| Year | Paper | Key contribution |
|---|---|---|
| 1943 | McCulloch & Pitts — A Logical Calculus… | First mathematical model of a neuron |
| 1958 | Rosenblatt — The Perceptron | First learning algorithm for neural networks |
| 1969 | Minsky & Papert — Perceptrons | Proof that single-layer perceptrons cannot learn functions like XOR → first AI Winter |
Deep dives:
- Perceptron (Rosenblatt, 1958) — the algorithm that started it all, with a Python implementation
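To give a taste of the deep dive, here is a compact sketch of Rosenblatt's learning rule in pure Python. The AND-gate training data and the learning rate are illustrative choices, not from the 1958 paper:

```python
# A minimal sketch of the perceptron learning rule (Rosenblatt, 1958),
# trained here on the AND function — an illustrative, linearly separable task.

def predict(w, b, x):
    """Step activation: fire (1) if the weighted sum clears the threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train(samples, lr=0.1, epochs=20):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)  # 0 when correct, ±1 when wrong
            # Update rule: nudge weights toward the misclassified input
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(and_gate)
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop finds a separating line — exactly the guarantee that Minsky & Papert showed breaks down for XOR.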
Era 2 · The Backpropagation Revolution (1986–1991)
How can a deep network learn?
After a decade of winter, a rediscovered algorithm reignites the field. Backpropagation makes it possible to train multi-layer networks — and the ideas from this era still underpin every neural network trained today.
| Year | Paper | Key contribution |
|---|---|---|
| 1986 | Rumelhart, Hinton & Williams | Backpropagation — gradient-based training for deep nets |
| 1989 | LeCun et al. | Convolutional Neural Networks — deep learning applied to vision |
| 1989 | Cybenko | Universal Approximation Theorem — one hidden layer can approximate any continuous function |
| 1991 | Bottou | Stochastic Gradient Descent as the standard optimizer |
Concepts:
- Deep Learning — neurons, layers, backpropagation, architectures
- Optimizers — SGD, Adam, learning rate schedules
- Loss Functions — MAE, MSE, cross-entropy
- Activation Functions — ReLU, sigmoid, softmax
- Tensors — the fundamental data structure
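Several of these concepts fit in one small example: a single sigmoid neuron trained with per-sample (stochastic) gradient descent under cross-entropy loss. This is a sketch, not full backpropagation — a multi-layer network applies the same chain rule layer by layer — and the OR-gate data is an illustrative assumption:

```python
# One sigmoid neuron trained by SGD with cross-entropy loss (pure Python).
# For this loss/activation pairing the chain rule collapses to the
# textbook gradient: dLoss/dw_i = (prediction - target) * x_i.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Sigmoid of the weighted sum plus bias."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train(samples, lr=0.5, epochs=500):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:      # per-sample updates, i.e. SGD
            p = forward(w, b, x)
            grad = p - target          # dLoss/dz for sigmoid + cross-entropy
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

# Toy data (illustrative): the OR function, which is linearly separable.
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train(or_gate)
```

After training, the neuron's output is below 0.5 only for the (0, 0) input — the same separating-line behavior as the perceptron, but learned by smooth gradient steps rather than discrete corrections.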
Era 3 · Scaling & Representation (1997–2013)
Can deep networks handle language and massive images?
LSTMs tame long sequences. Word embeddings encode meaning as vectors. Then in 2012, AlexNet wins ImageNet by a landslide and the deep learning era truly begins.
| Year | Paper | Key contribution |
|---|---|---|
| 1997 | Hochreiter & Schmidhuber | LSTM — memory cells for long sequences |
| 2003 | Bengio et al. | Word embeddings — distributed representations of language |
| 2012 | Krizhevsky, Sutskever & Hinton | AlexNet — GPU-trained CNNs dominate ImageNet |
| 2013 | Mikolov et al. | Word2Vec — semantic vector arithmetic |
Concepts:
- Machine Learning — supervised vs. unsupervised, bias-variance
- Cross-Validation — robust model evaluation
- Parsnip — unified modeling in R (tidymodels)
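Of these, cross-validation is the easiest to sketch directly. Below is an illustrative pure-Python k-fold loop; the `fit`/`score` callables are hypothetical placeholders for whatever model interface you use:

```python
# Illustrative k-fold cross-validation: partition the data into k folds,
# hold one fold out for evaluation, train on the rest, average the scores.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, near-equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, fit, score):
    """fit(train) -> model; score(model, test) -> float. Returns the mean score."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        test = [data[j] for j in test_idx]
        scores.append(score(fit(train), test))
    return sum(scores) / k
```

In practice you would also shuffle (or stratify) before splitting; contiguous folds are used here only to keep the sketch short.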
Era 4 · Attention & the Transformer (2014–2020)
What if we dropped recurrence entirely?
A single architecture — the Transformer — rewrites NLP, then vision, then science. The attention mechanism becomes the universal building block.
| Year | Paper | Key contribution |
|---|---|---|
| 2014 | Bahdanau et al. | Attention mechanism — alignment between encoder and decoder |
| 2015 | He et al. | ResNet — residual connections enable 100+ layer networks |
| 2017 | Vaswani et al. | Attention Is All You Need — the Transformer architecture |
| 2018 | Devlin et al. | BERT — bidirectional pre-training for language |
| 2020 | Brown et al. | GPT-3 — few-shot learning at scale |
Concepts:
- Transformers — self-attention, positional encoding, encoder-decoder
- Foundation Models — Llama, Gemma, Mistral, DeepSeek
- HuggingFace Transformers — the pipeline() API in practice
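The self-attention at the heart of the Transformer is short enough to write out. Here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V; the shapes are illustrative, and a real implementation adds masking, multiple heads, and learned projections:

```python
# Scaled dot-product attention (Vaswani et al., 2017) in NumPy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k): query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # weighted mix of values, plus weights

rng = np.random.default_rng(0)           # random toy inputs (illustrative)
Q = rng.normal(size=(3, 4))              # 3 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))              # 5 keys of the same dimension
V = rng.normal(size=(5, 2))              # 5 values of dimension 2
out, weights = attention(Q, K, V)        # out: (3, 2), weights: (3, 5)
```

Each output row is a convex combination of the value rows — the "alignment" idea from Bahdanau et al., made the entire architecture.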
Era 5 · Foundation Models & Reasoning (2020–present)
Can scale alone produce intelligence?
Scaling laws, instruction tuning, RLHF, and chain-of-thought prompting transform large language models from text predictors into reasoning systems.
| Year | Paper | Key contribution |
|---|---|---|
| 2020 | Kaplan et al. | Scaling Laws — loss as a power law of compute |
| 2022 | Ouyang et al. | InstructGPT / RLHF — aligning LLMs to human intent |
| 2023 | Touvron et al. | Llama — open-weight foundation models |
| 2025 | DeepSeek-AI | DeepSeek-R1 — reasoning via reinforcement learning |
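The scaling-law claim — loss as a power law of compute, L(C) = a·C^(−α) — is linear in log-log space, so the exponent falls out of an ordinary least-squares fit. The points below are synthetic, generated from a known exponent purely to show the mechanics of such a fit:

```python
# Fitting a power law L(C) = a * C^(-alpha) by linear regression in log space.
import math

# Synthetic (compute, loss) pairs generated from L = 10 * C^(-0.05) —
# illustrative data, not measurements from any paper.
points = [(10 ** k, 10.0 * (10 ** k) ** -0.05) for k in range(3, 9)]

# In log space: log L = log a - alpha * log C, a straight line.
xs = [math.log(c) for c, _ in points]
ys = [math.log(loss) for _, loss in points]
n = len(points)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
alpha = -slope  # recovers the exponent used to generate the data (0.05)
```

On real training runs the points scatter around the line, but the same log-log regression is how the published exponents are estimated.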
Concepts:
- Foundation Models
- Learning Paradigms — where statistical, ML, and deep learning meet
Resources
- Fundamental Papers — full annotated chronological reading list (1943–2025)
- References — textbooks and links
- Jupyter — notebook tips and workflows
- Kaggle — competitions and datasets