Neural Language Models
In 2003, Yoshua Bengio’s team showed you could replace count-based language models with neural networks that learn distributed representations of words. This paper (Bengio et al., 2003, “A Neural Probabilistic Language Model”) quietly invented the modern field.
The architecture
Take the previous n−1 words. Map each to a dense vector via a shared embedding matrix. Concatenate. Feed through an MLP. Output a softmax over the vocabulary.
Inputs: w_{t-3}, w_{t-2}, w_{t-1}
↓ (shared embedding)
e_{t-3}, e_{t-2}, e_{t-1} (each ~50–300 dim)
↓ (concatenate)
[e_{t-3} ; e_{t-2} ; e_{t-1}]
↓ (MLP with tanh)
hidden vector
↓ (linear + softmax)
P(w_t | history) over vocabulary
Trained to maximize log-probability of the next word. Standard cross-entropy.
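A minimal sketch of this architecture and training step in PyTorch (the class name, hyperparameters, and toy data below are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Feedforward neural probabilistic LM in the spirit of Bengio et al. (2003)."""
    def __init__(self, vocab_size, emb_dim=100, context_size=3, hidden_dim=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)          # shared embedding matrix
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)          # projection back to the vocabulary

    def forward(self, context_ids):                           # (batch, context_size)
        e = self.emb(context_ids)                             # (batch, context_size, emb_dim)
        x = e.flatten(start_dim=1)                            # concatenate the context embeddings
        h = torch.tanh(self.hidden(x))                        # MLP with tanh
        return self.out(h)                                    # logits; softmax lives inside the loss

# toy training step: maximize log P(next word | previous 3 words)
vocab_size = 50_000
model = NPLM(vocab_size)
context = torch.randint(0, vocab_size, (32, 3))               # batch of 3-word histories
target = torch.randint(0, vocab_size, (32,))                  # the word that actually followed
loss = nn.functional.cross_entropy(model(context), target)    # standard cross-entropy
loss.backward()
```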
What changed
n-gram models treated words as atomic symbols. The neural model represents each word as a dense vector — and similar words end up with similar vectors because the model learns to predict similar contexts for them.
So if the model has seen “rabbit” in some context but never “bunny”, it can still assign “bunny” a reasonable probability there, because their embeddings are close.
This is distributed representation — the central idea behind every modern embedding (Stage 05).
The vocabulary projection problem
The output layer is softmax(W · h) where W has size (vocab_size, hidden_dim). For a 50k-word vocabulary and 500-dim hidden, that’s 25M parameters in just the output.
Computing the full softmax over a large vocabulary is expensive. Workarounds:
- Hierarchical softmax — tree-structured prediction, O(log V) per step.
- Negative sampling / NCE — only contrast with a sample of negatives.
- Sub-word units — smaller effective vocabulary (Stage 05’s tokenization story).
In transformer-era LLMs, large output projections are still expensive but tolerable on modern hardware. Modern models often tie weights (the output projection reuses the transpose of the input embedding matrix), so a single parameter matrix serves both roles instead of two.
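A sketch of what weight tying looks like in PyTorch, just to make the bookkeeping concrete (sizes match the example above; this is an illustration, not any particular model's code):

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 50_000, 500

emb = nn.Embedding(vocab_size, hidden_dim)           # input embedding: (V, d) = 25M params
out = nn.Linear(hidden_dim, vocab_size, bias=False)  # untied output projection: another 25M

out.weight = emb.weight                               # tie: logits = h @ emb.weight.T
                                                      # both layers now share one parameter tensor

h = torch.randn(1, hidden_dim)
logits = out(h)                                       # (1, vocab_size), same matrix used both ways
```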
Word2Vec (2013)
A simplification: drop the hidden layer; predict context words from the center word (or vice versa) using just embeddings + a softmax.
Two variants:
- CBOW (Continuous Bag of Words): predict center from context.
- Skip-gram: predict context from center.
Trained with negative sampling — instead of full softmax, contrast each positive pair with a few random negatives.
L = log σ(e_w · e_c) + Σ_neg log σ(−e_w · e_neg)
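In code, that objective for a single (center, context) pair with k sampled negatives might look like this (a sketch; the function name and shapes are my own):

```python
import torch
import torch.nn.functional as F

def skipgram_neg_sampling_loss(e_w, e_c, e_negs):
    """Loss for one (center, context) pair with k sampled negatives.

    e_w:    (d,)   center-word embedding
    e_c:    (d,)   true context-word embedding
    e_negs: (k, d) embeddings of k randomly sampled "negative" words
    """
    pos = F.logsigmoid(e_w @ e_c)                 # pull the true pair together
    neg = F.logsigmoid(-(e_negs @ e_w)).sum()     # push the random words away
    return -(pos + neg)                           # minimize the negated objective

d, k = 300, 5
loss = skipgram_neg_sampling_loss(torch.randn(d), torch.randn(d), torch.randn(k, d))
```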
Word2Vec famously produces embeddings where:
king − man + woman ≈ queen
Paris − France + Italy ≈ Rome
Linear analogies work because semantic relations (gender, capital-of) show up as roughly consistent directions in the embedding space.
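An analogy query is just vector arithmetic plus a nearest-neighbor search by cosine similarity. A toy sketch with random vectors (in practice you would load trained Word2Vec vectors, and the first query would tend to return “queen”):

```python
import numpy as np

# toy embedding table; real vectors would come from a trained model
emb = {w: np.random.randn(300) for w in ["king", "man", "woman", "queen", "paris"]}

def analogy(a, b, c, emb, exclude=()):
    """Return the word whose vector is closest (by cosine) to emb[a] - emb[b] + emb[c]."""
    target = emb[a] - emb[b] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in emb if w not in exclude]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(analogy("king", "man", "woman", emb, exclude=("king", "man", "woman")))
```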
Word2Vec is no longer state-of-the-art (contextual embeddings beat it for most tasks), but it’s how most people first learn that “embeddings are vectors with semantic geometry” — that intuition transfers to everything else.
GloVe (2014)
Global Vectors for Word Representation. Uses corpus-wide co-occurrence statistics rather than local context windows. Roughly comparable quality to Word2Vec; different mathematical lineage (matrix factorization vs predictive modeling).
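For reference, GloVe fits word vectors to log co-occurrence counts with a weighted least-squares objective, where f downweights rare and very frequent pairs:

J = Σ_{i,j} f(X_ij) · (w_i · w̃_j + b_i + b̃_j − log X_ij)²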
FastText
Word2Vec with character n-grams. Each word’s embedding is a sum of its sub-word embeddings, so:
- Out-of-vocabulary words still get sensible embeddings.
- Morphologically rich languages benefit.
A direct ancestor of modern subword tokenization.
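A rough sketch of the sub-word idea: extract character n-grams, then sum their embeddings (FastText's actual hashing and training details are omitted, and the dimensions and names here are illustrative):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams; < and > mark word boundaries."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def word_vector(word, ngram_emb, dim=100):
    """A word's vector is the sum of its sub-word vectors (zeros for unseen n-grams)."""
    vecs = [ngram_emb.get(g, np.zeros(dim)) for g in char_ngrams(word)]
    return np.sum(vecs, axis=0)

# an out-of-vocabulary word still gets a vector from whatever n-grams it shares
ngram_emb = {g: np.random.randn(100) for g in char_ngrams("rabbit")}
print(word_vector("rabbits", ngram_emb)[:5])
```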
The core insight from this era
You can train a neural network to predict words and as a side effect end up with embeddings that capture meaning. This isn’t a clever trick; it’s what the predict-the-next-word objective forces.
That same objective, scaled up from sentences to documents to eventually billions of tokens, produced modern LLMs. The current state of AI is in many ways just “Bengio 2003 with massively more compute.”
Limits of feedforward neural LMs
- Fixed window: same problem as n-gram models. Information older than n−1 tokens is invisible.
- No sequential structure modeled directly: the order of words within the window is captured only by position in the concatenation — clumsy.
The fix: recurrent networks (next article).