Byte Pair Encoding

Byte Pair Encoding is a data compression algorithm published in 1994 by Philip Gage in an article for the C Users Journal. Thirty years later, with one modification, it became the tokenizer behind GPT, LLaMA, and almost every modern language model.

Understanding how it works means understanding how models turn raw text into something they can actually process. In this article we will implement BPE from scratch in PHP — training and encoding included.


The tokenizer problem

Language models do not read text. They read sequences of integers. Before a model can process the sentence "the cat sat on the mat", that sentence must be converted into something like [1037, 14287, 232, 519, 14920]. This is the job of the tokenizer: transform strings into sequences of numeric IDs, and convert them back to text on the way out.

The question is: how is this mapping defined? There are three fundamental approaches.

Character-level: every character is a token. A tiny vocabulary (26 letters plus punctuation), but sequences become enormous. The word "internationalization" is already 20 tokens. Models struggle to capture linguistic patterns across sequences that long.

Word-level: every word is a token. Short sequences, but the vocabulary explodes — inflected languages have millions of word forms. Worse, words never seen during training cannot be represented at all.

Subword: the compromise. Tokens represent word fragments — suffixes, prefixes, stems. A manageable vocabulary (32,000–100,000 tokens) that covers any word by decomposing it into its pieces. BPE is the most widely used algorithm for building this vocabulary.


The core idea: compression

Before talking about NLP, it is worth understanding what BPE does in its original form, because the principle is identical.

BPE operates on a sequence of symbols. It finds the most frequent pair of adjacent symbols, replaces every occurrence with a new symbol, and saves the substitution rule. It repeats until it reaches the desired number of rules.

Example on a character sequence:

Input:   a a b c a a b d a a b c

Step 1:  most frequent pair → (a, a)  →  replace with X
         X b c X b d X b c

Step 2:  most frequent pair → (X, b)  →  replace with Y
         Y c Y d Y c

Step 3:  most frequent pair → (Y, c)  →  replace with Z
         Z Y d Z

The original 12-symbol sequence has become 4. The rule dictionary — (a,a)→X, (X,b)→Y, (Y,c)→Z — is sufficient to reconstruct the original. This is BPE as a compressor.


The adaptation for language models

In 2016, Rico Sennrich and colleagues published Neural Machine Translation of Rare Words with Subword Units, which adapted BPE to the tokenization problem. The changes from the original are minimal but fundamental.

A different starting point. Instead of raw bytes or arbitrary characters, BPE for NLP starts from corpus words decomposed into their characters — each word is represented as a sequence of space-separated characters.

The end-of-word marker. Each word receives the suffix </w> before being decomposed. Without this marker, the token low extracted from "lower" would be indistinguishable from the token low standing on its own. The </w> preserves word boundaries and lets the model understand whether a token is a suffix, a prefix, or a complete word.

The output is rules, not compressed symbols. The goal is not to compress the corpus: it is to learn an ordered set of merge rules that define the model's vocabulary.


Training step by step

Suppose we have a minimal corpus: the words nation, station, and ration, each with frequency 1. The first step is to decompose each word into space-separated characters with </w> appended:

n a t i o n </w>   frequency: 1
s t a t i o n </w>   frequency: 1
r a t i o n </w>   frequency: 1

We count all adjacent symbol pairs across the entire corpus, weighted by the frequency of the word containing them:

Pair Frequency
a, t 3
t, i 3
i, o 3
o, n 3
n, </w> 3
n, a 1
s, t 1
t, a 1
r, a 1

Five pairs are tied at frequency 3. We take (a, t) and merge it into at:

n at i o n </w>
s t at i o n </w>
r at i o n </w>

The most frequent pair is now (at, i) with 3. Merge into ati:

n ati o n </w>
s t ati o n </w>
r ati o n </w>

Then (ati, o)atio:

n atio n </w>
s t atio n </w>
r atio n </w>

Then (atio, n)ation, then (ation, </w>)ation</w>:

n ation</w>
s t ation</w>
r ation</w>

In five merges, BPE has discovered on its own that -ation is a shared element and turned it into an atomic token. Nobody told it what a suffix was: it emerged from the corpus frequencies.

The most important property appears when we encode words never seen during training:

'nation'   → ['nation</w>']
'station'  → ['s', 't', 'ation</w>']
'ration'   → ['r', 'ation</w>']
'creation' → ['c', 'r', 'e', 'ation</w>']
'fashion'  → ['f', 'a', 's', 'h', 'i', 'o', 'n', '</w>']

"Creation" was never in the training corpus, yet it is handled correctly: the model sees a prefix c r e and the already-known token ation</w>. "Fashion" shares no learned subwords — it falls back entirely to character level. The model never crashes on an out-of-vocabulary word: it decomposes it, down to individual characters in the worst case.


The PHP implementation

The implementation consists of four functions: building the initial vocabulary, counting pairs, applying a merge, and the main training loop.

buildVocab()

Takes raw text and returns an associative array where the keys are the space-separated character representations (with </w>) and the values are frequencies.

function buildVocab(string $text): array
{
    $words = preg_split('/\s+/', strtolower(trim($text)));
    $vocab = [];

    foreach ($words as $word) {
        if ($word === '') continue;
        // "dog" becomes "d o g </w>"
        $key = implode(' ', str_split($word)) . ' </w>';
        $vocab[$key] = ($vocab[$key] ?? 0) + 1;
    }

    return $vocab;
}

getPairFrequencies()

Scans the current vocabulary and counts every pair of adjacent symbols, weighted by the frequency of the word containing them. We use | as an internal separator to avoid confusion with the space that separates symbols in the vocabulary representation.

function getPairFrequencies(array $vocab): array
{
    $pairs = [];

    foreach ($vocab as $word => $freq) {
        $symbols = explode(' ', $word);
        $n = count($symbols);

        for ($i = 0; $i < $n - 1; $i++) {
            $key = $symbols[$i] . '|' . $symbols[$i + 1];
            $pairs[$key] = ($pairs[$key] ?? 0) + $freq;
        }
    }

    return $pairs;
}

mergePair()

Applies a single merge rule to the entire vocabulary. Scans each word symbol by symbol and fuses the pair ($a, $b) every time it appears, advancing to the right without backtracking.

function mergePair(string $a, string $b, array $vocab): array
{
    $merged   = $a . $b;
    $newVocab = [];

    foreach ($vocab as $word => $freq) {
        $symbols = explode(' ', $word);
        $result  = [];
        $i       = 0;

        while ($i < count($symbols)) {
            if (
                $i < count($symbols) - 1
                && $symbols[$i]     === $a
                && $symbols[$i + 1] === $b
            ) {
                $result[] = $merged;
                $i += 2;
            } else {
                $result[] = $symbols[$i];
                $i++;
            }
        }

        $newWord            = implode(' ', $result);
        $newVocab[$newWord] = ($newVocab[$newWord] ?? 0) + $freq;
    }

    return $newVocab;
}

train()

The main loop. At each iteration it finds the most frequent pair using arsort(), saves it as a merge rule in discovery order, updates the vocabulary, and repeats.

function train(string $text, int $numMerges): array
{
    $vocab  = buildVocab($text);
    $merges = [];

    for ($i = 0; $i < $numMerges; $i++) {
        $pairs = getPairFrequencies($vocab);
        if (empty($pairs)) break;

        arsort($pairs);
        $bestKey = array_key_first($pairs);
        [$a, $b] = explode('|', $bestKey);

        $vocab    = mergePair($a, $b, $vocab);
        $merges[] = [$a, $b];
    }

    return $merges;
}

Encoding

Given the merge rules learned during training, encode() applies them — in the exact order they were discovered — to a single word. Order is critical: each rule was learned in the context of all previous rules, so applying them in a different sequence produces wrong tokens.

function encode(string $word, array $merges): array
{
    // Start from the character-level representation + </w>
    $symbols = array_merge(str_split($word), ['</w>']);

    foreach ($merges as [$a, $b]) {
        $merged = $a . $b;
        $i      = 0;

        while ($i < count($symbols) - 1) {
            if ($symbols[$i] === $a && $symbols[$i + 1] === $b) {
                array_splice($symbols, $i, 2, [$merged]);
            } else {
                $i++;
            }
        }
    }

    return $symbols;
}

function encodeText(string $text, array $merges): array
{
    $words  = preg_split('/\s+/', strtolower(trim($text)));
    $tokens = [];

    foreach ($words as $word) {
        if ($word === '') continue;
        $tokens = array_merge($tokens, encode($word, $merges));
    }

    return $tokens;
}

A complete example

<?php

$corpus = <<<TEXT
the dog barks the cat meows the cat runs the dog runs
the dog eats the cat eats the cat drinks the dog drinks
TEXT;

$merges = train($corpus, 20);

echo "Learned merge rules:\n";
foreach ($merges as $i => [$a, $b]) {
    printf("  %2d. %-14s + %-14s → %s\n", $i + 1, "'$a'", "'$b'", "'$a$b'");
}

echo "\nEncoding:\n";
$words = ['dog', 'cat', 'meows', 'barks', 'runs', 'eats', 'drinks', 'wolf'];
foreach ($words as $word) {
    $tokens = encode($word, $merges);
    $fmt    = implode(', ', array_map(fn($t) => "'$t'", $tokens));
    echo "  '$word' → [$fmt]\n";
}

Output:

Learned merge rules:
   1. 't'            + 'h'            → 'th'
   2. 'th'           + 'e'            → 'the'
   3. 'the'          + '</w>'         → 'the</w>'
   4. 's'            + '</w>'         → 's</w>'
   5. 'a'            + 't'            → 'at'
   6. 'd'            + 'o'            → 'do'
   7. 'do'           + 'g'            → 'dog'
   8. 'dog'          + '</w>'         → 'dog</w>'
   9. 'c'            + 'at'           → 'cat'
  10. 'cat'          + '</w>'         → 'cat</w>'
  11. 'k'            + 's</w>'        → 'ks</w>'
  12. 'r'            + 'u'            → 'ru'
  13. 'ru'           + 'n'            → 'run'
  14. 'run'          + 's</w>'        → 'runs</w>'
  15. 'e'            + 'at'           → 'eat'
  16. 'eat'          + 's</w>'        → 'eats</w>'
  17. 'd'            + 'r'            → 'dr'
  18. 'dr'           + 'i'            → 'dri'
  19. 'dri'          + 'n'            → 'drin'
  20. 'drin'         + 'ks</w>'       → 'drinks</w>'

Encoding:
  'dog'     → ['dog</w>']
  'cat'     → ['cat</w>']
  'meows'   → ['m', 'e', 'o', 'w', 's</w>']
  'barks'   → ['b', 'a', 'r', 'ks</w>']
  'runs'    → ['runs</w>']
  'eats'    → ['eats</w>']
  'drinks'  → ['drinks</w>']
  'wolf'    → ['w', 'o', 'l', 'f', '</w>']

Frequent corpus words — dog, cat, runs, eats, drinks — become single tokens. Rare words — meows, barks — are partially decomposed, reusing already-learned tokens like s</w> and ks</w>. The word never seen at all — wolf — falls back completely to character level. No exception, no unknown token: BPE guarantees that any text is representable, at worst one character at a time.


From token to string, from string to number

The tokens produced by encode() are still strings. In production, each token is mapped to an integer — its ID — through a vocabulary built at the end of training. The vocabulary contains the base characters of the corpus (the algorithm's starting point) plus every token generated by a merge rule.

// Builds the vocabulary and assigns an ID to every token
function buildTokenizer(array $merges, string $corpus): array
{
    // Base characters: all unique characters in the corpus + </w>
    preg_match_all('/./u', strtolower($corpus), $m);
    $base = array_values(array_unique(array_filter($m[0], fn($c) => trim($c) !== '')));
    $base[] = '</w>';
    sort($base);

    // Tokens generated by merge rules
    $mergeTokens = array_map(fn([$a, $b]) => $a . $b, $merges);

    $vocab = array_values(array_unique(array_merge($base, $mergeTokens)));
    return array_flip($vocab); // token → id
}

$tokenizer = buildTokenizer($merges, $corpus);

$tokens = encode('dog', $merges);
$ids    = array_map(fn($t) => $tokenizer[$t], $tokens);

echo implode(', ', $ids); // 19

Decoding is the reverse process: given an array of IDs, map each one back to its token with array_flip($tokenizer), then strip the </w> markers and rejoin the words. All the complexity lives in the training phase — encoding and decoding are lookup operations.


Vocabulary size in production models

The final vocabulary is the sum of the base characters plus the number of learned merge rules. With 300 base characters and 10,000 merges, you get ~10,300 tokens. Real models use far larger sizes:

Model Vocabulary size
GPT-2 50,257
LLaMA 2 32,000
LLaMA 3 128,256
GPT-4 ~100,000

A larger vocabulary means shorter sequences — "internationalization" might become 3 tokens instead of 20 — which reduces the amount of context the model must process to understand a sentence. But it also means a larger embedding matrix in memory. The number of merge rules is a hyperparameter that balances these two factors.

One important detail: production models use byte-level BPE, where the base vocabulary is not Unicode characters but all 256 possible bytes. This guarantees that any string — any language, any emoji, any byte sequence — is always representable without </w> markers and without falling back to unknown characters.


Conclusion

BPE has solved one of NLP's fundamental problems elegantly: how to represent an open vocabulary with a finite and manageable set of tokens. The solution does not come from a sophisticated mathematical formula, but from a greedy algorithm on frequencies, applied iteratively.

The fascinating thing is that nobody taught the algorithm what morphemes, prefixes, or suffixes are. It discovered them on its own, by counting pairs. Linguistics emerged from statistics.