The Journey of a Token: What Really Happens Inside a Transformer

AllTopicsToday
Published: December 25, 2025
Last updated: December 25, 2025 6:57 am

In this article, you'll learn how a transformer turns an input token into a context-aware representation and, ultimately, into a probability distribution over the next token.

Topics covered include:

  • How tokenization, embedding, and positional information prepare the inputs
  • How multi-head attention and feedforward networks contribute within each layer
  • How the final projection and softmax produce the next-token probabilities

Let’s begin the journey.

The Token Journey: What's Really Happening Inside a Transformer
Image by editor

The beginning of the journey

Large language models (LLMs) are based on transformer architectures: complex deep neural networks that take a sequence of token embeddings as input. After a deep process that looks like a parade of many stacked attention and feedforward transformations, the output is a probability distribution indicating which token should be produced next as part of the model's response. But how can this process from input to output be explained for a single token in the input sequence?

In this article, you'll learn what happens inside the transformer model (the architecture behind LLMs) at the token level. In other words, you'll see how an input token, one piece of an input text sequence, turns into generated text output, and the reasoning behind the changes and transformations that take place inside the transformer.

The explanation of this process is based on the diagram above, which shows the general transformer architecture and how information flows and evolves through it.

Transformer input: from raw input text to input embedding

Before going deep into the transformer model, some transformations have already been applied to the input text, mainly to ensure that it is represented in a form that the transformer's internal layers can fully work with.

Tokenization

The tokenizer is an algorithmic component that typically works hand in hand with the LLM's transformer model. It takes a raw text sequence, such as a user prompt, breaks it into individual tokens (often subword units or bytes, sometimes even whole words), and maps each token in the source text to an integer identifier i.
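As a sketch, tokenization can be pictured as a lookup from text pieces to integer identifiers. The tiny vocabulary and whitespace splitting below are illustrative assumptions; real LLM tokenizers use learned subword schemes such as byte-pair encoding.

```python
# Toy tokenizer: maps whitespace-split words to integer ids via a small
# hypothetical vocabulary; real tokenizers split into learned subword units.
vocab = {"the": 0, "king": 1, "rules": 2, "<unk>": 3}

def tokenize(text):
    # Unknown words fall back to the <unk> identifier.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The king rules"))  # [0, 1, 2]
```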

Embedding the token

There is a trained embedding table E of shape |V| × d (vocabulary size by embedding dimension). Looking up the identifiers of a sequence of length n yields an embedding matrix X of shape n × d. That is, each token identifier is mapped to a d-dimensional embedding vector that forms one row of X. Two embedding vectors are close to each other if they belong to tokens with similar meanings (e.g., king and emperor). Importantly, at this stage, each token embedding conveys the semantic and lexical information of that single token alone, without incorporating information about the rest of the sequence (at least not yet).
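The lookup from the table E into the matrix X can be sketched with NumPy; the random table below is a stand-in for learned parameters, not a trained model.

```python
import numpy as np

# Embedding lookup: a random table stands in for the trained table E.
rng = np.random.default_rng(0)
V, d = 4, 3                    # vocabulary size |V|, embedding dimension d
E = rng.normal(size=(V, d))    # embedding table of shape |V| x d

token_ids = [0, 1, 2]          # identifiers produced by the tokenizer
X = E[token_ids]               # embedding matrix of shape n x d
print(X.shape)                 # (3, 3)
```

Each row of X is simply the row of E indexed by the corresponding token identifier.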

Positional encoding

Before we fully enter the core part of the transformer, we need to inject into each token embedding vector, and therefore into each row of the embedding matrix X, information about the position of that token in the sequence. This is also called positional injection and is commonly done using trigonometric functions such as sine and cosine, although there are also approaches based on learned positional embeddings. The positional component is added to the embedding vector e_t associated with the token as follows:

\[
x_t^{(0)} = e_t + p_{\text{pos}}
\]

where p_pos is the positional encoding vector for the token's position in the sequence.
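A minimal sketch of the sinusoidal scheme, assuming an even embedding dimension: each row of the result is the p_pos vector added to the corresponding token embedding e_t.

```python
import numpy as np

# Sinusoidal positional encoding: sin on even dimensions, cos on odd ones.
def positional_encoding(n, d):
    pos = np.arange(n)[:, None]              # token positions 0..n-1
    i = np.arange(d // 2)[None, :]           # index over dimension pairs
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

E_rows = np.zeros((3, 4))                # stand-in token embeddings e_t
X0 = E_rows + positional_encoding(3, 4)  # x_t^(0) = e_t + p_pos, row-wise
print(X0[0])                             # position 0: [0. 1. 0. 1.]
```

Because each row depends only on its position, two occurrences of the same word at different positions now get distinct input vectors.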

Now let’s go deeper into the transformer and see what’s going on inside.

Deep inside transformers: from input embeddings to output probabilities

Let’s explain what happens as each “enriched” single-token embedding vector passes through one transformer layer, and then zoom out to explain what happens across the entire stack of layers.

The formula

\[
h_t^{(0)} = x_t^{(0)}
\]

denotes the representation of a token at layer 0 (the first layer); more generally, h_t^(l) denotes the embedding representation of a token at layer l.

Multi-head attention

The first major component inside each replicated layer of the transformer is multi-head attention. This is probably the most influential component in the entire architecture when it comes to identifying, and incorporating into each token's representation, a wealth of meaningful information about its role in the overall sequence and its relationships to other tokens in the text, such as syntactic, semantic, or other kinds of linguistic relationships. The multiple heads of this so-called attention mechanism each specialize in simultaneously capturing different linguistic aspects and patterns of the token and of the sequence it belongs to.

The result of passing the token representation h_t^(l) (pre-injected with positional information, remember!) through this multi-head attention within the layer is a context-enriched, or context-aware, token representation. Thanks to residual connections and layer normalization within the transformer layers, the newly generated vector becomes a stable blend of the token's own previous representation and the multi-head attention output. This helps maintain consistency across the transformations applied iteratively from layer to layer.
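The core computation can be sketched as single-head scaled dot-product attention with a residual connection; the random matrices stand in for learned projections, and layer normalization is omitted for brevity.

```python
import numpy as np

# Single-head scaled dot-product attention with a residual connection.
rng = np.random.default_rng(0)
n, d = 3, 4                          # sequence length, model dimension
H = rng.normal(size=(n, d))          # token representations h_t^(l)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = H @ Wq, H @ Wk, H @ Wv     # queries, keys, values
scores = Q @ K.T / np.sqrt(d)        # pairwise similarity, shape n x n
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
context = weights @ V                # context-aware mix of all tokens
H_next = H + context                 # residual: blend old and new
print(weights.sum(axis=-1))          # each row sums to 1
```

Multi-head attention runs several such heads in parallel on lower-dimensional projections and concatenates their outputs before the residual addition.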

Feedforward neural network

Next come one or more relatively simple feedforward neural network (FFN) layers. For example, these may be token-wise multilayer perceptrons (MLPs), whose goal is to further transform and refine the token features being learned.

The main difference between the attention stage and this stage is that while attention mixes and incorporates context information across all tokens into each token's representation, the FFN step is applied to each token independently, refining the already integrated context patterns and distilling useful "knowledge" from them. These layers are also complemented by residual connections and layer normalization, and the process yields an updated representation h_t^(l+1) at the end of the transformer layer, which becomes the input to the next transformer layer, entering another multi-head attention block.
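The token independence of the FFN step can be seen in a small sketch: the same two-layer MLP applied to the whole matrix equals applying it row by row (random weights stand in for learned ones; normalization omitted).

```python
import numpy as np

# Token-wise feedforward network (two-layer ReLU MLP) with a residual.
rng = np.random.default_rng(0)
n, d, d_ff = 3, 4, 8                 # tokens, model dim, hidden dim
H = rng.normal(size=(n, d))          # token representations from attention
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

def ffn(h):
    return np.maximum(0, h @ W1 + b1) @ W2 + b2

H_next = H + ffn(H)                  # residual connection
# The FFN sees each token alone: row-by-row application gives the same result.
rowwise = np.stack([ffn(H[i]) for i in range(n)])
print(np.allclose(H + rowwise, H_next))  # True
```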

This entire process is repeated for as many stacked layers as the architecture defines, gradually enriching the tokens with higher-level, abstract, and long-range linguistic information hidden behind the seemingly indecipherable numbers.

Final destination

So what happens at the end? At the top of the stack, after passing through the last replicated transformer layer, we obtain the final token representation h_{t*}^(L) (where t* denotes the position currently being predicted), which is projected through the linear output layer and a subsequent softmax.

The linear layer produces unnormalized scores called logits, and softmax converts these logits into next-token probabilities.

Calculating the logits:

\[
\text{logits}_j = W_{\text{vocab}, j} \cdot h_{t^*}^{(L)} + b_j
\]

Compute normalized probabilities by applying softmax:

\[
\text{softmax}(\text{logits})_j = \frac{\exp(\text{logits}_j)}{\sum_{k} \exp(\text{logits}_k)}
\]

Use the softmax output as the probability of the next token:

\[
P(\text{token} = j) = \text{softmax}(\text{logits})_j
\]

These probabilities are computed for every token in the vocabulary, and the next token generated by the LLM is then chosen. The most probable token is often selected, but sampling-based decoding strategies are also common.
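The three formulas above can be sketched end to end, with random stand-ins for W_vocab and the final hidden state, followed by greedy selection of the next token.

```python
import numpy as np

# Final projection and softmax: h_{t*}^(L) -> logits -> probabilities.
rng = np.random.default_rng(0)
d, V = 4, 6                          # model dimension, vocabulary size
h_final = rng.normal(size=d)         # final representation h_{t*}^(L)
W_vocab = rng.normal(size=(V, d))    # output projection (stand-in weights)
b = np.zeros(V)

logits = W_vocab @ h_final + b       # unnormalized scores, one per token
exp = np.exp(logits - logits.max())  # subtract max for numerical stability
probs = exp / exp.sum()              # softmax: probabilities summing to 1
next_token = int(np.argmax(probs))   # greedy decoding; sampling also common
print(round(probs.sum(), 6), next_token)
```

Swapping the `argmax` for a draw from `probs` (e.g. with `rng.choice`) gives a simple sampling-based decoder instead.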

End of the journey

This article has given you a general understanding, with a light level of technical detail, of what happens to the text provided to an LLM (the most prominent model based on the transformer architecture) as it passes through the transformer, and of how this text is processed and transformed inside the model at the token level, ultimately producing the model's output: the next word.

We hope you enjoyed this journey with us. We look forward to taking you on another one in the near future.
