LLMs, Transformers and a little math—Oh My!

Generative AI is powered by an advanced technology that you may have heard of but might not fully understand—a Large Language Model (LLM).

At the core of these LLMs is a breakthrough technology introduced by Google in 2017 called the Transformer. But what exactly is a Transformer?

Think back to a time before AI chatbots were common—maybe you’ve used Google Translate to convert text from one language to another. Imagine you’re at an Italian restaurant, and the menu is in Italian. With Google Translate, you can instantly see the English version. The Transformer technology made this process much more effective by allowing AI to “pay attention” to key parts of a sentence, understanding the context and meaning of words in a more sophisticated way.

This ability to focus on relevant context dramatically improved both the accuracy and efficiency of language translation—and it didn’t stop there. The same underlying technology now powers today’s most advanced AI models, enabling them to generate human-like text, answer questions, and even create new content with remarkable fluency.

In this article, we take a look at some of the internal mechanisms of the transformer architecture that underlies LLM technology, along with a bit of the math needed to understand those mechanisms.

Step 1: Tokenization – Mapping Text to Numbers

LLMs process massive amounts of text, and they do it efficiently using parallel computing—the same technology that powers graphics cards in video games. Therefore, the text must be converted into some form of number representation for efficient processing.

The first step in this process is tokenization, which transforms words into smaller, meaningful units called tokens. Two simple approaches would be using individual characters as tokens, or using whole words as tokens. Both have downsides—character-based tokens require processing too many elements, while word-based tokens struggle with words outside the known vocabulary.

A smart compromise is sub-word tokenization, where common character sequences (like “ing” or “the”) are grouped into tokens. This balances efficiency, reducing storage needs while keeping important context intact.

The second step in the process is numericalization. This process is simple: you put the tokens into a table and number the entries, and from then on you refer to each token by its index in the table.
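To make these two steps concrete, here is a minimal sketch in Python. The tiny vocabulary, the sample sentence, and the greedy longest-match rule are all invented for illustration; real sub-word tokenizers (byte-pair encoding, for example) learn their vocabularies from large amounts of text.

```python
# A toy sub-word tokenizer: the vocabulary and the greedy longest-match
# strategy are illustrative only; real tokenizers learn their merges from data.
vocab = ["the", "read", "ing", "er", "play", "s", " ", "r", "e", "a", "d"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}  # numericalization table

def tokenize(text):
    """Greedily match the longest known sub-word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = next(
            (tok for tok in sorted(vocab, key=len, reverse=True)
             if text.startswith(tok, i)),
            text[i],  # fall back to the raw character if nothing matches
        )
        tokens.append(match)
        i += len(match)
    return tokens

tokens = tokenize("the reader plays")
ids = [token_to_id.get(tok, -1) for tok in tokens]  # -1 marks unknown tokens
print(tokens)  # ['the', ' ', 'read', 'er', ' ', 'play', 's']
print(ids)     # [0, 6, 1, 3, 6, 4, 5]
```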

Step 2: Embeddings – Storing Token Characteristics

An embedding is a vector representation of the semantic and syntactic characteristics of a token. Each decimal number in the vector represents some learned aspect, or feature, of the token. However, because these features are learned during training, they may not correspond to anything intuitively meaningful.

Instead, it may be better to think of an embedding vector as a graphically represented arrow in space. For a little math review, you might remember plotting 3D arrows on a graph by taking each dimension as a coordinate; a three-dimensional vector would have (x, y, z) coordinates, and the arrow would be drawn from the graph’s origin (0, 0, 0) to the point (x, y, z) represented by the vector.

Thinking about embeddings as graphical vectors in space is useful because it lets you reason about how similar or different they are. Two arrows with similar length pointing in similar directions would represent tokens that have similar semantic meaning, and vice versa. You only have to accept that the number of dimensions may be 256, 1024, 2048—nothing you would practically plot on a graph. But you can see the point. And mathematically, if the vectors are normalized to unit length (a magnitude of 1), the same simple operation used to find the angle between two vectors (the inner, or dot, product) gives the cosine of that angle, which serves as a precise measure of their similarity.
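As a quick sketch of that idea, the snippet below compares made-up embedding vectors with the dot product after normalizing them to unit length. The vectors, their values, and their dimensionality are invented for illustration and are far smaller than real embeddings.

```python
import numpy as np

# Toy 4-dimensional "embeddings" (real ones have hundreds or thousands
# of dimensions); the values here are invented for illustration.
cat = np.array([0.9, 0.2, 0.1, 0.4])
kitten = np.array([0.8, 0.3, 0.2, 0.5])
car = np.array([0.1, 0.9, 0.7, 0.0])

def cosine_similarity(a, b):
    """Normalize to unit length, then take the dot product (cosine of the angle)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

print(cosine_similarity(cat, kitten))  # close to 1.0: similar direction
print(cosine_similarity(cat, car))     # noticeably smaller: different direction
```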

Note that when choosing a pre-trained model, you must be careful to use the same embedding model for your input as the one used to train the model you choose!

Step 3: Self Attention – Queries, Keys, and Values

The next phase of the Transformer architecture involves self-attention. That is, what tokens should the model focus on when trying to perform a task (e.g., summarization, text completion, filling in the blanks)?

During the training process, the model itself learns what it is you’re trying to ask it (queries), what the key characteristics are that would help it answer (keys), and what aspects of answers might be best (values). Now, this is really interesting, isn’t it? Rather than explicitly encoding what queries a given task calls for, the model itself learns what a given task is really asking!

It learns these using an artificial neural network or ANN. You can think of an ANN as a learned mathematical function that is trained on a set of inputs to produce a set of expected outputs for each input. The result allows inputs that it hasn’t seen before to produce a reasonable expected output based on what it has seen during training. The mathematical function is represented by a matrix of weights designated by the letter ‘W’. The ANNs for the learned queries, keys, and values are therefore represented by Wq, Wk, and Wv, respectively.

The attention layer contains multiple separate QKV ANNs, called heads, that are configured to learn independently; the original Transformer used 8 per layer, and large modern models may use close to a hundred. Each head works on a lower-dimensional slice of the input embeddings. Using mathematical terminology, this is called a linear projection. (When you “project” from 3D to 2D, you essentially drop one of the dimensions.) So, by learning many different aspects of the problem in parallel, the heads’ outputs can be combined and matched against an input to get the best context for completing the task.

Finally, you can understand the QKV equations that represent the process of the attention layer in the transformer architecture. If X represents the embedding matrix of the input, then Q = XWq, K = XWk, V = XWv.
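To see how those pieces fit together, here is a minimal sketch of a single attention head in NumPy. The matrix sizes and random weights are invented for illustration; in a real model, Wq, Wk, and Wv are learned during training, and the layer also applies masking and biases.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_head = 4, 8, 8       # tiny, illustrative sizes
X = rng.normal(size=(seq_len, d_model))  # embeddings for 4 input tokens

# Projection matrices (random here; learned during training in practice)
Wq = rng.normal(size=(d_model, d_head))
Wk = rng.normal(size=(d_model, d_head))
Wv = rng.normal(size=(d_model, d_head))

Q, K, V = X @ Wq, X @ Wk, X @ Wv         # Q = XWq, K = XWk, V = XWv

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: how much each token attends to every other token
scores = Q @ K.T / np.sqrt(d_head)       # (seq_len, seq_len) attention scores
weights = softmax(scores, axis=-1)       # each row sums to 1
output = weights @ V                     # context-aware representation per token

print(weights.round(2))                  # the "attention" each token pays to the others
print(output.shape)                      # (4, 8)
```

A real attention layer repeats this computation once per head, each with its own (smaller) projection matrices, and concatenates the heads’ outputs before passing them on.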

Steps 4-5: What’s the Next Word?

The model’s final stage produces a vector of logits—raw scores for each token—which are then converted into a probability distribution. As the transformer generates text, it selects the next token from among the highest-probability candidates, so that the choice best fits the evolving context of the query.

Why are they called logits? The term ‘logit’ historically comes from the log-odds used in binary logistic regression (an algorithm for solving binary classification problems); the logistic (sigmoid) function inverts the log-odds in the binary case. In multi-class settings like transformers, these pre-normalized scores—logits—are transformed into probabilities using the softmax function, a generalization of the logistic (sigmoid) function.

For each token t in the vocabulary, the logit is computed as the dot product of the final hidden state (a context-rich representation of the input so far) with a learned weight vector corresponding to token t, plus a learned bias term. For efficiency, this is typically done as a single matrix multiplication with a weight matrix that has one column per token. The weights used here are often tied to the embedding matrix.
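A rough sketch of that computation in NumPy; the sizes, random weights, and bias are made up for illustration, and in a real model they would be learned:

```python
import numpy as np

d_model, vocab_size = 8, 5                  # tiny, illustrative sizes
rng = np.random.default_rng(1)

h = rng.normal(size=(d_model,))             # final hidden state for the current position
W = rng.normal(size=(d_model, vocab_size))  # one weight column per vocabulary token
b = np.zeros(vocab_size)                    # learned bias term

logits = h @ W + b                          # one raw score per token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax: scores -> probability distribution

print(probs.round(3), probs.sum())          # probabilities summing to 1
```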

While a common approach is to select the word with the highest probability, other strategies—such as sampling with temperature adjustments or top-k sampling—are also used to generate more varied and human-like text. The softmax temperature (T) is a common user-configurable parameter for LLMs. It scales the logits, making their differences more (T < 1) or less (T > 1) pronounced, so the highest-probability token stands out more or less sharply. This makes the choice of the “next word” correspondingly less or more random.
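Here is a minimal sketch of temperature scaling and top-k sampling with made-up logits; real decoders often combine these with other strategies, such as nucleus (top-p) sampling.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])  # made-up scores for a 5-token vocabulary

for T in (0.5, 1.0, 2.0):
    probs = softmax(logits / T)          # temperature: T < 1 sharpens, T > 1 flattens
    print(f"T={T}: {probs.round(3)}")

def sample_top_k(logits, k, temperature=1.0):
    """Keep only the k highest-scoring tokens, then sample from the renormalized distribution."""
    top = np.argsort(logits)[-k:]        # indices of the k largest logits
    probs = softmax(logits[top] / temperature)
    return int(rng.choice(top, p=probs))

print(sample_top_k(logits, k=3, temperature=0.8))  # index of the sampled token
```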

Until next time…

That’s my brief summary of how Transformers and LLMs work. I hope you found it interesting. Until next time…
