GPT Series - Triton 1 (make GPU go brrr)

Recently, I rewrote GPT-2 as an exercise to help me prepare for interviews at big AI companies. After reading the paper and reusing the Shakespeare dataset provided by Karpathy in his nanoGPT project, I started writing the code for the whole model: LayerNorm, the attention layer, the training loop, the feed-forward network (FFN), and positional embedding. I then focused on improving the model by implementing a few features such as: ...

October 6, 2025 · 18 min · 3631 words · Julien Seveno

GPT Series - KV Cache

The KV cache is an important feature in today’s LLM infrastructure. To understand exactly what it brings, let’s recall how LLMs are used for inference. Feel free to read my article about Multi-head Self Attention for more explanation of the variations around the attention layer! When LLMs are used in production to generate text, they generate one token at a time. For example, from the following prompt: ...
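To make this concrete, below is a minimal single-head sketch of a decode step that reuses cached keys and values. The shapes, the random projection weights, and the function name are illustrative assumptions for this preview, not the post’s code:

```python
# Minimal single-head decode step with a KV cache: each new token is projected
# to k/v once, appended to the cache, and attention runs over the whole cache.
# d_model and the random weights are illustrative, not from the post.
import torch

d_model = 64
w_q = torch.randn(d_model, d_model) / d_model ** 0.5
w_k = torch.randn(d_model, d_model) / d_model ** 0.5
w_v = torch.randn(d_model, d_model) / d_model ** 0.5

def decode_step(x_new, k_cache, v_cache):
    """x_new: (1, d_model) embedding of the newly generated token."""
    q = x_new @ w_q                                         # (1, d)
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=0)      # (t + 1, d), reused next step
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=0)      # (t + 1, d)
    attn = torch.softmax(q @ k_cache.T / d_model ** 0.5, dim=-1)  # (1, t + 1)
    return attn @ v_cache, k_cache, v_cache

# Usage: start with empty caches and feed tokens one by one.
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
for _ in range(5):
    x = torch.randn(1, d_model)
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
print(out.shape)  # torch.Size([1, 64])
```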

October 1, 2025 · 6 min · 1211 words · Julien Seveno

GPT Series - Multi-head Self Attention

Attention is now a key component of most AI systems, whether they work with images or with sequences of tokens in language processing. It was introduced by one of the most famous papers in deep learning: Attention Is All You Need. The idea behind attention is to map two sequences to each other (cross-attention) or a sequence to itself (self-attention) and learn how items in the sequences are related. Whether it is to map two sequences in two different languages in the case of translation, or to map tokens from the same sequence to identify links between words such as: ...
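For reference, the core operation behind all of these variants is scaled dot-product attention; the standard formulation from Attention Is All You Need (not quoted from the post) is:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V $$

where $Q$, $K$ and $V$ are the query, key and value projections of the sequences being mapped and $d_k$ is the key dimension.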

September 25, 2025 · 5 min · 985 words · Julien Seveno

GPT Series - Positional Embedding

As we saw earlier, the multi-head self-attention layer assigns the same output to every identical token, regardless of its position. This can cause obvious problems in sentences where the same word is used multiple times to refer to different entities, such as: The red car turned left while the yellow car turned left. The two occurrences of the word “car” refer to different actual cars. They cannot be treated the same way. ...
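One common fix, used by GPT-2, is to add a learned embedding of each position to the token embedding; here is a minimal illustrative sketch (the vocabulary size, context length and model width below are assumptions, not values taken from the post):

```python
# Learned absolute position embeddings, GPT-2 style: each position index gets
# its own vector, added to the token embedding so that identical tokens at
# different positions produce different inputs to the attention layer.
# vocab_size, block_size and d_model are illustrative choices.
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size=50257, block_size=1024, d_model=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)

    def forward(self, idx):                       # idx: (batch, seq_len) token ids
        positions = torch.arange(idx.size(1), device=idx.device)
        return self.tok_emb(idx) + self.pos_emb(positions)  # broadcast over batch

x = torch.tensor([[11, 42, 11]])                  # token 11 appears twice
emb = TokenAndPositionEmbedding()(x)
print(torch.allclose(emb[0, 0], emb[0, 2]))       # False: same token, different positions
```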

September 16, 2025 · 5 min · 999 words · Julien Seveno

Log Derivative Trick

Today, let’s talk about reinforcement learning, and more specifically policy-based reinforcement learning. Policy-based reinforcement learning is when we directly parametrize the policy, meaning we are looking for a policy of the form: $$ \pi_{\theta}(s, a) = p(a | s, \theta) $$ In other words, we are looking for a function that represents the probability of our agent taking a specific action $a$ in a state $s$. Think of a state as the position on a chess board, for instance, and of the action as the move to be played next. ...
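For context, the trick the title refers to is usually stated as follows (this is the standard identity, not an excerpt from the post): since $\nabla_{\theta} \pi_{\theta}(s, a) = \pi_{\theta}(s, a) \nabla_{\theta} \log \pi_{\theta}(s, a)$, the gradient of an expected reward can be rewritten as an expectation that can be estimated by sampling actions from the policy itself:

$$ \nabla_{\theta} \, \mathbb{E}_{a \sim \pi_{\theta}(s, \cdot)}\left[ R(s, a) \right] = \mathbb{E}_{a \sim \pi_{\theta}(s, \cdot)}\left[ R(s, a) \, \nabla_{\theta} \log \pi_{\theta}(s, a) \right] $$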

September 3, 2025 · 4 min · 834 words · Julien Seveno

AI Finetuning Learnings

When fine-tuning or even training a model, hardware resources are often the bottleneck, and with today’s model sizes the limiting factor is usually GPU memory. As an example, let’s take the Qwen 2.5 3B model. As the name says, it contains approximately 3 billion parameters. The model available on HuggingFace is saved in bf16, meaning each parameter is stored as a sign bit (1 bit), an exponent (8 bits), and a significand (7 bits). So each of the 3 billion parameters takes 16 bits, which is 2 bytes, and storing the whole model requires at least 6 billion bytes (6 GB) of memory. ...
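That back-of-the-envelope computation can be written down as a tiny helper; the parameter count and the per-dtype byte sizes below are illustrative assumptions, not values read from the model config:

```python
# Rough memory footprint of the weights alone: n_params * bytes_per_param.
# The 3e9 parameter count and the dtype sizes are illustrative assumptions.
DTYPE_BYTES = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str = "bf16") -> float:
    """Return the memory needed just to hold the weights, in gigabytes."""
    return n_params * DTYPE_BYTES[dtype] / 1e9

if __name__ == "__main__":
    for dtype in ("fp32", "bf16", "int8"):
        print(f"{dtype}: {weight_memory_gb(3e9, dtype):.1f} GB")  # bf16 -> 6.0 GB
```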

August 29, 2025 · 8 min · 1579 words · Julien Seveno

Chain-of-Thought is LLMs prompting themselves

Let’s take the following notations: $f_{\theta}: X \to Y$ is the LLM parametrized by its weights $\theta$, $X$ is the set of tasks (prompts, made of tokens), and $Y$ is the set of answers to those tasks (made of tokens as well). The best parameters for the model are given by: $$ \theta^* = \argmax_{\theta} f_{\theta}(y | x) \text{ with } x \in X, y \in Y $$ When fine-tuning a model to think, the model is trained to output a sequence of tokens (which can be manually curated) in between the prompt and the answer, instead of outputting the final answer straight away. Let’s call this sequence of tokens $c \in C$. The optimal weights are now given by: ...

July 11, 2025 · 2 min · 261 words · Julien Seveno

Nemotron

Nemotron: Advancing Tool-Calling LLMs with Rule-Based Reinforcement Learning 🤖 Large Language Models (LLMs) are becoming increasingly powerful, and their ability to interact with external tools and APIs significantly expands their capabilities. Nvidia’s Nemotron paper introduces an innovative approach to training LLMs for more effective tool use, focusing on a rule-based reinforcement learning (RL) pipeline. This method aims to overcome the common hurdle of requiring large, meticulously curated datasets, allowing models to learn optimal tool-calling strategies more autonomously. ...

June 5, 2025 · 7 min · 1393 words · Julien Seveno

Flash Attention

Transformers have revolutionized the field of machine learning, emerging as the dominant architectural choice across various applications. However, their reliance on self-attention mechanisms introduces significant computational challenges, particularly the quadratic time and memory complexity relative to sequence length. While approximate solutions exist, their limited adoption stems from an overemphasis on theoretical FLOP counts rather than practical performance metrics. In 2022, the FlashAttention paper introduced a way to compute the attention result by working only on sub-vectors, reducing memory I/O. ...
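The underlying idea, a block-wise pass over keys and values with an online softmax, can be sketched outside of any GPU kernel. The PyTorch version below is an illustrative reimplementation of that idea (the block size and shapes are arbitrary choices), not the fused kernel from the paper:

```python
# Block-wise attention with an online softmax: never materialize the full
# (n, n) score matrix; keep a running max and running denominator per query.
# Illustrative sketch only; block size and shapes are arbitrary.
import torch

def blockwise_attention(q, k, v, block=64):
    """softmax(q @ k.T / sqrt(d)) @ v, computed one key/value block at a time."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n,), float("-inf"))   # running max of the logits per query
    row_sum = torch.zeros(n)                    # running softmax denominator per query
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.T * scale                               # (n, block)
        new_max = torch.maximum(row_max, scores.max(dim=1).values)
        correction = torch.exp(row_max - new_max)               # rescale earlier partials
        p = torch.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(dim=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the naive version that materializes the full score matrix.
q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
print(torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-5))  # True
```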

May 18, 2025 · 6 min · 1131 words · Julien Seveno

Webformers

Extracting structured information from web pages remains a challenging task in natural language processing. The regular transformer architecture is not designed to encode hierarchical information: each token attends to every token in the input sequence, regardless of position (even though a few mechanisms, such as positional encoding, introduce positional information). In a web page, information is highly structured. The HTML forms a tree, with each node having a parent and potentially siblings and children. As a result, some nodes may be semantically related while being relatively far from each other if we only count the tokens between them. ...

April 16, 2025 · 8 min · 1492 words · Julien Seveno