MarkupLM - Debugging ML Part 1

When training models, we sometimes encounter strange cases: for example, an obvious sample that is misclassified. This makes us wonder how a model that is apparently doing fine on a whole benchmark can fail on such an easy case. And it makes us realize that our model, or our benchmark, might not be as reliable as we thought. Which is very scary. In this article, we are going to talk about how to understand what the model is actually learning, and what kind of features it might be relying on to achieve the task we are training it for. The difficult thing is that a model is made up of millions, and now very often billions, of parameters, so inspecting each of them is not an option. ...

January 8, 2026 · 14 min · 2779 words · Julien Seveno

Integrated Gradients - Debugging ML Part 2

If you did not read the first part, it is available here: https://bornlex.github.io/posts/markuplm/. The main content can be understood on its own, but for the practical examples, reading the first part will make things easier. Now that the inner workings of the model are clearer, let’s actually talk about debugging it. Debugging a machine learning model is not easy, but there are a few things we can do. In this article, we are going to talk about two of them: ...

January 8, 2026 · 14 min · 2860 words · Julien Seveno

Scraping - Refreshing Frequency

Scraping is a topic I like: gathering data and keeping it fresh is an interesting problem. Among the many things you have to do when scraping, making sure your data is fresh is an important one. Let’s give a few tips to our fellow scrapers. Problem definition We are building an intelligent (hopefully) system that decides when we should revisit a URL. We don’t want to spend too many resources on refreshing too often, and we want our data to be as fresh as possible. ...
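One common heuristic for this trade-off (a sketch of the general idea, not necessarily the approach the article takes; all names and default values below are illustrative) is an adaptive revisit interval: shrink the interval when the page changed since the last fetch, grow it when it did not.

```python
def next_interval(current_interval, changed,
                  factor=2.0, min_interval=60.0, max_interval=86400.0):
    """Adaptive revisit scheduling (illustrative, not from the article):
    halve the wait when the page changed, double it when it did not,
    clamped between a minimum and maximum interval in seconds."""
    if changed:
        interval = current_interval / factor
    else:
        interval = current_interval * factor
    return max(min_interval, min(interval, max_interval))
```

Pages that change often converge toward short intervals, and static pages drift toward the maximum, which keeps the crawl budget focused on fresh content.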

October 23, 2025 · 12 min · 2424 words · Julien Seveno

GPT Series - Triton 1 (make GPU go brrr)

Motivations Basic GPT-2 Recently, I rewrote GPT-2 as an exercise to help me prepare for interviews at big AI companies. After reading the paper and reusing the Shakespeare dataset provided by Karpathy in his nanoGPT project, I started to write the code for the whole model: LayerNorm, attention layer, training loop, feed-forward network (FFN), positional embedding. Model improvements I then focused on improving the model by implementing a few features such as: ...
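As a taste of how small these building blocks are individually, here is a minimal sketch of one of them, LayerNorm (my code, not the article's implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: normalize each token's feature vector to zero mean and
    unit variance, then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```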

October 6, 2025 · 18 min · 3631 words · Julien Seveno

GPT Series - KV Cache

The KV cache is an important feature of today’s LLM infrastructure. To understand exactly what it brings, let’s recall how LLMs are used for inference. Introduction Feel free to read my article about Multi-head Self Attention for more explanation of the variations around the attention layer! When LLMs are used in production to generate text, they generate one word at a time. For example, from the following prompt: ...
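The core idea can be sketched in a few lines (a toy single-head version of my own, not the article's code): because generation is one token at a time, the keys and values of past tokens never change, so we store them and only compute the new token's projections at each step.

```python
import numpy as np

def attention(q, K, V):
    """Single-query scaled dot-product attention over cached keys/values."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over past positions
    return weights @ V                    # (d,)

class KVCache:
    """Toy single-head KV cache: append this step's key/value, then
    attend over everything cached so far."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, k_new, v_new, q_new):
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])
        return attention(q_new, self.K, self.V)
```

Without the cache, every decoding step would recompute K and V for the whole prefix; with it, each step does work proportional to the prefix length instead of recomputing all projections.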

October 1, 2025 · 6 min · 1211 words · Julien Seveno

GPT Series - Multi-head Self Attention

Motivations Attention is now a key component of most AI systems, whether they work with images or with sequences of tokens in language processing. It was introduced by one of the most famous papers in deep learning: Attention Is All You Need. The idea behind attention is to map two sequences to each other (or a sequence to itself, which is called self-attention) and learn how items in the sequences are related to each other. Whether it is to map two sequences in two different languages in the case of translation (cross-attention), or to map tokens from the same sequence to identify links between words such as: ...
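The mapping described above boils down to one formula, which can be sketched as follows (my code, assuming the standard scaled dot-product attention from the paper, not an excerpt from the article):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, m) pairwise relatedness
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # (n, d_v)
```

For self-attention, Q, K and V are projections of the same sequence; for cross-attention, Q comes from one sequence and K, V from the other.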

September 25, 2025 · 5 min · 985 words · Julien Seveno

GPT Series - Positional Embedding

Positional embedding Motivation As we saw earlier, the multi-head self-attention layer assigns the same output to every identical token, regardless of its position. This can cause obvious problems in sentences where the same word is used multiple times to refer to different entities, such as: The red car turned left where the yellow car turned left. The two occurrences of the word “car” refer to different actual cars. They cannot be treated the same way. ...
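The classic fix from Attention Is All You Need is to add a position-dependent vector to each token embedding. A minimal sketch of the sinusoidal variant (my code, not the article's):

```python
import numpy as np

def sinusoidal_positional_embedding(seq_len, d_model):
    """Fixed sinusoidal embeddings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    so every position gets a distinct vector, added to the token embedding."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

With this added in, the two occurrences of “car” enter the attention layer with different vectors and can be treated differently.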

September 16, 2025 · 5 min · 999 words · Julien Seveno

Log Derivation Trick

Introduction Today, let’s talk about reinforcement learning, and more specifically policy-based reinforcement learning. Policy-based reinforcement learning is when we directly parametrize the policy, meaning we are looking for a policy such that: $$ \pi_{\theta}(s, a) = p(a \mid s, \theta) $$ In other words, we are looking for a function that represents the probability of our agent taking a specific action $a$ in a state $s$. Think of a state as the position on the chess board, for instance, and of the action as the move to be played next. ...
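As a preview of the trick the title refers to (my summary, not the article's text): differentiating an expectation taken under the policy is awkward, but since $\nabla_{\theta} \pi_{\theta} = \pi_{\theta} \nabla_{\theta} \log \pi_{\theta}$, the gradient of the expected reward $R(s, a)$ in a fixed state $s$ can be rewritten as an expectation under the policy itself:

$$
\nabla_{\theta} \, \mathbb{E}_{a \sim \pi_{\theta}(s, \cdot)} \left[ R(s, a) \right] = \mathbb{E}_{a \sim \pi_{\theta}(s, \cdot)} \left[ R(s, a) \, \nabla_{\theta} \log \pi_{\theta}(s, a) \right]
$$

which is what makes the gradient estimable from samples drawn from the policy.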

September 3, 2025 · 4 min · 834 words · Julien Seveno

AI Fine-tuning Learnings

When fine-tuning or even training a model, hardware resources are often the bottleneck, and with today’s model sizes, the limiting factor is usually GPU memory. As an example, let’s take a Qwen 2.5 3B model. As the name says, it contains approximately 3 billion parameters. The model available on HuggingFace is saved in bf16, meaning each parameter contains: Sign bit: 1 bit Exponent: 8 bits Significand precision: 7 bits So the total size in memory for 1 parameter among the 3 billion is 16 bits, which is 2 bytes. To store the whole model, the memory will need to be at least 6 billion bytes (6 GB). ...
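That arithmetic is worth wrapping in a tiny helper (my sketch, not from the article), since you end up redoing it for every model and precision:

```python
def model_memory_gb(num_params, bits_per_param=16):
    """Back-of-the-envelope memory for the model weights alone, in GB.
    Gradients, optimizer states and activations add several times more."""
    bytes_total = num_params * bits_per_param // 8
    return bytes_total / 1e9

# A 3B-parameter model in bf16 (16 bits = 2 bytes per weight):
print(model_memory_gb(3_000_000_000))  # 6.0 GB of weights
```

The same helper makes it obvious why quantization helps: dropping to 4 bits per parameter cuts the weight memory by a factor of four.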

August 29, 2025 · 8 min · 1579 words · Julien Seveno

Chain-of-Thought is LLMs prompting themselves

Let’s take the following notations: $f_{\theta}: X \to Y$ the LLM parametrized by its weights $\theta$; $X$ the set of tasks (prompts, made of tokens); $Y$ the set of answers to those tasks (made of tokens as well). The best parameters for the model are given by: $$ \theta^* = \arg\max_{\theta} f_{\theta}(y \mid x) \text{ with } x \in X, y \in Y $$ When fine-tuning a model to think, the model is trained to produce a sequence of tokens in between the prompt and the answer, which can be manually curated, instead of outputting the final answer straight away. Let’s call this sequence of tokens $c \in C$. The optimal weights are now given by: ...

July 11, 2025 · 2 min · 261 words · Julien Seveno