GPT Series - Triton 1 (make GPU go brrr)

Recently, I rewrote GPT-2 as an exercise to help me prepare for interviews at big AI companies. After reading the paper and reusing the Shakespeare dataset provided by Karpathy in his nanoGPT project, I started writing the code for the whole model: LayerNorm, the attention layer, the training loop, the feed-forward network (FFN), and positional embedding. I then focused on improving the model by implementing a few features such as: ...

October 6, 2025 · 18 min · 3631 words · Julien Seveno

GPT Series - KV Cache

The KV cache is an important feature in today’s LLM infrastructure. To understand exactly what it brings, let’s recall how LLMs are used for inference. Feel free to read my article about Multi-head Self Attention for more explanation of the variations around the attention layer! When LLMs are used in production to generate text, they generate one token at a time. For example, from the following prompt: ...
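To make this concrete, below is a minimal single-head sketch of a decode step that reuses cached keys and values. The shapes, the random projection weights, and the function name are illustrative assumptions for this preview, not the post’s code:

```python
# Minimal single-head decode step with a KV cache: each new token is projected
# to k/v once, appended to the cache, and attention runs over the whole cache.
# d_model and the random weights are illustrative, not from the post.
import torch

d_model = 64
w_q = torch.randn(d_model, d_model) / d_model ** 0.5
w_k = torch.randn(d_model, d_model) / d_model ** 0.5
w_v = torch.randn(d_model, d_model) / d_model ** 0.5

def decode_step(x_new, k_cache, v_cache):
    """x_new: (1, d_model) embedding of the newly generated token."""
    q = x_new @ w_q                                         # (1, d)
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=0)      # (t + 1, d), reused next step
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=0)      # (t + 1, d)
    attn = torch.softmax(q @ k_cache.T / d_model ** 0.5, dim=-1)  # (1, t + 1)
    return attn @ v_cache, k_cache, v_cache

# Usage: start with empty caches and feed tokens one by one.
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
for _ in range(5):
    x = torch.randn(1, d_model)
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
print(out.shape)  # torch.Size([1, 64])
```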

October 1, 2025 · 6 min · 1211 words · Julien Seveno

GPT Series - Multi-head Self Attention

Attention is now a key component of most AI systems, whether they work with images or with sequences of tokens in language processing. It was introduced by one of the most famous papers in deep learning: Attention Is All You Need. The idea behind attention is to map two sequences to each other (cross-attention) or a sequence to itself (self-attention) and learn how items in the sequences are related. Whether it is to map two sequences in two different languages in the case of translation, or to map tokens from the same sequence to identify links between words such as: ...
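For reference, the core operation behind all of these variants is scaled dot-product attention; the standard formulation from Attention Is All You Need (not quoted from the post) is:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V $$

where $Q$, $K$ and $V$ are the query, key and value projections of the sequences being mapped and $d_k$ is the key dimension.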

September 25, 2025 · 5 min · 985 words · Julien Seveno

GPT Series - Positional Embedding

As we saw earlier, the multi-head self-attention layer assigns the same output to every identical token, regardless of its position. This can cause obvious problems in sentences where the same word is used multiple times to refer to different entities, such as: The red car turned left while the yellow car turned left. The two occurrences of the word “car” refer to different actual cars. They cannot be treated the same way. ...
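One common fix, used by GPT-2, is to add a learned embedding of each position to the token embedding; here is a minimal illustrative sketch (the vocabulary size, context length and model width below are assumptions, not values taken from the post):

```python
# Learned absolute position embeddings, GPT-2 style: each position index gets
# its own vector, added to the token embedding so that identical tokens at
# different positions produce different inputs to the attention layer.
# vocab_size, block_size and d_model are illustrative choices.
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size=50257, block_size=1024, d_model=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)

    def forward(self, idx):                       # idx: (batch, seq_len) token ids
        positions = torch.arange(idx.size(1), device=idx.device)
        return self.tok_emb(idx) + self.pos_emb(positions)  # broadcast over batch

x = torch.tensor([[11, 42, 11]])                  # token 11 appears twice
emb = TokenAndPositionEmbedding()(x)
print(torch.allclose(emb[0, 0], emb[0, 2]))       # False: same token, different positions
```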

September 16, 2025 · 5 min · 999 words · Julien Seveno

Log Derivative Trick

Today, let’s talk about reinforcement learning, and more specifically policy-based reinforcement learning. Policy-based reinforcement learning is when we directly parametrize the policy, meaning we are looking for a policy of the form: $$ \pi_{\theta}(s, a) = p(a | s, \theta) $$ In other words, we are looking for a function that represents the probability of our agent taking a specific action $a$ in a state $s$. Think of a state as the position on a chess board, for instance, and of the action as the move to be played next. ...
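For context, the trick the title refers to is usually stated as follows (this is the standard identity, not an excerpt from the post): since $\nabla_{\theta} \pi_{\theta}(s, a) = \pi_{\theta}(s, a) \nabla_{\theta} \log \pi_{\theta}(s, a)$, the gradient of an expected reward can be rewritten as an expectation that can be estimated by sampling actions from the policy itself:

$$ \nabla_{\theta} \, \mathbb{E}_{a \sim \pi_{\theta}(s, \cdot)}\left[ R(s, a) \right] = \mathbb{E}_{a \sim \pi_{\theta}(s, \cdot)}\left[ R(s, a) \, \nabla_{\theta} \log \pi_{\theta}(s, a) \right] $$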

September 3, 2025 · 4 min · 834 words · Julien Seveno

AI Finetuning Learnings

When fine-tuning or even training a model, hardware resources are often the bottleneck, and with today’s model sizes the limiting factor is usually GPU memory. As an example, let’s take the Qwen 2.5 3B model. As the name says, it contains approximately 3 billion parameters. The model available on HuggingFace is saved in bf16, meaning each parameter is stored as a sign bit (1 bit), an exponent (8 bits), and a significand (7 bits). So each of the 3 billion parameters takes 16 bits, which is 2 bytes, and storing the whole model requires at least 6 billion bytes (6 GB) of memory. ...
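That back-of-the-envelope computation can be written down as a tiny helper; the parameter count and the per-dtype byte sizes below are illustrative assumptions, not values read from the model config:

```python
# Rough memory footprint of the weights alone: n_params * bytes_per_param.
# The 3e9 parameter count and the dtype sizes are illustrative assumptions.
DTYPE_BYTES = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str = "bf16") -> float:
    """Return the memory needed just to hold the weights, in gigabytes."""
    return n_params * DTYPE_BYTES[dtype] / 1e9

if __name__ == "__main__":
    for dtype in ("fp32", "bf16", "int8"):
        print(f"{dtype}: {weight_memory_gb(3e9, dtype):.1f} GB")  # bf16 -> 6.0 GB
```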

August 29, 2025 · 8 min · 1579 words · Julien Seveno

Chain-of-Thought is LLMs prompting themselves

Let’s take the following notations: $f_{\theta}: X \to Y$ is the LLM parametrized by its weights $\theta$, $X$ is the set of tasks (prompts, made of tokens), and $Y$ is the set of answers to those tasks (made of tokens as well). The best parameters for the model are given by: $$ \theta^* = \argmax_{\theta} f_{\theta}(y | x) \text{ with } x \in X, y \in Y $$ When fine-tuning a model to think, the model is trained to output a sequence of tokens (which can be manually curated) in between the prompt and the answer, instead of outputting the final answer straight away. Let’s call this sequence of tokens $c \in C$. The optimal weights are now given by: ...

July 11, 2025 · 2 min · 261 words · Julien Seveno

Nemotron

Nemotron: Advancing Tool-Calling LLMs with Rule-Based Reinforcement Learning 🤖 Large Language Models (LLMs) are becoming increasingly powerful, and their ability to interact with external tools and APIs significantly expands their capabilities. Nvidia’s Nemotron paper introduces an innovative approach to training LLMs for more effective tool use, focusing on a rule-based reinforcement learning (RL) pipeline. This method aims to overcome the common hurdle of requiring large, meticulously curated datasets, allowing models to learn optimal tool-calling strategies more autonomously. ...

June 5, 2025 · 7 min · 1393 words · Julien Seveno

Flash Attention

Transformers have revolutionized the field of machine learning, emerging as the dominant architectural choice across various applications. However, their reliance on self-attention mechanisms introduces significant computational challenges, particularly the quadratic time and memory complexity relative to sequence length. While approximate solutions exist, their limited adoption stems from an overemphasis on theoretical FLOP counts rather than practical performance metrics. In 2022, the FlashAttention paper introduced a way to compute the attention result by working only on sub-vectors, reducing memory I/O. ...
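The underlying idea, a block-wise pass over keys and values with an online softmax, can be sketched outside of any GPU kernel. The PyTorch version below is an illustrative reimplementation of that idea (the block size and shapes are arbitrary choices), not the fused kernel from the paper:

```python
# Block-wise attention with an online softmax: never materialize the full
# (n, n) score matrix; keep a running max and running denominator per query.
# Illustrative sketch only; block size and shapes are arbitrary.
import torch

def blockwise_attention(q, k, v, block=64):
    """softmax(q @ k.T / sqrt(d)) @ v, computed one key/value block at a time."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n,), float("-inf"))   # running max of the logits per query
    row_sum = torch.zeros(n)                    # running softmax denominator per query
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.T * scale                               # (n, block)
        new_max = torch.maximum(row_max, scores.max(dim=1).values)
        correction = torch.exp(row_max - new_max)               # rescale earlier partials
        p = torch.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(dim=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the naive version that materializes the full score matrix.
q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
print(torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-5))  # True
```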

May 18, 2025 · 6 min · 1131 words · Julien Seveno

Webformers

Extracting structured information from web pages remains a challenging task in natural language processing. The regular transformer architecture is not designed to encode hierarchical information: each token attends to every token in the input sequence, regardless of position (even though a few mechanisms, such as positional encoding, introduce positional information). In a web page, information is highly structured. The HTML forms a tree, with each node having a parent and potentially siblings and children. As a result, some nodes may be semantically related while being relatively far from each other if we only count the tokens between them. ...

April 16, 2025 · 8 min · 1492 words · Julien Seveno