GPT Series - Positional Embedding
Positional embedding Motivation As we saw earlier, the multi-head self-attention layer assigns the same output to every identical token, regardless of its position. This causes obvious problems in sentences where the same word is used multiple times to refer to different entities, such as: The red car turned left while the yellow car turned left. The two occurrences of the word “car” refer to two different actual cars. They cannot be treated the same way. ...
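As a minimal illustration (my own sketch, not necessarily the scheme the post uses), the classic sinusoidal positional encoding gives every position a distinct vector that can be added to the token embeddings, so the two “car” tokens no longer look identical to the attention layer:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal encoding from "Attention Is All You Need".

    Assumes an even d_model.
    """
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The two occurrences of "car" now receive different inputs:
# embedding("car") + pe[3] differs from embedding("car") + pe[8].
pe = sinusoidal_positional_encoding(seq_len=16, d_model=64)
assert not np.allclose(pe[3], pe[8])
```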
Log Derivation Trick
Introduction Today, let’s talk about reinforcement learning, and more specifically policy-based reinforcement learning. Policy-based reinforcement learning is when we directly parametrize the policy, meaning we are looking for a policy such that: $$ \pi_{\theta}(s, a) = p(a | s, \theta) $$ In other words, we are looking for a function that represents the probability of our agent taking a specific action $a$ in a state $s$. Think of the state as the position on a chess board, for instance, and of the action as the move to be played next. ...
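To make the parametrization concrete, here is a minimal sketch (my own illustration, not code from the post) of a discrete softmax policy with a linear score function:

```python
import numpy as np

def softmax_policy(theta: np.ndarray, state: np.ndarray) -> np.ndarray:
    """pi_theta(s, a) = p(a | s, theta): a probability distribution over actions.

    theta has shape (n_actions, state_dim); state has shape (state_dim,).
    """
    scores = theta @ state      # one score per action
    scores -= scores.max()      # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 8))   # 4 actions, 8 state features
state = rng.normal(size=8)
pi = softmax_policy(theta, state)
action = rng.choice(4, p=pi)      # sample an action from the policy
```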
AI Finetuning Learnings
When fine-tuning or even training a model, hardware resources are often the bottleneck, and with today’s model sizes, the limiting factor is usually GPU memory. As an example, let’s take the Qwen 2.5 3B model. As the name suggests, it contains approximately 3 billion parameters. The model available on HuggingFace is saved in bf16, meaning each parameter contains: Sign: 1 bit Exponent: 8 bits Significand precision: 7 bits So the total size in memory for 1 parameter among the 3 billion is 16 bits, which is 2 bytes. To store the whole model, the memory needs to be at least 6 billion bytes (6 GB). ...
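The arithmetic can be checked in a few lines; a quick sketch (the 3 billion figure is the rounded count from the model name):

```python
# bf16 layout: 1 sign bit + 8 exponent bits + 7 significand bits = 16 bits
bits_per_param = 1 + 8 + 7
bytes_per_param = bits_per_param // 8        # 2 bytes

n_params = 3e9                               # Qwen 2.5 3B, rounded
weights_bytes = n_params * bytes_per_param   # 6e9 bytes

print(f"{weights_bytes / 1e9:.0f} GB just for the weights")  # -> 6 GB
```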
Chain-of-Thought is LLMs prompting themselves
Let’s take the following notations: $f_{\theta}: X \to Y$ the LLM parametrized by its weights $\theta$ $X$ the set of tasks (prompts, made of tokens) $Y$ the set of answers to those tasks (made of tokens as well) The best parameters for the model are given by: $$ \theta^* = \argmax_{\theta} f_{\theta}(y | x) \text{ with } x \in X, y \in Y $$ When fine-tuning a model to think, the model is trained to produce a sequence of tokens in between the prompt and the answer, which can be manually curated, instead of outputting the final answer straight away. Let’s call this sequence of tokens $c \in C$. The optimal weights are now given by: ...
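The excerpt truncates before the formula; one natural way to write it (a sketch of where this is likely heading, not necessarily the post’s exact expression) is to marginalize over the intermediate chain-of-thought $c$:

$$ \theta^* = \argmax_{\theta} \sum_{c \in C} f_{\theta}(y | c, x) \, f_{\theta}(c | x) \text{ with } x \in X, y \in Y $$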
Nemotron
Nemotron: Advancing Tool-Calling LLMs with Rule-Based Reinforcement Learning 🤖 Large Language Models (LLMs) are becoming increasingly powerful, and their ability to interact with external tools and APIs significantly expands their capabilities. Nvidia’s Nemotron paper introduces an innovative approach to training LLMs for more effective tool use, focusing on a rule-based reinforcement learning (RL) pipeline. This method aims to overcome the common hurdle of requiring large, meticulously curated datasets, allowing models to learn optimal tool-calling strategies more autonomously. ...
Flash Attention
Introduction Transformers have revolutionized the field of machine learning, emerging as the dominant architectural choice across various applications. However, their reliance on self-attention introduces significant computational challenges, particularly due to quadratic time and memory complexity relative to sequence length. While approximate solutions exist, their limited adoption stems from an overemphasis on theoretical FLOP counts rather than practical performance metrics. In 2022, the FlashAttention paper introduced a way to compute the exact attention result by working on sub-blocks of the inputs to reduce memory I/O. ...
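The key ingredient can be illustrated with the online (streaming) softmax that FlashAttention builds on; a minimal NumPy sketch (illustrative only, nothing like the actual fused CUDA kernel):

```python
import numpy as np

def online_softmax_weighted_sum(scores: np.ndarray, values: np.ndarray,
                                block: int = 4) -> np.ndarray:
    """Compute softmax(scores) @ values one block at a time.

    Only one block of scores/values is "resident" per step, mimicking how
    FlashAttention tiles the attention matrix to cut memory traffic.
    """
    m = -np.inf                     # running max (numerical stability)
    l = 0.0                        # running softmax normalizer
    acc = np.zeros(values.shape[1])
    for i in range(0, len(scores), block):
        s, v = scores[i:i + block], values[i:i + block]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale previous partial results
        p = np.exp(s - m_new)
        acc = acc * scale + p @ v
        l = l * scale + p.sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
s, v = rng.normal(size=16), rng.normal(size=(16, 8))
w = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
assert np.allclose(online_softmax_weighted_sum(s, v), w @ v)
```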
Webformers
Introduction Extracting structured information from web pages remains a challenging task in natural language processing. Regular transformer architectures are not designed to encode hierarchical information. Each token attends to every other token in the input sequence, regardless of position (even though there are a few mechanisms to introduce positional information, such as positional encoding). In a webpage, information is highly structured. The HTML document is a tree, with each node having a parent and potentially siblings and children. As a result, some nodes may be semantically related while being relatively far apart if we only count the number of tokens between them. ...
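A toy illustration of that mismatch (the snippet and its tags are made up for this example): two nodes can stay a constant number of hops apart in the tree while the token distance between them grows arbitrarily:

```python
import xml.etree.ElementTree as ET

# <h1> and <price> are always 4 hops apart in the tree
# (h1 -> header -> product -> offer -> price), no matter how long the
# description grows, while their token distance grows with it.
page = (
    "<product>"
    "<header><h1>Red car</h1></header>"
    "<description>" + "marketing copy " * 200 + "</description>"
    "<offer><price>42000</price></offer>"
    "</product>"
)
root = ET.fromstring(page)
print(root.find("./header/h1").text)    # 'Red car'
print(root.find("./offer/price").text)  # '42000'
print(len(page.split()))                # hundreds of tokens in between
```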
Kolmogorov AI Framework | Part 2 - brainfuck
Motivations I recently started a series of articles about a framework I have been thinking about, which allows agents to be trained in a reinforcement learning setting, executing one action at a time in specific environments. The first part of the series can be found here: Kolmogorov AI Framework. Environment To build this proof of concept, I chose the brainfuck programming language. It belongs to the family of esoteric programming languages, but it is made of only 8 instructions, making it very easy to start with as an execution environment. ...
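To make the environment concrete, here is a minimal brainfuck interpreter sketch (my own toy version, not the framework’s actual code), covering the full eight-instruction set:

```python
def run_brainfuck(code: str, stdin: str = "") -> str:
    """Minimal brainfuck interpreter: > < + - . , [ ]."""
    tape, ptr, pc, out, inp = [0] * 30000, 0, 0, [], list(stdin)
    # Pre-match brackets for O(1) jumps.
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(inp.pop(0)) if inp else 0
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # jump past the matching ]
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the matching [
        pc += 1
    return "".join(out)

# 72 increments then '.' prints chr(72) == 'H'; a classic smoke test:
assert run_brainfuck("+" * 72 + ".") == "H"
```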
EPITA Courses - Continuous Physics Informed Neural Networks
Introduction Neural networks require large amounts of data to converge, and those data need to represent the task the network is trying to learn. Data collection is a tedious process, especially when gathering data is difficult or expensive. In science, physics for instance, many phenomena are described by theories that we know work very well. Using those theories as a regularization term can help neural networks generalize better with less data. ...
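As a sketch of the idea (a toy PyTorch example of my own, not the course material): fit $u(x)$ to a few measurements while penalizing the residual of a known equation, here the ODE $u' = -u$:

```python
import torch

# Tiny MLP approximating u(x).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# A handful of (expensive) measurements of u(x) = exp(-x).
x_data = torch.tensor([[0.0], [1.0], [2.0]])
u_data = torch.exp(-x_data)

for step in range(2000):
    opt.zero_grad()
    # Data term: match the few observations.
    loss_data = ((net(x_data) - u_data) ** 2).mean()
    # Physics term: enforce u'(x) + u(x) = 0 on cheap random collocation points.
    x_col = (3.0 * torch.rand(64, 1)).requires_grad_(True)
    u = net(x_col)
    du = torch.autograd.grad(u.sum(), x_col, create_graph=True)[0]
    loss_phys = ((du + u) ** 2).mean()
    (loss_data + loss_phys).backward()
    opt.step()
```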
EPITA Courses - Transformers
Context Generating data is now a hot topic in machine learning. The idea of using statistical methods to produce synthetic data is rather old, and many methods have proven effective in different scenarios. Today, the most well-known ways to generate synthetic data are: VAE GAN Transformers Transformers A bit of history We talked about RNNs last week and saw how they can be used to predict sequences. Unfortunately, RNNs suffer from several problems, especially with long sequences, where they seem to forget what happened early on. ...