Let’s adopt the following notation:
- $f_{\theta}: X \to Y$ the LLM parametrized by its weights $\theta$
- $X$ the set of tasks (prompts, made of tokens)
- $Y$ the set of answers to those tasks (made of tokens as well)
The best parameters for the model are given by:
$$ \theta^* = \argmax_{\theta} f_{\theta}(y | x) \text{ with } (x, y) \in X \times Y $$
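To make this objective concrete, here is a minimal sketch of computing $\log f_{\theta}(y | x)$ for a single (prompt, answer) pair, assuming the Hugging Face `transformers` library and an arbitrary causal LM; the helper name `log_prob_of_answer` is made up for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def log_prob_of_answer(prompt: str, answer: str) -> float:
    """Sum the log-probabilities of the `answer` tokens given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits       # (1, seq_len, vocab_size)
    # Logits at position t give the distribution over token t + 1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the answer tokens, i.e. those after the prompt
    # (assumes tokenization splits cleanly at the prompt/answer boundary).
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].sum().item()
```

Summing per-token log-probabilities is just the log of the product of next-token probabilities, which is $f_{\theta}(y | x)$ under the autoregressive factorization.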
When fine-tuning a model to think, instead of outputting the final answer straight away, the model is trained to generate a sequence of tokens between the prompt and the answer, a sequence that can be manually curated. Let’s call this sequence of tokens $c \in C$. The optimal weights are now given by:
$$ \theta^* = \argmax_{\theta} f_{\theta}(y, c | x) $$
But by the chain rule of probability, we know that:
$$ p(y, c | x) = p(c | x) \times p(y | x, c) $$
Here, $f$ is actually trained to model the probability distribution of the next token given the previous tokens, so $f \sim p$ if we allow these aggressive notations (a numeric check of this factorization follows the list):
- $f(c | x)$ is the probability of the model generating the intermediary sequence based on the prompt
- $f(y | x, c)$ is the probability of the model generating the right answer based on the prompt and the intermediary sequence
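Reusing the hypothetical `log_prob_of_answer` helper above, the factorization can be checked numerically in log space, where the product becomes a sum; the strings are made-up examples:

```python
x = "What is 17 * 3? "                # prompt (made-up example)
c = "17 * 3 = 17 * 2 + 17 = 51. So "  # intermediary sequence c
y = "the answer is 51."               # final answer

lhs = log_prob_of_answer(x, c + y)    # log p(y, c | x)
rhs = log_prob_of_answer(x, c) + log_prob_of_answer(x + c, y)
print(lhs, rhs)  # equal up to tokenization-boundary and floating-point effects
```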
So if we define $x' \leftarrow x \circ c$ (the concatenation of the input tokens and the intermediary tokens), then we indeed have:
$$ f(y | x') $$
which is the probability of the model outputting the right answer based on a prompt it partly crafted itself.
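To illustrate this last point, a short sketch (still reusing the assumed model, tokenizer, and helper above, with arbitrary sampling parameters) lets the model sample its own intermediary sequence and then scores the answer under the concatenation:

```python
# The model extends the prompt with its own intermediary tokens c,
# then the answer is scored under x' = x ∘ c.
prompt_ids = tokenizer(x, return_tensors="pt").input_ids
sampled = model.generate(
    prompt_ids,
    max_new_tokens=64,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
c_model = tokenizer.decode(sampled[0, prompt_ids.shape[1]:])
x_prime = x + c_model                  # x' = x ∘ c
print(log_prob_of_answer(x_prime, y))  # log f(y | x')
```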