Reinforcement learning (RL) is a fundamental and distinctive learning paradigm.
The efficiency of DeepSeek-R1, a hybrid model combining RL and LLMs, will likely accelerate the adoption of RL methods.
In the standard supervised, unsupervised, and self-supervised frameworks, the learner receives the data and learns a model from them.
In the RL framework, the learner is more active: it is an agent interacting in an unsupervised manner with its environment through a sequence of actions. In response to the action just taken, the agent receives two types of information as feedback:
- Its current state in the environment,
- A real-valued immediate reward for its action. In more advanced frameworks, future or long-term rewards are also provided; the trade-off between immediate and future rewards is controlled by the discount factor $\gamma \in (0,1)$.
The agent takes this feedback into account for its future actions.
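This interaction loop can be made concrete with a minimal sketch in Python. The toy environment, its `step` method and the reward values below are illustrative assumptions, not a specific library API; the loop simply accumulates the discounted reward received along the trajectory.

```python
import random

# Minimal sketch of the agent-environment loop on a toy two-state environment;
# the class and its step() method are illustrative, not a real RL library API.
class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Transition: action 1 reaches the rewarding state 1 with probability 0.8.
        self.state = 1 if (action == 1 and random.random() < 0.8) else 0
        reward = 1.0 if self.state == 1 else 0.0   # real-valued immediate reward
        return self.state, reward                  # feedback: current state and reward

env = ToyEnvironment()
gamma = 0.9              # discount factor: trade-off between immediate and future rewards
discounted_return = 0.0
for t in range(50):
    action = random.choice([0, 1])      # the agent selects an action
    state, reward = env.step(action)    # feedback from the environment
    discounted_return += (gamma ** t) * reward
print(discounted_return)
```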
The objective of the agent is to maximize its reward, and thus to select the best course of actions, the
policy, that maximizes the cumulative reward over time. If the environment is known, this is a
planning problem, whilst if the environment is unknown, this is a
learning problem: the agent learns a sequence of decisions, i.e., a policy.
In the latter case, the agent balances the trade-off between exploration, i.e., obtaining more information about the environment and the rewards, and exploitation of the information already collected for optimizing the rewards.
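The exploration/exploitation trade-off can be illustrated with the standard $\varepsilon$-greedy rule sketched below; the estimated action values `Q_hat` and the exploration rate `epsilon` are assumptions for the example, not quantities defined earlier in the text.

```python
import random

def epsilon_greedy(Q_hat, epsilon=0.1):
    """With probability epsilon, explore a random action;
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(Q_hat))                     # exploration
    return max(range(len(Q_hat)), key=lambda a: Q_hat[a])       # exploitation

# Example: estimated values for three possible actions in the current state.
Q_hat = [0.2, 0.5, 0.1]
print(epsilon_greedy(Q_hat))
```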
The standard framework for formalizing reinforcement learning is the Markov decision process (MDP), defined by $M=(\mathcal{S,A,P},r,\gamma)$ where $\mathcal{S,A}$ are respectively the state space and the action space, which might be infinite, $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}.$
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(\mathcal{S})$, where $\mathcal{P}(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ when $a$ is selected; $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$. In the generative case, the probability transition kernel $\mathcal{P}$ is empirically estimated from the data.
The model is Markovian since the transition and reward probabilities depend only on the current state $s$ and not the entire history of states and actions. In the standard tabular case, the cardinalities $|\mathcal{S}|$ and $|\mathcal{A}|$, i.e., the number of states and actions respectively, are finite.
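In the tabular case, the tuple $M=(\mathcal{S,A,P},r,\gamma)$ can be stored directly as arrays. A minimal sketch with NumPy, where the cardinalities and random entries are purely illustrative; the transition kernel is a $|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|$ array whose last axis is a probability vector in $\Delta(\mathcal{S})$.

```python
import numpy as np

n_states, n_actions = 3, 2      # illustrative cardinalities |S| and |A|
gamma = 0.9                     # discount factor

# Reward function r(s, a) in [0, 1], stored as an |S| x |A| array.
rng = np.random.default_rng(0)
r = rng.random((n_states, n_actions))

# Transition kernel P(s' | s, a): each slice over s' is a probability vector.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

assert np.allclose(P.sum(axis=2), 1.0)   # every P(. | s, a) lies in the simplex
```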
A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$.
The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$, which is the expected discounted total reward starting from the initial state $s^0=s$, i.e.,
\begin{eqnarray*}
\pi^\star &:= &\arg \max_\pi V^\pi(s),\\
V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ].
\end{eqnarray*}
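In the tabular setting, $V^\pi$ for a fixed deterministic policy solves the linear Bellman system $V^\pi = r_\pi + \gamma P_\pi V^\pi$. The sketch below solves it exactly; the small random MDP is the same kind of illustrative setup as in the previous snippet.

```python
import numpy as np

def evaluate_policy(P, r, gamma, pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V^pi = r_pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), pi]     # |S| x |S| transition matrix under policy pi
    r_pi = r[np.arange(n), pi]     # |S| reward vector under policy pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Illustrative tabular MDP (same shapes as in the previous sketch).
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((3, 2))
V_pi = evaluate_policy(P, r, gamma=0.9, pi=np.array([0, 1, 0]))
print(V_pi)
```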
The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined
as
$$
Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ],
$$
$\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneously maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$, and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal {S \times A}$.
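The optimal pair $(V^\star, Q^\star)$ can be approximated by iterating the Bellman optimality operator until convergence. A minimal value-iteration sketch on the same kind of illustrative tabular MDP; the tolerance and array shapes are assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Iterate Q <- r + gamma * P V, V <- max_a Q, until convergence."""
    V = np.zeros(P.shape[0])
    while True:
        Q = r + gamma * np.einsum('sap,p->sa', P, V)   # Bellman optimality update
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q, Q.argmax(axis=1)          # V*, Q*, greedy policy pi*
        V = V_new

rng = np.random.default_rng(0)
P = rng.random((3, 2, 3)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((3, 2))
V_star, Q_star, pi_star = value_iteration(P, r, gamma=0.9)
print(V_star, pi_star)
```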
This MDP framework is flexible enough to deal with several configurations, e.g., off-policy RL where data logged from a different policy are used in the exploration step, but rigorous enough to allow an assessment of the reliability of the results, e.g., we obtain confidence intervals for the rewards. Furthermore, this setup allows one to provide guarantees on the reliability of the algorithms used for estimating the optimal value function $V^\star := V^{\pi^\star}$,
or the optimal action value function $Q^\star := Q^{\pi^\star}$.
The MDP framework is particularly valuable for analyzing the sample complexity of reinforcement learning algorithms. This means we can study how efficiently these algorithms learn from potentially limited data.
New results on the sample complexity and the minimum sample size for achieving the minimax sample complexity are given in the
AI & Math Blog.
This MDP framework has been extended to the non-tabular case, with a very large number of states, and/or continuous states, see e.g., Long and Han (2022, 2023).
Generative AI is part of automation technologies: it generates text, programming code and images, using essentially the entire content of the web as its database, together with attention mechanisms, transformer architectures and autoregressive probabilistic models.
These combinatorial generative models, based on the linguistic representation of the physical world, work inside that representation, and are therefore unable to plan or to produce any content outside that box.
From a mathematical perspective, LLMs often lack complete deductive closure, as they may not always derive every possible conclusion from their given information set. The output of LLMs is akin to Kahneman's (2011) Type 1 system.
As a consequence, the output of these generative models must be checked, e.g., the generated code must be verified by an interpreter/compiler, and the multi-step reasoning of a maths solver must be verified (Cobbe et al., 2021).
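As a concrete illustration of such checking, the sketch below compiles and executes a candidate piece of generated Python code against a few test cases; the candidate source, the function name `add` and the tests are hypothetical examples, not output from an actual model.

```python
# Minimal sketch of verifying generated code by executing it; the candidate
# source and the test cases are hypothetical.
candidate_source = """
def add(a, b):
    return a + b
"""

def check_generated_code(source, tests):
    namespace = {}
    try:
        exec(compile(source, "<generated>", "exec"), namespace)  # syntax + definition check
    except SyntaxError:
        return False
    func = namespace.get("add")
    return func is not None and all(func(*args) == expected for args, expected in tests)

print(check_generated_code(candidate_source, [((1, 2), 3), ((0, 0), 0)]))
```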
Recent research suggests that enhancing LLMs with reasoning and planning capabilities akin to Kahneman's (2011) Type 2 system can significantly improve their reasoning abilities.
In particular, Chervonyi et al. (2025) developed a new fast algorithm called the deductive database arithmetic reasoning (DDAR 2). The DDAR employs a fixed set of deduction rules to iteratively add new elements to the initial information set until its deductive closure is reached. The DDAR 2 algorithm successfully solves gold medal-level olympiad problems in geometry, and improves on the results obtained with the previous version (DDAR 1) presented by Trinh et al. (2024).
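The closure idea can be illustrated generically: starting from an initial set of facts, a fixed rule set is applied repeatedly, adding every derivable fact, until nothing new appears. The sketch below is a generic forward-chaining loop with a single toy rule (transitivity of parallelism), not the actual DDAR 2 implementation.

```python
# Generic forward chaining to a fixed point (deductive closure); the facts and
# the single rule below are toy stand-ins for the deduction rules of DDAR.
facts = {("parallel", "AB", "CD"), ("parallel", "CD", "EF")}

def apply_rules(facts):
    """One pass of a toy rule: parallel(x, y) and parallel(y, v) => parallel(x, v)."""
    new = set()
    for (p, x, y) in facts:
        for (q, u, v) in facts:
            if p == q == "parallel" and y == u and x != v:
                new.add(("parallel", x, v))
    return new - facts

while True:
    derived = apply_rules(facts)
    if not derived:          # deductive closure reached: no new fact can be added
        break
    facts |= derived

print(sorted(facts))
```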
Alternatively, the reasoning performance of LLMs can be extended and improved with Chain-of-Thought (CoT) prompting (Wei et al., 2022). CoT is an extension of the few-shot prompting properties of LLMs (Brown et al., 2020):
for few-shot prompting, the LLM is provided with a small number of in-context (input, output) pairs of examples, or 'shots'. In contrast, for CoT prompting, the LLM is provided with triples (input, chain of thought, output), where the chain of thought consists of a series of intermediate reasoning steps that lead to the solution of the studied problem.
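The difference between the two prompt formats can be made explicit with a small sketch; the arithmetic word problems and the wording of the chain of thought below are illustrative assumptions, not taken from a specific benchmark.

```python
# Illustrative few-shot (input, output) pair versus CoT (input, chain_of_thought, output) triple.
few_shot_example = {
    "input": "Q: 17 + 25 = ?",
    "output": "42",
}
cot_example = {
    "input": "Q: Ann has 17 apples and buys 25 more. How many apples does she have?",
    "chain_of_thought": "Ann starts with 17 apples. She buys 25 more, so 17 + 25 = 42.",
    "output": "42",
}

def build_prompt(examples, question, use_cot):
    parts = []
    for ex in examples:
        if use_cot:
            parts.append(f"{ex['input']}\n{ex['chain_of_thought']}\nAnswer: {ex['output']}")
        else:
            parts.append(f"{ex['input']}\nAnswer: {ex['output']}")
    parts.append(question)
    return "\n\n".join(parts)

print(build_prompt([cot_example], "Q: Bob has 8 pens and buys 5 more. How many pens?", use_cot=True))
```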
CoT consists of breaking down complex tasks into smaller, more manageable intermediate steps. These intermediate steps, which make the decision-making process more transparent, can be particularly useful in tasks that require multiple steps of reasoning, where understanding the model's reasoning is crucial.
Furthermore, this decomposition into elementary steps can reduce the likelihood of generating irrelevant or incorrect information, known as hallucinations. CoT reduces the cognitive load of processing complex information all at once by distributing the problem-solving process over multiple steps, allowing the model to handle each step more effectively and to carry out
intermediate verification of each step in the reasoning process. This means that errors can be identified and corrected earlier, preventing them from propagating and affecting the final conclusion. RL can be combined with CoT to reduce hallucinations.
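A minimal sketch of such early error detection, assuming each intermediate step exposes a simple arithmetic claim that a hypothetical checker can test before the next step is accepted:

```python
import re

# Hypothetical chain of thought in which each step states an arithmetic claim;
# the third step is deliberately wrong to show where verification stops.
steps = ["17 + 25 = 42", "42 * 2 = 84", "84 - 4 = 79"]

def check_step(step):
    """Verify a single 'a op b = c' claim; returns False if it cannot be parsed."""
    m = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)\s*", step)
    if not m:
        return False
    a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
    return {"+": a + b, "-": a - b, "*": a * b}[op] == c

for i, step in enumerate(steps):
    if not check_step(step):
        print(f"error detected at step {i}: {step}")   # stop before the error propagates
        break
```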
Prompting the model consists of generating a series of thoughts or steps that lead to the final answer, rather than directly producing the solution. This approach helps the model handle more complex, multi-step problems by mimicking human-like, step-by-step logical reasoning leading to the final answer. This requires careful prompt engineering, i.e., carefully designing and refining specific instructions or questions that guide the model's responses in a way that aligns with the user's goals or requirements. Poorly designed prompts can lead to suboptimal performance.
Furthermore, CoT prompting does not require extensive retraining.
The single-pathway CoT prompting has been further generalized by Tree of Thoughts (ToT) prompting (Yao et al., 2024), which is characterized by multiple competing reasoning paths modeled as a tree. The formalization of human problem solving as a search problem along a tree was advocated by the Turing Award laureates Newell and Simon (1972).
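A minimal sketch of the tree-search idea: partial reasoning paths are expanded into several candidate next thoughts and explored best-first. The `expand` and `score` functions below are toy numeric stand-ins for an LLM proposing and evaluating thoughts; this is a generic best-first search, not the exact ToT procedure of Yao et al. (2024).

```python
import heapq

def expand(path):
    """Propose candidate next thoughts for a partial path (toy example:
    each thought appends a digit, searching for a path whose sum is 10)."""
    return [path + [d] for d in range(1, 5)]

def score(path):
    return -abs(10 - sum(path))      # higher is better: closer to the target sum

def tree_of_thoughts(max_depth=5, beam=3):
    frontier = [(-score([]), [])]                    # best-first frontier of partial paths
    while frontier:
        _, path = heapq.heappop(frontier)
        if sum(path) == 10:
            return path                              # a complete reasoning path
        if len(path) >= max_depth:
            continue
        for child in sorted(expand(path), key=score, reverse=True)[:beam]:
            heapq.heappush(frontier, (-score(child), child))
    return None

print(tree_of_thoughts())
```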
Amazon Bedrock is the standard framework used for our Generative AI services.