Introduction

The current technology behind machine learning (ML) and artificial intelligence (AI) relies on advanced algorithms and large-scale computational power to train models on very large datasets. These models are based on very large neural networks, with up to trillions of parameters, to enable intelligent behavior.

The astonishing empirical performance of these massive neural networks is not fortuitous, but rests on strong mathematical foundations: these large learning systems satisfy two very useful and powerful properties, demonstrated by Belkin and his co-authors and presented on the AI & Math page.

The learning systems used in Cat.IA-AI.Innovations draw from three classes of learning models:
  1. The supervised, semi-supervised, unsupervised, and self-supervised learning models,
  2. The reinforcement learning (RL) models, with a large mathematical apparatus providing proven guarantees on the performance of their algorithms,
  3. The generative large language models (LLMs), enhanced with reasoning abilities.


Supervised, unsupervised and self-supervised learning

  1. Supervised learning is the simplest learning method. We have a dataset $ \mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} $, which consists of a collection of $N$ pairs of inputs $x_i$ (or features) and outputs $y_i$, written in matrix form as $ \mathcal{D} =(X,Y)$. The output $Y$ might be discrete, continuous, a set of labels in the case of a classification problem, etc.
    The goal of supervised learning is to approximate the function $f$ that maps $X$ to $Y$, i.e., $Y=f(X)$, by learning this function from the dataset $\mathcal{D}$.

    The estimate of the function $f$ is obtained by minimizing a loss function: $$ \min_{f \in \mathcal{H}} \mathcal{L}(Y,X;\theta), $$ where $\mathcal{H}$ is the space of candidate functions, parameterized by $\theta \in \Theta$, the set of parameters to be estimated. The function $f$ is approximated using neural networks, which are universal approximators. The parameters are fitted on a training set; the model is then tested and validated on separate datasets (a minimal numerical sketch is given after this list). Further details on the minimization of the loss function, and on the recent advances in this field for very large systems, are given on the AI & Math page.

  2. However, labels are often costly to obtain, or only partially available. Furthermore, errors in the labels affect the result of supervised learning and require either specific corrective tools (Northcutt et al., 2021) or alternative methods that estimate the model without depending on labels.
    Unsupervised learning exploits the internal structure/organisation of the data and does not need a set of labels. This approach is similar to the non-parametric approach in statistics, which 'lets the data speak for themselves'.
    Semi-supervised learning is the intermediate case between supervised and unsupervised learning: the learning procedure first uses the labelled data while leveraging a larger amount of unlabeled data, which strongly enhances the accuracy of the model.

  3. In self-supervised learning, the model generates its own supervisory signals from the underlying structure/organisation of the data. The model learns meaningful representations or features of the data using pretext tasks, contrastive learning, redundancy reduction (Zbontar et al., 2021), etc.
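
As a minimal numerical illustration of the supervised setting in item 1 above, the following sketch fits a small feed-forward network on a synthetic regression dataset and evaluates it on a held-out test set. It assumes scikit-learn is available; the synthetic data, the architecture, and the hyperparameter values are purely illustrative.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    # Synthetic dataset D = (X, Y): N pairs of inputs x_i and noisy outputs y_i = f(x_i) + noise.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(1000, 2))
    Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=1000)

    # The parameters are fitted on a training set; the model is evaluated on a held-out test set.
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

    # A small feed-forward network approximating f by minimising a squared loss.
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    model.fit(X_train, Y_train)

    print("R^2 on the test set:", model.score(X_test, Y_test))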


Reinforcement learning

Reinforcement learning (RL) is a fundamental learning method at the intersection of artificial intelligence and machine learning. In the standard supervised, unsupervised, self-supervised frameworks, the learner receives the data and learns a model from them.
In the RL framework, the learner is more active: it is an agent interacting with its environment through a sequence of actions. In response to the action just taken, the agent receives two types of information through a standard feedback mechanism:
  1. Its current state in the environment,
  2. A real-valued immediate reward for its actions. No future or long-term reward is provided.
The agent takes this feedback into account for its future actions. The objective of the agent is to maximize its reward, and thus to select the best course of actions, the policy, that maximizes the cumulative reward over time. If the environment is known, this is a planning problem; if the environment is unknown, this is a learning problem.
In the latter case, the agent balances the trade-off between exploration, i.e., obtaining more information on the environment and the rewards, and exploitation of the information already collected to optimize the rewards.
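
To make the learning case and the exploration/exploitation trade-off concrete, the sketch below runs tabular Q-learning with an epsilon-greedy rule on a small invented chain environment: with probability epsilon the agent explores a random action, otherwise it exploits its current value estimates. The environment, the step size, and the episode budget are illustrative choices, not prescriptions.

    import numpy as np

    # Toy chain environment with 5 states: action 0 moves left, action 1 moves right.
    # Reaching the rightmost state yields reward 1 and ends the episode (illustrative only).
    n_states, n_actions = 5, 2
    rng = np.random.default_rng(0)

    def step(s, a):
        s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        return s_next, reward, s_next == n_states - 1

    def greedy(q_row):
        # Exploitation with random tie-breaking (the initial table is all zeros).
        return int(rng.choice(np.flatnonzero(q_row == q_row.max())))

    Q = np.zeros((n_states, n_actions))       # estimated action values (formalized below)
    alpha, gamma, epsilon = 0.1, 0.9, 0.2     # step size, discount factor, exploration rate

    for episode in range(500):
        s, done, t = 0, False, 0
        while not done and t < 200:
            # Exploration with probability epsilon, exploitation otherwise.
            a = rng.integers(n_actions) if rng.random() < epsilon else greedy(Q[s])
            s_next, r, done = step(s, a)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])   # one-step Q-learning update
            s, t = s_next, t + 1

    print("Greedy policy (0 = left, 1 = right):", Q.argmax(axis=1))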

The standard framework for formalizing reinforcement learning is the Markov decision process (MDP), defined by $M=(\mathcal{S,A,P},r,\gamma )$ where $\mathcal{S,A}$ are respectively the state space and the action space, which might be infinite, $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}$, and $\gamma \in (0,1)$ is the discount factor.
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(\mathcal{S})$, where $\mathcal{P}(s^{'}|s,a)$ is the probability of transitioning from state $s$ to state $s^{'}$ when $a$ is selected; $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$.
The model is Markovian since the transition and reward probabilities depend only on the current state $s$ and not the entire history of states and actions.

A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$. The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$, which is the expected discounted total reward starting from the initial state $s^0=s$, i.e., \begin{eqnarray*} \pi^\star &:= &\arg \max_\pi V^\pi(s),\\ V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ]. \end{eqnarray*} The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined as $$ Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ], $$ $\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneously maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$, and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal {S \times A}$.
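
When the MDP is known (the planning case), $V^\star$ and $Q^\star$ can be computed by value iteration, i.e., by repeatedly applying the Bellman optimality operator. The sketch below does so for a small randomly generated MDP; the sizes and the discount factor are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    n_states, n_actions, gamma = 4, 3, 0.9

    # Random MDP: rewards r(s, a) in [0, 1] and transition kernel P(s' | s, a).
    r = rng.uniform(size=(n_states, n_actions))
    P = rng.uniform(size=(n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)         # each (s, a) row is a distribution over s'

    # Value iteration: repeatedly apply the Bellman optimality operator
    #   Q(s, a) <- r(s, a) + gamma * sum_{s'} P(s' | s, a) * max_{a'} Q(s', a').
    Q = np.zeros((n_states, n_actions))
    for _ in range(1000):
        Q_new = r + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < 1e-10:
            break
        Q = Q_new

    V_star = Q.max(axis=1)                    # optimal value function V*(s)
    pi_star = Q.argmax(axis=1)                # a deterministic optimal policy
    print("V* =", np.round(V_star, 3), "  pi* =", pi_star)

For the discounted finite case, the iterates converge geometrically at rate $\gamma$, since the Bellman optimality operator is a $\gamma$-contraction in the sup norm.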

This MDP setup is flexible enough to deal with several configurations, e.g., off-policy RL. In the generative case, the probability transition kernel $\mathcal{P}$ is empirically estimated from the data.
This framework provides guarantees on the reliability of the algorithms used for estimating the optimal value function $V^\star := V^{\pi^\star}$, or the optimal action value function $Q^\star := Q^{\pi^\star}$. In particular, recent results (Li et al., 2024) show that the minimum sample size for achieving the minimax sample complexity is $ N \geqslant O\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$.
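
The generative setting can be made concrete with the basic plug-in idea: draw $N$ independent transitions for each state-action pair, form the empirical kernel $\widehat{\mathcal{P}}$, and plan in the empirical MDP. The toy MDP below is invented for illustration; the sketch only shows the plug-in construction, not the full analysis of Li et al. (2024).

    import numpy as np

    def value_iteration(r, P, gamma, iters=2000):
        # Solve Q(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a').
        Q = np.zeros(r.shape)
        for _ in range(iters):
            Q = r + gamma * P @ Q.max(axis=1)
        return Q

    rng = np.random.default_rng(2)
    S, A, gamma = 5, 2, 0.9
    r = rng.uniform(size=(S, A))
    P = rng.uniform(size=(S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    Q_star = value_iteration(r, P, gamma)          # exact solution with the true kernel

    for N in (10, 100, 1000):
        # Generative model: draw N independent next states for every (s, a) and
        # use the empirical frequencies as the plug-in kernel P_hat.
        P_hat = np.zeros_like(P)
        for s in range(S):
            for a in range(A):
                samples = rng.choice(S, size=N, p=P[s, a])
                P_hat[s, a] = np.bincount(samples, minlength=S) / N
        Q_hat = value_iteration(r, P_hat, gamma)   # plan in the empirical MDP
        print(f"N = {N:4d}   max |Q_hat - Q_star| = {np.max(np.abs(Q_hat - Q_star)):.4f}")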


Generative AI

Generative AI is a part of automation technologies: it generates text, programming code and images, trained on a dataset covering much of the content of the web, using attention mechanisms, transformer architectures and autoregressive probabilistic models (Vaswani et al., 2017).
These combinatorial generative models, based on a linguistic representation of the physical world, operate inside the box of that representation, and are therefore unable to plan or to produce content outside that box. They do not possess complete deductive closure, as they might not always derive every possible conclusion from the information they have. The output of an LLM is similar to Kahneman's (2011) System 1.
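
As a reminder of the basic computational block behind these architectures, the sketch below computes single-head scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$, with NumPy; the dimensions and the random inputs are illustrative.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed row-wise.
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    rng = np.random.default_rng(0)
    seq_len, d_k = 6, 8
    Q = rng.normal(size=(seq_len, d_k))   # queries
    K = rng.normal(size=(seq_len, d_k))   # keys
    V = rng.normal(size=(seq_len, d_k))   # values

    out = scaled_dot_product_attention(Q, K, V)
    print(out.shape)                      # (6, 8): one contextualised vector per position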

Although the syntax of the generated text is correct, since grammatical correctness is a direct consequence of the autoregressive probabilistic tools of generative AI, the meaning of this text must be evaluated. Generally, the output of these generative models must still be checked: e.g., generated code must be verified by an interpreter/compiler, the multi-step reasoning of a maths solver must be verified (Cobbe et al., 2021), historical inconsistencies in generated images must be corrected, etc.

Recent research has shown that the reasoning properties of large language models (LLMs) can be enhanced by equipping them with reasoning and planning capabilities that are closer to Kahneman's (2011) System 2.
Chain-of-Thought (CoT) prompting (Wei et al., 2022) is a series of intermediate natural-language reasoning steps along a single path that leads to the final output. This is an extension of the few-shot prompting properties of LLMs (Brown et al., 2020): in few-shot prompting, the LLM is provided with a small number of in-context (input, output) pairs of examples, or 'shots'. In contrast, in CoT prompting, the LLM is provided with triples (input, chain-of-thought, output), where the chain-of-thought consists of a series of intermediate reasoning steps that lead to the solution of the studied problem.
CoT prompting enhances the model's performance without requiring extensive retraining. Single-path Chain-of-Thought (CoT) prompting has been further generalized by Tree of Thoughts (ToT) prompting (Yao et al., 2024), which is characterized by multiple competing reasoning paths organized as a tree. The formalization of human problem solving as a search problem over a tree was advocated by Newell and Simon (1972).
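
To make the contrast concrete, the sketch below assembles a few-shot prompt and a chain-of-thought prompt from the same toy worked example; the example problems, the reasoning text, and the final complete() call are hypothetical placeholders, not a specific provider API.

    # One toy worked example; the problem and the reasoning text are invented for illustration.
    examples = [{
        "input": "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "chain_of_thought": "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
        "output": "5",
    }]
    question = "A box holds 4 pens. How many pens are in 3 boxes?"

    # Few-shot prompting: in-context (input, output) pairs only (Brown et al., 2020).
    few_shot = "\n\n".join(f"Q: {e['input']}\nA: {e['output']}" for e in examples)
    few_shot += f"\n\nQ: {question}\nA:"

    # Chain-of-thought prompting: (input, chain-of-thought, output) triples (Wei et al., 2022).
    cot = "\n\n".join(
        f"Q: {e['input']}\nA: {e['chain_of_thought']} The answer is {e['output']}."
        for e in examples
    )
    cot += f"\n\nQ: {question}\nA:"

    print(cot)
    # answer = complete(cot)   # hypothetical call to an LLM completion endpoint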

Amazon Bedrock is the standard framework used for our Generative AI services.
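
For completeness, a minimal text-generation call through Amazon Bedrock could look like the following sketch. It assumes boto3 with valid AWS credentials and access to a Bedrock text model in the chosen region; the region, the model identifier, and the prompt are illustrative placeholders.

    import boto3

    # Assumes AWS credentials with Bedrock access; the region and model ID are placeholders.
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # any text model enabled in the account
        messages=[{
            "role": "user",
            "content": [{"text": "Summarise chain-of-thought prompting in two sentences."}],
        }],
        inferenceConfig={"maxTokens": 200, "temperature": 0.2},
    )
    print(response["output"]["message"]["content"][0]["text"])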


References

  1. Bahdanau, D., Cho, K., and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  2. Belkin, M. 2021. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numer. 30, 203-248. MR4298218
  3. Bishop, C.M. and Bishop, H., 2023. Deep learning: Foundations and concepts. Springer Nature. MR4719738
  4. Brown, T. B. et al., 2020. Language models are few-shot learners. arXiv preprint ArXiv:2005.14165.
  5. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R. and Hesse, C., 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  6. Fujisawa, I., Nobe, S., Seto, H., Onda, R., Uchida, Y., Ikoma, H., Chien, P.C. and Kanai, R., 2024. ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure. arXiv preprint arXiv:2410.03117.
  7. Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., LeCun, Y., and Goldblum, M., 2023. A Cookbook of Self-Supervised Learning. arXiv preprint arXiv:2304.12210.
  8. Goyal, A. and Bengio, Y., 2022. Inductive biases for deep learning of higher-level cognition. Proc. R. Soc. A.478 no.2266, Paper No. 20210068, 35 pp. MR4505795
  9. Hinton, G.E. and Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786), pp.504-507 MR2242509
  10. Hinton, G., 2023. How to represent part-whole hierarchies in a neural network. Neural Computation, 35(3), pp.413-452. MR4555492
  11. Kahneman, D., 2011. Thinking Fast and Slow. Macmillan.
  12. Kambhampati, S., 2024. Can large language models reason and plan?. Annals of the New York Academy of Sciences, 1534(1), pp.15-18.
  13. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. and Iwasawa, Y., 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199-22213.
  14. Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Math. Program. 198, no.1, Ser. A, 1059-1106. MR4550970
  15. Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Oper. Res.72, no.1, 203–221. MR4705834
  16. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529-533.
  17. Mohri, M., Rostamizadeh, A. and Talwalkar, A., 2018. Foundations of machine learning, second edition. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA. MR3931734
  18. Newell, A., Shaw, J.C. and Simon, H.A., 1959, June. Report on a general problem solving program. In IFIP congress, Vol. 256, p. 64.
  19. Newell, A. and Simon, H.A., 1972. Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ.
  20. Northcutt, C., Jiang, L. and Chuang, I., 2021. Confident learning: Estimating uncertainty in dataset labels. J. Artificial Intelligence Res. 70, pp.1373-1411. MR4306614
  21. Rae, J.W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S. and Rutherford, E., 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.
  22. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. and Chen, Y., 2017. Mastering the game of go without human knowledge. Nature, 550(7676), pp.354-359.
  23. Szepesvári, C., 2022. Algorithms for reinforcement learning, reprint of the 2010 original, Synthesis Lectures on Artificial Intelligence and Machine Learning, 9, Springer, Cham. MR4647095
  24. Teyssière, G., 2022. Review of "Belkin, M. (2021). Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation". Mathematical Reviews, American Mathematical Society.
  25. Teyssière, G., 2023. Review of "Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes". Mathematical Reviews, American Mathematical Society.
  26. Teyssière, G., 2024. Review of "Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model". Mathematical Reviews, American Mathematical Society.
  27. Vaswani, A. et al., 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
  28. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), 35, Article No 1800, 24824-24837.
  29. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K., 2024. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23), 36, Article No 517, 11809-11822.
  30. Zbontar, J., Jing, L., Misra, I., LeCun, Y. and Deny, S., 2021. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. Proceedings of the 38th International Conference on Machine Learning, 139, 12310-12320.