Learning AI - 1
I’ve just started reading AI Engineering by Chip Huyen, published in December 2024.
Here are some brief notes as I try to understand the material, with links to other articles I found helpful along the way.
Chapter 1. Introduction to Building AI Applications with Foundation Models
It’s always going to be a challenging read when you don’t understand the meaning of all the words in the chapter title. What does “Foundation Model” mean?!
The emergence of model as a service means that a provider hosts the model and makes it available for others to use, typically through an API.
Foundation models emerged from large language models, which, in turn, originated as just language models.
Language models and information theory
A language model encodes statistical information about one or more languages.
For example, given the context “My favourite colour is __”, a language model that encodes English should predict “blue” more often than “car”.
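To make this concrete, here is a toy sketch of my own (not from the book): a count-based model that estimates the probability of the next word by counting how often each word follows a given context in a tiny corpus.

```python
from collections import Counter

# Toy "training data": a tiny corpus the model learns its statistics from.
corpus = [
    "my favourite colour is blue",
    "my favourite colour is green",
    "my favourite colour is blue",
    "i drive a car",
]

# Count which words follow the two-word context "colour is".
context = "colour is"
next_words = Counter()
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        if " ".join(words[i:i + 2]) == context:
            next_words[words[i + 2]] += 1

# Turn counts into probabilities: P(next word | context).
total = sum(next_words.values())
for word, count in next_words.most_common():
    print(f"P({word!r} | {context!r}) = {count / total:.2f}")

# Prints P('blue' | 'colour is') = 0.67 and P('green' | 'colour is') = 0.33;
# "car" never follows this context, so its estimated probability is 0.
```

Real language models are vastly more sophisticated than this, but the core idea is the same: encode statistics about which text tends to follow which.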
I have in my head that the designers of Scrabble looked at the frequency of letters in written English by analysing the front page of a major newspaper.
The foundations go back a long way: Huyen cites Claude Shannon’s work on information theory and his 1951 paper “Prediction and Entropy of Printed English”. Copies are fairly easy to find online, though some are behind paywalls.
Grant Sanderson, of the excellent 3Blue1Brown channel, has a 30-minute video on Solving Wordle using information theory, with a great subtitle:
An excuse to teach a lesson on information theory and entropy.
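As an aside, the entropy in that subtitle is simple to compute once you have a frequency table. A minimal sketch of my own, using rough illustrative letter frequencies rather than figures from Shannon’s paper:

```python
import math

# Rough relative frequencies of a few English letters (illustrative values only).
letter_freq = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070}

# Normalise so the probabilities of this reduced alphabet sum to 1.
total = sum(letter_freq.values())
probs = {letter: freq / total for letter, freq in letter_freq.items()}

# Shannon entropy: H = -sum(p * log2(p)), measured in bits per letter.
entropy = -sum(p * math.log2(p) for p in probs.values())
print(f"Entropy of this toy distribution: {entropy:.2f} bits per letter")
```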
Tokens
Miguel Grinberg writes in his post How LLMs Work, Explained Without Math:
A token is the basic unit of text understood by the LLM. It is convenient to think of tokens as words, but for the LLM the goal is to encode text as efficiently as possible, so in many cases tokens represent sequences of characters that are shorter or longer than whole words. Punctuation symbols and spaces are also represented as tokens, either individually or grouped with other characters.
The complete list of tokens used by an LLM are said to be the LLM’s vocabulary, since it can be used to express any possible text.
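You can see this splitting in practice with OpenAI’s tiktoken library, which exposes the tokenisers used by its models. A quick sketch (the exact splits depend on which encoding you load):

```python
import tiktoken  # pip install tiktoken

# Load one of OpenAI's tokenisers (cl100k_base is used by several of its models).
enc = tiktoken.get_encoding("cl100k_base")

text = "My favourite colour is blue."
token_ids = enc.encode(text)
print(token_ids)

# Decode each id individually to see how the text was split into tokens.
print([enc.decode([token_id]) for token_id in token_ids])

# The size of the vocabulary: how many distinct tokens this encoding knows about.
print(enc.n_vocab)
```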
Masked language model
The two main types of language model (masked and autoregressive) differ in what information they can use to predict a token.
IBM have a good introduction on how masked language models work:
Masked language models (MLM) are a type of large language model (LLM) used to help predict missing words from text in natural language processing (NLP) tasks. By extension, masked language modeling is one form of training transformer models—notably bidirectional encoder representations from transformers (BERT) and its derivative robustly optimized BERT pretraining approach (RoBERTa) — for NLP tasks by training the model to fill in masked words within a text, and thereby predict the most likely and coherent words to complete the text.
Masked language modeling aids many tasks — from sentiment analysis to text generation — by training a model to understand the contextual relationship between words.
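This is easy to try with the Hugging Face transformers library. A minimal sketch, assuming transformers and a backend such as PyTorch are installed (bert-base-uncased uses [MASK] as its mask token):

```python
from transformers import pipeline  # pip install transformers torch

# Load a pretrained masked language model (BERT) behind a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model can use context on both sides of the mask to fill it in.
predictions = unmasker("My favourite colour is [MASK].")

for prediction in predictions:
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```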
Autoregressive language model
An autoregressive language model is trained to predict the next token in a sequence, using only the preceding tokens. It predicts what comes next in “My favourite colour is __.” An autoregressive model can continually generate one token after another.
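A minimal sketch of that loop, using GPT-2 from the transformers library purely as an example of a small autoregressive model:

```python
from transformers import pipeline  # pip install transformers torch

# GPT-2 is autoregressive: each new token is predicted from the tokens before it.
generator = pipeline("text-generation", model="gpt2")

# Internally the model appends one predicted token at a time to the prompt.
result = generator("My favourite colour is", max_new_tokens=5, num_return_sequences=1)
print(result[0]["generated_text"])
```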
The AWS article What are Autoregressive Models is helpful:
Autoregressive models are a class of machine learning (ML) models that automatically predict the next component in a sequence by taking measurements from previous inputs in the sequence. Autoregression is a statistical technique used in time-series analysis that assumes that the current value of a time series is a function of its past values. Autoregressive models use similar mathematical techniques to determine the probabilistic correlation between elements in a sequence. They then use the knowledge derived to guess the next element in an unknown sequence.
For example, during training, an autoregressive model processes several English language sentences and identifies that the word “is” always follows the word “there.” It then generates a new sequence that has “there is” together.
Generative artificial intelligence (generative AI) is an advanced data science technology capable of creating new and unique content by learning from massive training data.
Back to Chip Huyen:
Today, autoregressive language models are the models of choice for text generation, and for this reason, they are much more popular than masked language models.