The third of my six short podcasts on Artificial Intelligence has now been posted. After two preliminary episodes on computers and on libraries, this episode begins the explanation of how AI works.
There are four stages to creating an AI platform: pre-training, training, post-training, and reinforcement training.
Pre-training is essentially the gathering and preparation of data. The AI computers scour the Internet for as much data as they can find: text, audio, video, images. (In practice there are some limitations applied, as certain sites are considered ‘more equal than others’.)
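By way of illustration only, here is a toy sketch of the gathering step for a single web page, assuming the Python requests and BeautifulSoup libraries and a placeholder URL. Real pre-training pipelines do this for billions of pages and add heavy filtering and de-duplication on top.

```python
# Toy sketch: fetch one page and keep only its visible text.
# The URL is a placeholder; requests and beautifulsoup4 are assumed installed.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/"                  # placeholder page
html = requests.get(url, timeout=10).text     # raw HTML of the page
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

print(text[:200])   # the first few hundred characters of cleaned-up text
```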
Then this data is ‘tokenized’. I’ve used the example of English text, because it’s easier to grasp and the principle is the same for all data. There are perhaps 50,000 English words in everyday use, but a modern tokenizer’s vocabulary can contain as many as 200,000 distinct tokens. Some tokens are single words; most are parts of words. (Don’t confuse the vocabulary with the training data itself: the latest information I have is that leading models are trained on over 15 trillion tokens of data.)
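Here is a minimal sketch of tokenization in Python, assuming the open-source tiktoken library (which implements the tokenizers used by some OpenAI models):

```python
# Minimal sketch of tokenization, assuming tiktoken is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # one widely used vocabulary

text = "Artificial intelligence is artificial."
token_ids = enc.encode(text)                    # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # each ID back to its piece of text

print(token_ids)    # a short list of integers
print(pieces)       # a mix of whole words and parts of words
print(enc.n_vocab)  # size of this vocabulary: roughly 100,000 distinct tokens
```

The printed pieces make the point concrete: some are whole words, some are fragments of words, and every one of them maps to a single integer in the vocabulary.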
Each token is represented by a decimal number. Each decimal number is converted into a binary number (0s and 1s). And, of course, each binary digit is represented in a transistor in computer storage as either a high voltage or a low voltage.
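To make that concrete, a tiny sketch (the token ID is an arbitrary example, not looked up from any real vocabulary):

```python
# Tiny sketch: one token ID, first as a decimal number, then as binary digits.
token_id = 50256                  # an arbitrary example ID

bits = format(token_id, "016b")   # the same number written as 16 binary digits
print(token_id)                   # 50256
print(bits)                       # 1100010001010000

# Each of those sixteen 0s and 1s is held in storage as a low or high voltage.
```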
So now the first stage of AI, pre-training, is complete. Basically you have a one-dimensional array of tokenized data, a massive string of 0s and 1s which exists as a collection of high and low voltages in transistors.
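As a sketch of what that finished artefact can look like in practice, assuming NumPy and a handful of made-up token IDs:

```python
# Sketch of the end product of pre-training: one long, flat array of
# token IDs written to disk as raw bits. The IDs and file name are made up.
import numpy as np

token_ids = [50256, 464, 3290, 318, 257, 1332]   # a tiny, made-up token stream
data = np.array(token_ids, dtype=np.uint16)      # 16 bits per ID (enough for a
                                                 # vocabulary of up to 65,536 tokens;
                                                 # bigger vocabularies need more bits)

data.tofile("pretraining_corpus.bin")            # written to disk as raw 0s and 1s

restored = np.fromfile("pretraining_corpus.bin", dtype=np.uint16)
print(restored)         # [50256   464  3290   318   257  1332]
print(restored.nbytes)  # 12 bytes, i.e. 96 high or low voltages
```

A real corpus is the same idea scaled up to trillions of tokens.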
Note: In an old-fashioned library, none of the knowledge represented in the books (“storage devices”) is self-aware, i.e. there is no active knowledge or genuine ‘intelligence’. In exactly the same way, none of the data represented in the transistors of a computer is genuine intelligence either. It’s artificial (just like the conventional symbols [letters] in a book are artificial).
Will these stored tokens become genuine intelligence in the next, training, stage? Stay tuned.