From carbon atoms forged in dying stars, to neurons firing in your skull, to silicon learning to see: the improbable chain that led to artificial intelligence. A first-principles and historical journey through neural foundations, backpropagation, and recurrence.
Our brain does it through something like energy minimization, given billions of years of data, by leveraging the laws of physics and chemistry.
Humanity has done its best to reverse engineer this process.
It started with the neuron, the brain's fundamental building block, which became visible once Golgi developed his staining technique. Others built on this work, and we discovered that our brain is an incredibly complex mess of roughly 86 billion neurons and on the order of a quadrillion connections between them, called synapses.
And that the biggest part of our brain, the cerebrum, is broadly the same building block copied over and over again, with neurons simply exciting or inhibiting one another via action potentials.
Broadly.
In truth, we barely understand it (but we do understand much more than my simplified description that I wrote for the intro).
We do know it has a remarkably efficient ability to map the data distribution of reality.
So how do we make sand think?
We copy other stardust that can think.
We won't get philosophical here. For the purposes of this article, thinking will mean something narrower, operational, and useful.
Across neuroscience and modern AI, there is growing convergence on a simple idea:
Thinking is the construction and use of internal models that capture the structure of the world.
Different researchers emphasize different aspects of this process, but the core remains the same.
These perspectives are not in conflict. Prediction requires a model of the world, and good models are those that capture its true structure.
So for the rest of this article, we will use the following working definition:
Thinking is understanding the world well enough to predict it, and to imagine how it could be otherwise.
That definition is deliberately minimal. It avoids consciousness, qualia, or meaning, and instead focuses on what can be implemented, measured, and scaled.
The rest of this article is about how we approximate this process in silicon.
Energy Minimization
There's a deep connection here to physics. Intelligent systems seem to find minimum energy configurations: stable points where predictions match reality. In thermodynamic terms, learning can be viewed as a process that maximizes entropy over time while maintaining internal structure. The second law of thermodynamics may be more fundamental to intelligence than we realize.
The brain's fundamental unit of computation is the neuron. Each neuron receives signals from thousands of other neurons through connections called synapses. When enough excitatory signals accumulate, the neuron "fires" and sends an electrical pulse called an action potential down its axon to connected neurons.
Some connections are excitatory (they encourage firing), others are inhibitory (they suppress it). The strength of each connection determines how much influence one neuron has over another.
Modern neuroscience increasingly views the brain as a prediction machine. Your brain isn't passively receiving sensory data. It's constantly generating predictions about what it expects to perceive, then comparing those predictions against actual input.
Let's trace through a concrete example. Watch the animation as you read.
You see a shape.
Light reflects off the object and enters your eye. Your retina converts these photons into electrical signals that travel through the optic nerve to your visual cortex.
Here's where it gets interesting. Before the signal fully propagates, your brain has already made a prediction about what it's seeing based on context, prior experience, and partial information.
Let's say your brain predicts "square" but the shape is actually a circle. This mismatch generates a prediction error.
The error signal propagates backward through the network. This is the key moment: synaptic connections that contributed to the wrong prediction get weakened, while connections that could have led to the correct answer get strengthened.
This is learning. Through mechanisms like long-term potentiation (LTP) and long-term depression (LTD), the physical structure of synapses changes based on prediction errors.
The next time you see the same shape, the updated connections produce the correct prediction: "circle." The prediction matches reality. No error signal. The pattern is reinforced.
The Core Loop
Predict → Compare → Error → Update → Repeat
This cycle runs continuously, billions of times per second, across billions of neurons. It's how the brain builds and refines its internal model of the world.
Let's start from something incredibly simple. Forget images, forget language. Let's predict a parabola.
Our "universe" follows a simple rule: y = ax² + bx + c. A human living in this universe would just need to learn three numbers (a, b, and c) to have a complete model of their world.
Here's the ground truth, the "laws of physics" in our toy universe:
But of course, a being in this universe doesn't get to see the clean equation. They observe noisy samples:
Now the prediction problem becomes clear: given only these noisy observations, can we figure out the underlying rule?
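Here's one way such a universe might look in plain Python. The coefficients and noise level are assumptions for illustration, not the article's actual values:

```python
import random

# Hidden "laws of physics" -- a being in the universe never sees these.
# (Illustrative coefficients, chosen arbitrarily for this sketch.)
A_TRUE, B_TRUE, C_TRUE = 2.0, -3.0, 1.0

def ground_truth(x):
    """The clean underlying rule: y = ax^2 + bx + c."""
    return A_TRUE * x**2 + B_TRUE * x + C_TRUE

def sample_universe(n=50, noise=0.5, seed=0):
    """Observe n noisy (x, y) points from the hidden parabola."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    return [(x, ground_truth(x) + rng.gauss(0.0, noise)) for x in xs]

data = sample_universe()
```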
Before we can improve, we need to know how wrong we are. Let's start with an arbitrary guess for a, b, and c.
How bad is this guess? We can measure it by looking at how far off each prediction is from the actual data:
The loss function quantifies our wrongness. A simple choice is the Mean Absolute Error (MAE): add up all the ways we were wrong, then average.
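A sketch of MAE in plain Python (the coefficients 2, -3, 1 are illustrative, not the article's actual values):

```python
def predict(x, a, b, c):
    """Our model of the universe: a parabola with guessed parameters."""
    return a * x**2 + b * x + c

def mae(data, a, b, c):
    """Mean Absolute Error: average |prediction - observation|."""
    return sum(abs(predict(x, a, b, c) - y) for x, y in data) / len(data)

# Noiseless samples of y = 2x^2 - 3x + 1 for a sanity check:
clean_data = [(x, 2 * x**2 - 3 * x + 1) for x in range(-3, 4)]
```

A perfect guess gives zero loss on noiseless data; any other guess gives a positive loss.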
Our goal is now clear: find the values of a, b, and c that minimize this loss.
One approach: try different values for a and see if the loss goes down. Then do the same for b. Then c.
Let's try it:
This is tedious. And there's a deeper problem: the parameters are interdependent. Improving a changes how b and c should be adjusted. They're all tangled together.
Let's visualize what the loss looks like as we vary just a:
And for b:
When we consider two parameters at once, we get a loss surface:
See that valley? That's where the optimal parameters live. We need to find our way there.
Here's the key insight: the derivative tells us not just whether to go up or down, but how fast the loss is changing in each direction.
The gradient is just the vector of partial derivatives, one for each parameter. It points "uphill" toward increasing loss.
So if we want to decrease loss, we go in the opposite direction: new parameter = old parameter − η × gradient.
Here η is the learning rate, which controls how big a step we take. Too big and we overshoot. Too small and we're slow.
Let's see one gradient descent step:
Now we can run the full loop. Watch the curve converge:
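Putting it together in plain Python. One note: this sketch uses squared error instead of MAE because its gradients are smooth, and the hidden coefficients being recovered (2, -3, 1) are illustrative:

```python
def predict(x, a, b, c):
    return a * x**2 + b * x + c

def gradient_step(samples, a, b, c, lr=0.01):
    """One gradient descent step on mean squared error."""
    n = len(samples)
    ga = gb = gc = 0.0
    for x, y in samples:
        err = predict(x, a, b, c) - y   # signed prediction error
        ga += 2 * err * x**2 / n        # d(loss)/da
        gb += 2 * err * x / n           # d(loss)/db
        gc += 2 * err / n               # d(loss)/dc
    # Step downhill, against the gradient:
    return a - lr * ga, b - lr * gb, c - lr * gc

# Recover the hidden coefficients 2, -3, 1 from clean samples:
samples = [(x / 2, 2 * (x / 2)**2 - 3 * (x / 2) + 1) for x in range(-6, 7)]
a, b, c = 0.0, 0.0, 0.0
for _ in range(5000):
    a, b, c = gradient_step(samples, a, b, c)
```

Starting from all zeros, the loop walks straight down the loss surface to the true parameters.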
Computers are really fast at this. What takes us minutes to think through, a GPU can do billions of times per second.
We cheated. We knew the universe was a parabola, so we looked for a, b, and c. What if we didn't know the functional form at all?
Maybe we could approximate any curve by adding up simple functions. Lines are simple. What if we add a bunch of lines together?
Let's try:
Hmm. Adding lines gives us... another line. Linear functions are closed under addition: their sum is always linear, no matter how many you add.
We need to break linearity.
Enter the ReLU (Rectified Linear Unit): the simplest possible nonlinearity.
It's just "if negative, make it zero." That's it.
Now let's add our lines again, but with ReLU applied to each:
The sum is no longer a line! By adding "bent" lines together, we can approximate curves.
Add more bent lines, and we can approximate any continuous function arbitrarily well:
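A sketch of the idea: three hand-picked bent lines that trace y = x² on [0, 3]. The slopes and hinge points here are chosen by hand for illustration; a real network would learn them by gradient descent:

```python
def relu(x):
    """The Rectified Linear Unit: if negative, make it zero."""
    return max(0.0, x)

def bent_line(x, w, b):
    """A line w*x + b with its negative part flattened by ReLU."""
    return relu(w * x + b)

def approx_square(x):
    """Piecewise-linear approximation of y = x^2 on [0, 3]."""
    return (1.0 * bent_line(x, 1.0, 0.0)    # kicks in at x = 0
          + 2.0 * bent_line(x, 1.0, -1.0)   # kicks in at x = 1
          + 2.0 * bent_line(x, 1.0, -2.0))  # kicks in at x = 2
```

The sum is exact at the hinge points and close in between; adding more bent lines shrinks the gap everywhere.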
This is the Universal Approximation Theorem: with enough ReLU'd linear functions, we can approximate any continuous function on a bounded domain to arbitrary precision.
Terminology Mapping
Now we can map our intuitions to standard ML jargon:
Let's put it all together to create the simplest unit of "artificial thinking": the perceptron.
A perceptron takes inputs, multiplies each by a weight, adds them up, adds a bias, and passes the result through an activation function:
Compare this to a biological neuron:
Brain:
Silicon:
The silicon version is cruder but more explicit. We can see exactly what's happening, measure it, and optimize it.
Here's a single perceptron in code:
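A minimal sketch in plain Python, with ReLU chosen as the activation for illustration:

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of inputs, plus bias, through an activation (ReLU)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, total)
```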
Stack multiple perceptrons together and you get a layer. Stack layers and you get a neural network.
A simple 2-layer network:
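One possible sketch, with weights passed in explicitly (a real implementation would learn them through training):

```python
def layer(inputs, weights, biases):
    """One layer = several perceptrons sharing the same inputs.

    Each row of `weights` holds one perceptron's weights."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def two_layer_net(inputs, w1, b1, w2, b2):
    hidden = layer(inputs, w1, b1)   # first layer of perceptrons
    return layer(hidden, w2, b2)     # second layer reads the first's outputs
```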
As Andrej Karpathy puts it, we've created (in the latent space of intelligence) something closer to spirits than animals. These systems learn, but they don't learn like we do. They optimize, but they don't understand. They predict, but they don't know.
Or do they? That's a question for another article.
The Chinese Room
Understanding language seems to require a robust world model. But consider Searle's Chinese Room: a person who doesn't speak Chinese sits in a room with rulebooks. They receive Chinese characters, follow rules to produce responses, and appear fluent without understanding a word.
Are our neural networks doing something similar? They manipulate symbols according to learned patterns. Whether this constitutes "understanding" remains one of AI's deepest questions.
Every modern neural network, from GPT-4 to AlphaFold, follows the same fundamental loop:
1. Initialize - Start with random weights
2. Forward pass - Push data through the network
3. Compute loss - Measure how wrong we are
4. Backward pass - Calculate gradients
5. Update weights - Take a step downhill
6. Repeat - Until loss stops decreasing
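The six steps above can be sketched end to end for a deliberately tiny model, y = w · x (a one-parameter illustration, not a real network):

```python
import random

def train(data, lr=0.05, epochs=200, seed=0):
    """The six-step loop for the one-parameter model y = w * x."""
    w = random.Random(seed).uniform(-1.0, 1.0)                # 1. initialize
    for _ in range(epochs):                                   # 6. repeat
        preds = [w * x for x, _ in data]                      # 2. forward pass
        loss = sum((p - y)**2
                   for p, (_, y) in zip(preds, data)) / len(data)  # 3. loss
        grad = sum(2 * (p - y) * x
                   for p, (x, y) in zip(preds, data)) / len(data)  # 4. backward
        w -= lr * grad                                        # 5. update
    return w
```

Trained on points from y = 3x, the loop recovers w ≈ 3 regardless of the random start.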
This is the same predict-compare-update loop we saw in the brain, just made explicit and computable.
But there's a problem. We want our network to process language so it can "think" about text. Our parabola example used simple numbers as input. Language is different: it's symbolic, discrete, and our perceptrons only understand continuous numerical values.
How do we encode language so it can flow through this training loop with the network architecture we've established?
Our eyes turn photons into neural activation. Our ears turn pressure waves into neural activation. Our perceptron needs numbers.
So how do we turn language into numbers?
First, we break text into pieces called tokens. These could be words, characters, or something in between.
Modern systems use subword tokenization, a middle ground that handles common words as single tokens while breaking rare words into familiar pieces.
Try it yourself:
See how text gets broken down into tokens
Byte Pair Encoding (BPE)
The most common tokenization algorithm is BPE. It starts with individual characters, then iteratively merges the most frequent adjacent pair into a new token. Repeat until you reach the desired vocabulary size.
"low" + "er" → "lower" becomes a single token if it appears often enough.
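A toy sketch of the merge loop (the corpus here is illustrative):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Find the most common adjacent pair of tokens."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    """Replace each occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Two merge rounds on a toy corpus: characters -> "lo" + "w" -> "low"
tokens = list("lowlowlow")
for _ in range(2):
    tokens = merge(tokens, most_frequent_pair(tokens))
```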
Once we have tokens, we assign each one an ID: just an index in our vocabulary.
Try converting tokens to numbers:
Convert tokens to numeric IDs using a vocabulary lookup
Now we have numbers. But there's a problem: these IDs are arbitrary. "cat" being ID 1 and "dog" being ID 2 doesn't tell the network that cats and dogs are similar (both animals) while "happy" is something different (an emotion).
The solution: instead of feeding raw IDs, we look up each ID in a learned embedding matrix. Each row is a vector of continuous numbers representing that word.
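A sketch of the lookup, with a toy vocabulary and randomly initialized vectors (training would adjust these rows by gradient descent; the words and dimension are illustrative):

```python
import random

# Toy vocabulary: token -> ID.
vocab = {"cat": 0, "dog": 1, "happy": 2, "sad": 3}
DIM = 2  # real systems use hundreds of dimensions

# Embedding matrix: one row of DIM numbers per token ID,
# randomly initialized here -- training would move these vectors.
rng = random.Random(42)
embedding_matrix = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)]
                    for _ in vocab]

def embed(token):
    """Token -> ID -> row of the embedding matrix."""
    return embedding_matrix[vocab[token]]
```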
With just 2 dimensions, we can imagine what each dimension might encode:
"cat" and "dog" cluster together (both animals). "happy" and "sad" are far apart on the sentiment axis:
Real embeddings use hundreds of dimensions. The features become abstract and hard to name, but the principle remains: similar words end up near each other in this high-dimensional space.
These embeddings are learned through the same gradient descent process we saw earlier. The network starts with random vectors and adjusts them during training, gradually discovering which words should cluster together.
Learned, Not Programmed
Nobody tells the network that "cat" and "dog" should be similar. It discovers this by seeing them used in similar contexts. Just like our brains learn features through experience rather than explicit programming, neural networks learn their own representations through training.
The 2D features above ("animal-ness", "sentiment") are just for illustration. Real embeddings have hundreds of dimensions encoding features we can't easily name.
We have our network architecture, we know how to encode language, and we understand the training loop. Before we can actually train, there are a few more concepts we need to understand. This is where machine learning gets its jargon.
In practice, we do not train on one example at a time. We use batches: groups of examples processed together.
Why batch? Two reasons. First, efficiency. GPUs are designed for parallel computation. Processing 32 examples at once is barely slower than processing 1. Second, stability. Gradients computed from a single example can be noisy. Averaging over a batch gives a more reliable signal.
We also split our data into three sets: a training set to learn from, a validation set to tune on, and a test set we touch only at the very end.
This separation prevents a subtle trap: if we tune our model to perform well on the same data we test on, we might just be memorizing rather than learning.
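A minimal sketch of such a split (the 80/10/10 ratios are a common convention, not the article's prescription):

```python
import random

def split(data, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle, then carve into train / validation / test sets."""
    data = data[:]  # copy so we don't reorder the caller's list
    random.Random(seed).shuffle(data)
    n = len(data)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return data[:i], data[i:j], data[j:]
```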
There are two ways to measure how well our network is doing, and they serve different purposes.
Accuracy is what we actually care about: the percentage of examples classified correctly. If we predict "positive" for 85 out of 100 positive reviews, our accuracy is 85%.
Loss is what we optimize. It's a differentiable proxy that tells us not just whether we were wrong, but how confident we were in our wrong answer.
Why Not Just Optimize Accuracy?
Accuracy is not differentiable. An example is either correct or incorrect; there's no gradient to follow. Loss provides a smooth landscape where gradient descent can work.
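To make the contrast concrete, here is a sketch with hypothetical predicted probabilities for binary labels:

```python
import math

def accuracy(probs, labels):
    """Fraction of examples where the predicted class matches the label."""
    return sum((p > 0.5) == y for p, y in zip(probs, labels)) / len(labels)

def cross_entropy(probs, labels):
    """Smooth loss: punishes confident wrong answers more heavily."""
    return -sum(math.log(p if y else 1.0 - p)
                for p, y in zip(probs, labels)) / len(labels)
```

Two models can have identical accuracy while one has much lower loss, because the loss also rewards confidence in correct answers.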
During training, we track both metrics. Loss tells us how well the optimizer is working, while accuracy tells us how well the model is actually performing on the task we care about.
One pass through the entire training set is called an epoch. Training typically involves many epochs. Early on, loss drops rapidly as the network learns the basic patterns. Later, improvements become smaller.
The Foundation Is Set
We now have all the pieces: networks that can learn, language converted to numbers, and a training process to tie it all together. The core loop is simple: predict, compare, update, repeat.
In the next section, we'll put these pieces together and watch a neural network learn. You'll train a sentiment classifier yourself and see exactly how data, compute, and scale determine what these systems can do.