I spent 2 weeks playing god using genetic algorithms. Here are my learnings.

I spent two weeks building an evolution simulator where my creatures learned to "walk" toward food. I ran millions of generations across 597 genetic lineages to create the optimal species. This is a journal of that process: implementing the papers, fixing bugs, counterintuitive walls I ran into, and working with Claude Code.

Nine years ago I saw a YouTube video from Carkh showing simple creatures evolving to pick up pellets. For the longest time I've wanted to create my own version, but it was never a priority.

Until I nerd sniped myself. While writing my "I was a top 0.01% Cursor user. Here's why I switched to Claude Code 2.0" article I wanted to show off what Claude Code could do and thought this was one of the coolest things I could one shot, grab a gif, and move on.

281 commits later, with the computational application of Darwin's consecrated knowledge running through my cortical connections, I have a working evolution simulator.

I used Claude Code to help me build this. I'll be honest about where that helped and where it didn't.

All code is at github.com/SilenNaihin/genetic-algorithm.

Our creatures

Evolution was able to create the most complex collections of matter in the universe (ourselves).

Nature doesn't have access to backpropagation or even local learning rules as in the brain. It has to use population level rules that comply with the laws of physics.

Genetic algorithms simulate this Darwinian process: measure how well an organism does in an environment, murder them in cold blood if they aren't performing well, and let the rest reproduce with a chance of mutation. Repeat.

Our creatures are made of nodes (spheres) connected by muscles (springs).

Nodes have friction and size:

Muscles have a rest length (natural length), stiffness (how hard it pulls), and damping (how quickly it settles):

The muscle pulls toward its rest length using Hooke's law with damping:

Spring Force (Hooke's Law + Damping)
$$\vec{F} = -k\,(|\vec{\Delta x}| - L_0)\,\hat{d} - c\,(\vec{v}_{rel} \cdot \hat{d})\,\hat{d}$$

Where $k$ is stiffness, $c$ is damping, $L_0$ is rest length, $|\vec{\Delta x}|$ is the current length, and $\hat{d}$ is the unit direction between nodes.
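
In code, the per-muscle force is a few lines. A minimal NumPy sketch (my own variable names, not the project's; the force on the other node is the negation):

```python
import numpy as np

def spring_force_on_a(pos_a, pos_b, vel_a, vel_b, rest_length, stiffness, damping):
    """Damped spring (muscle) force applied to node A; node B gets the negation."""
    delta = pos_b - pos_a
    length = np.linalg.norm(delta)
    direction = delta / (length + 1e-9)                 # unit vector from A toward B

    stretch = stiffness * (length - rest_length)        # Hooke's law: pull toward rest length
    rel_speed = np.dot(vel_b - vel_a, direction)        # velocity along the muscle axis
    return (stretch + damping * rel_speed) * direction  # damping resists oscillation
```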

Super simple right? Well unfortunately the constraints of reality aren't baked into a physics sim by default. To give you a taste:

  • Muscles could initially contract to zero length or extend infinitely. That's not how muscles work. I had to clamp contraction to a percentage of rest length.
  • My damping was initially too low and as a result muscles would oscillate wildly and fail to settle whenever the creature touched the ground. Even without any locomotion mechanisms (without any muscles contracting). Increased damping from 0.5 to 3.0 (b043a97).

The fitness function

This is how we calculate how well a creature performed. The fitness function defines what "good" means.

Fitness
$$F = 100 \cdot P_{collected} + P_{progress} + D_{travel} - E_{cost} - R_{penalty}$$

Alright, Claude, please look at my codebase and make a list of the different components of the fitness function we ended up with. Make no mistakes. (It made mistakes, and it would have been quicker for me to write this out myself):

  1. Pellet collection: 100 points per pellet. When you collect, your progress converts to collection points (not added on top).
  2. Progress toward current pellet: 0-80 points based on how much closer you got. Measured from the edge of the creature, not the center.
  3. Distance traveled: 0-20 points, capped. Ground distance in the XY plane only (not vertical movement).
  4. Efficiency penalty: Penalizes excessive muscle activation (encourages efficient movement).
  5. Regression penalty: Penalizes moving away from the pellet (only after first collection).
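
Roughly, in code (a toy sketch of the terms above with illustrative weights, not the repo's exact function):

```python
def fitness(pellets, progress, travel, activation_cost, regression, collected_any):
    """Toy version of the fitness terms listed above."""
    score = 100.0 * pellets                 # each collected pellet banks 100 points
    score += min(progress, 80.0)            # progress toward the current pellet, edge-measured
    score += min(travel, 20.0)              # capped ground (XY) distance traveled
    score -= activation_cost                # efficiency penalty for excessive muscle activation
    if collected_any:
        score -= regression                 # regression penalty, only after the first pellet
    return score
```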

Getting the fitness function right was 10x more difficult than getting Claude to understand the nuance of our current fitness function.

Progress banking bug

When a creature collected a pellet, their fitness would drop to 20 instead of keeping the 100 points. Progress was being reset to 0, and a pellet was just adding 20.

"Claude pls what was not clear we want to bank progress at 100 points when we collect a pellet" (0f7f946).

Progress baseline position

Progress was being calculated from where the creature spawned, not from where it was when it picked up the last pellet. "Claude pls reset the baseline position after each collection"

Center vs edge calculation

I was measuring distance from creature center to pellet center. But creatures have different sizes. A large creature could "reach" a pellet while its center was still far away. Had to calculate from the edge of the creature instead.

The edge calculation itself was tricky: I needed a stable radius from the genome (rest state), not the current physics state. Otherwise the radius oscillates with muscle animation and fitness swings wildly (3bde5ec).

Uncapped distance reward

I added the 20-point bonus for distance traveled to give creatures a gradient to climb while they haven't yet learned to move in any particular direction.

Claude decided to interpret this as making the reward absolute to "encourage more movement". Below is the result for the kind of creatures we evolved. Sad to think so many locomotive creatures were exterminated because their environment was so hostile.

Part 1: Brainless oscillation

My first attempt to get the creatures to optimize toward this fitness function was to give the muscles evolvable oscillation parameters: amplitude (range of contraction), frequency (oscillation speed), and phase offset (timing in the cycle).

Instead of pulling toward a fixed rest length, muscles now pull toward an oscillating target:

Target Length
$$L(t) = L_0 \cdot \left(1 - A \cdot \sin(2\pi f t + \phi)\right)$$

Where $A$ is amplitude, $f$ is frequency, and $\phi$ is phase offset. The spring force now pulls toward $L(t)$ instead of $L_0$:

Spring Force with Oscillation
$$\vec{F} = -k\,(|\vec{\Delta x}| - L(t))\,\hat{d} - c\,(\vec{v}_{rel} \cdot \hat{d})\,\hat{d}$$
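
A sketch of the oscillating target (illustrative names; the force function stays the same, just with $L(t)$ substituted for $L_0$):

```python
import math

def target_length(rest_length, amplitude, frequency, phase, t):
    """L(t) = L0 * (1 - A * sin(2*pi*f*t + phase)); contraction is clamped elsewhere."""
    return rest_length * (1.0 - amplitude * math.sin(2.0 * math.pi * frequency * t + phase))
```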

First we catapult the bottom 50% (roughly) of creatures out of the gene pool based on fitness.

Then survivors reproduce, either through direct cloning or crossover ;) with another survivor, always followed by mutation.

Finally, we simulate the new generation and measure their fitness. Repeat.
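
One generation of that loop, sketched (function names like `evaluate`, `mutate`, and `crossover` are placeholders for whatever your sim provides, not the repo's API):

```python
import copy
import random

def next_generation(population, evaluate, mutate, crossover,
                    survival_rate=0.5, crossover_rate=0.5):
    """Score everyone, cull the bottom half, refill with mutated clones/offspring."""
    ranked = sorted(population, key=evaluate, reverse=True)
    survivors = ranked[: max(2, int(len(ranked) * survival_rate))]

    children = []
    while len(survivors) + len(children) < len(population):
        parent = random.choice(survivors)
        if random.random() < crossover_rate:
            child = crossover(parent, random.choice(survivors))
        else:
            child = copy.deepcopy(parent)          # direct cloning
        children.append(mutate(child))             # offspring are always mutated
    return survivors + children
```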

At this point our creatures are brainless oscillators.

Naturally, several problems emerged.

Sometimes the simulation would just explode. Creatures would fly off to infinity. I had to add checks to disqualify creatures with invalid or NaN fitness values. I say this plainly, but there were many things that were causing this. For example: (6715202).

Pellets were spawning too close to the creature. A creature could collect multiple pellets without moving much at all, just by being in the right spot when the next pellet appeared.

The fix: spawn pellets at least 5 units away from the creature's edge (not center), in the semicircle opposite to the creature's current direction of motion. This forces the creature to actually travel to collect each pellet.
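
A sketch of that spawn rule in 2D (names and the exact distance range are mine):

```python
import math
import random

def spawn_pellet(center_xy, creature_radius, motion_dir_xy, min_gap=5.0, extra=10.0):
    """Place a pellet past the creature's edge, in the semicircle opposite its motion."""
    # angle pointing opposite to the current direction of motion
    away = math.atan2(-motion_dir_xy[1], -motion_dir_xy[0])
    angle = away + random.uniform(-math.pi / 2, math.pi / 2)       # opposite semicircle
    dist = creature_radius + min_gap + random.uniform(0.0, extra)  # measured from the edge
    return (center_xy[0] + dist * math.cos(angle),
            center_xy[1] + dist * math.sin(angle))
```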

Our best creatures with pure oscillation mechanics evolved to spaz out in a radius and occasionally bump into pellets. Which is pretty much all we could hope for without any ability to respond to the environment.

So let's upgrade the genotype. Time to IQ max.

Part 2: Adding brains

Each creature gets a small feedforward network: sensory inputs → hidden layer → muscle outputs. The network outputs one value per muscle in $[-1, 1]$, which directly controls muscle length: $L(t) = L_0 \cdot (1 - y_m)$. An output of +1 means fully contracted, -1 means fully extended.

| Input type | Count | What it tells the creature |
|---|---|---|
| Pellet direction | 3 | Where is the food? (unit vector: x, y, z) |
| Velocity direction | 3 | Which way am I moving? (x, y, z) |
| Distance to pellet | 1 | How far is the food? |
| Time encoding | 0-2 | What time is it in the simulation? (e.g. oscillates between -1 and 1 every 2 sec) |
| Muscle strain | 0-15 | How stretched is each muscle? (x, y, z for each muscle) |
| Node velocities | 0-24 | How fast is each body part moving? (x, y, z for each node) |
| Ground contact | 0-8 | Which parts are touching the ground? (0 or 1 for each node) |

The basic version uses 7 inputs (pellet direction, velocity, distance). The full version can include proprioception (muscle strain, node velocities, ground contact) for up to 54 inputs total. Hidden layer size is configurable (8-32 neurons typical).
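
The controller itself is tiny. A sketch of the forward pass (weight names mirror the genome keys used later, but this isn't the repo's exact code):

```python
import numpy as np

def muscle_outputs(inputs, weights_ih, biases_h, weights_ho, biases_o):
    """Feedforward pass: sensory inputs -> tanh hidden layer -> one output per muscle in [-1, 1]."""
    hidden = np.tanh(weights_ih @ inputs + biases_h)
    return np.tanh(weights_ho @ hidden + biases_o)

def muscle_targets(rest_lengths, outputs):
    """L = L0 * (1 - y): +1 means fully contracted, -1 means fully extended."""
    return rest_lengths * (1.0 - outputs)
```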

Now there is no base oscillation anymore. The network has full control over when and how each muscle contracts.

And the creatures failed to learn anything. Even their spazzing was ineffective.

I decided to take matters into my own hands. I asked Claude something like "what is wrong with our creatures? make no mistakes or else a random child across the world will lose their favorite stuffed animal".

The conversation that followed made me realize I can't delegate everything to Claude without understanding the codebase myself.

Basically, a lot had gotten lost in the details. Some examples:

  1. We were using Xavier initialization, which clusters weights near zero. For GA, you want more variance so the initial population explores different behaviors, not all starting with near-silent outputs.
  2. Any non-zero output activated muscles. An output of 0.01 still causes 1% contraction, which means the network can never produce true silence. I added a dead zone to the output neurons (sketched after this list).
  3. NN outputs were updating every physics step. At 60 FPS, muscles get new target lengths 60 times per second. Small input changes cause rapid output oscillation, making creatures jitter chaotically. Fixed by caching outputs for 4 physics steps (6c94e32) and adding exponential smoothing (e97d3ef).
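
Here's roughly what those three fixes look like together (thresholds and the smoothing factor are illustrative, not the repo's values):

```python
import numpy as np

class OutputFilter:
    """Dead zone + output caching + exponential smoothing for network outputs."""

    def __init__(self, n_muscles, dead_zone=0.05, alpha=0.3, hold_steps=4):
        self.dead_zone = dead_zone
        self.alpha = alpha                      # smoothing strength per physics step
        self.hold_steps = hold_steps            # re-query the network every N steps
        self.cached = np.zeros(n_muscles)
        self.smoothed = np.zeros(n_muscles)
        self.step = 0

    def apply(self, raw_outputs):
        if self.step % self.hold_steps == 0:
            # small outputs snap to zero so the net can produce true silence
            self.cached = np.where(np.abs(raw_outputs) < self.dead_zone, 0.0, raw_outputs)
        self.step += 1
        # ease toward the cached target instead of jumping every frame
        self.smoothed += self.alpha * (self.cached - self.smoothed)
        return self.smoothed
```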

After diving into the details and fixing things, I saw improvement continue past 20 generations for the first time.

Mutation strategies

There are many reproduction strategies

| Crossover type | What it does | Trade-off |
|---|---|---|
| Uniform | Each weight randomly from parent A or B | Maximum mixing, can destroy coordinated weights |
| Interpolation | Weighted average between parents | Smoother blending, less exploration |
| Single-point | All weights before point from A, after from B | Preserves local structure, less mixing |

and mutation strategies

| Mutation type | What it does | When it helps |
|---|---|---|
| Weight perturbation | Add Gaussian noise to existing weights | Fine-tuning an already good solution |
| Weight replacement | Replace weight with new random value | Escaping local optima, exploring new regions |
| Body mutation | Modify node and muscle parameters | Evolving morphology alongside behavior |
| Structural (NEAT, more on this later) | Add/remove neurons and connections | Finding simpler or more complex architectures |

that I experimented with.

For weight mutations, magnitude matters a lot (9324dec). Weight perturbation adds Gaussian noise with standard deviation σ to each weight. But when you do this across many weights, the total displacement in weight space scales with the square root of dimensions:

Expected Displacement
$$\mathbb{E}[\|\Delta w\|] = \sigma \sqrt{n}$$

Think of the neural network as a single point in high dimensional space, where each weight is one coordinate. A network with 200 weights is a point in $\mathbb{R}^{200}$. When you mutate, you move from one point to another. The "distance" is just the L2 norm between old and new weight vectors.

High-dimensional noise explodes in norm

σ isn't just a per-weight tweak. In high dimensions, it defines how far the entire network jumps as a function. Even tiny per-weight noise becomes a huge functional move once you aggregate across hundreds of dimensions.

In a ~200 dimensional network: σ = 0.3 gives $0.3 \times \sqrt{200} \approx 4.2$. Since individual weights are typically magnitude ~1, moving 4.2 units means many weights changed by ~30%. You've left the local basin and the network's behavior is mostly destroyed. That's a random restart, not optimization. σ = 0.05 gives $0.05 \times \sqrt{200} \approx 0.7$: small coordinated nudges across many weights. The network function is mostly preserved. You're still on the same fitness ridge and can hill-climb.
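
You can check the scaling numerically in a couple of lines:

```python
import numpy as np

n = 200  # roughly the number of weights in the controller
for sigma in (0.3, 0.05):
    # sample many mutations and measure how far the whole weight vector moves
    jumps = np.linalg.norm(np.random.normal(0.0, sigma, size=(10_000, n)), axis=1)
    print(f"sigma={sigma}: mean jump {jumps.mean():.2f}, predicted {sigma * np.sqrt(n):.2f}")
```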

Our later neural architecture search confirmed this: aggressive body mutation with conservative weight mutation worked best. Focus evolution on morphology, let weights fine-tune.

What creatures actually learned

I expected creatures to evolve walking gaits: rhythmic, coordinated movements like animals. They didn't. I built an activation analysis notebook to understand what was actually happening (with the help of Claude Code of course).

The dominant oscillation frequency was 0.17 Hz, much slower than typical locomotion gaits. Creatures evolved aperiodic, exploratory movements that happen to reach pellets. They didn't walk, they strategically flailed.

The best performing creatures had a mean output of -0.12, with most outputs hovering near zero (in the deadzone). The failing creatures had mean positive outputs and more chaotic activation patterns.

The diversity collapse

After a few successful runs, I noticed a pattern. Runs would improve for up to 50 generations, then plateau. Looking at the population, everyone had converged to the same strategy. The top 50% survive, they're all similar, they breed, offspring are even more similar. Eventually everyone is a minor variation of the same local optimum.

This is a known problem. I started reading about diversity maintenance: fitness sharing, tournament selection, and how the famous NEAT paper does it.

Selection strategies

I experimented with three selection methods:

| Method | How it works | Trade-off |
|---|---|---|
| Truncation | Kill bottom 50%, clone survivors | Simple but aggressive. Fast convergence, loses diversity quickly. |
| Rank | Selection probability proportional to rank, not raw fitness | Gentler pressure. The creature at rank 2 isn't 10x more likely to survive than rank 10. |
| Tournament | Pick k random creatures, best one survives | Stochastic. Weaker creatures in weak groups can survive, preserving diversity. |

Tournament selection (d3e7a8c) adds randomness. Pick k=3 creatures at random, keep the best. A mediocre creature in a group of three bad ones survives. This lets "stepping stone" genomes persist, ones that aren't great now but might lead somewhere good.
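
Tournament selection is a one-liner in spirit (a sketch, not the repo's implementation):

```python
import random

def tournament_select(population, fitness, k=3):
    """Pick k creatures at random and keep the best of that group."""
    return max(random.sample(population, k), key=fitness)

# e.g. survivors = [tournament_select(population, fitness) for _ in range(len(population) // 2)]
```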

In Theory.

After our neural architecture search, I realized that rank and tournament selection didn't help at all. Go figure.

Fitness sharing

If two creatures are similar, they split their fitness. This penalizes crowded regions of the search space. The intuition: imagine 10 creatures all clustered around the same local optimum. Without fitness sharing, they'd all survive and breed, making the population even more homogeneous. With fitness sharing, they divide the reward among themselves, so one novel creature exploring elsewhere might actually have higher effective fitness.

The formula (Goldberg & Richardson, 1987):

Fitness Sharing
$$f'_i = \frac{f_i}{1 + \sum_{j \neq i} sh(d_{ij})}$$

Each creature's fitness gets divided by a "niche count": how many similar creatures exist. The $sh(d)$ function determines how much two creatures "share" based on their distance:

Sharing Function
$$sh(d) = \begin{cases} 1 - \left(\frac{d}{\sigma_{share}}\right)^\alpha & \text{if } d < \sigma_{share} \\ 0 & \text{otherwise} \end{cases}$$

The key parameter is $\sigma_{share}$, the sharing radius. It defines "how different is different enough." If two creatures have distance $d < \sigma_{share}$, they're considered similar and share fitness. If $d \geq \sigma_{share}$, they're far enough apart to not affect each other.

When $d = 0$ (identical creatures), $sh(0) = 1$, meaning full sharing. As distance increases toward $\sigma_{share}$, sharing decreases linearly (when $\alpha = 1$). At the boundary and beyond, $sh(d) = 0$: no sharing.

For neural networks, I computed the RMS (root mean square) Euclidean distance across all weight matrices: flatten both networks' weights into vectors, compute the element-wise differences, square them, average, and take the square root. This gives a single number representing how different two brains are.

```python
import math

def _flatten(values) -> list:
    """Flatten a (possibly nested) list of weights into a flat list of floats."""
    flat = []
    for v in values:
        if isinstance(v, (list, tuple)):
            flat.extend(_flatten(v))
        else:
            flat.append(float(v))
    return flat

def neural_genome_distance(genome1, genome2) -> float:
    ng1 = genome1.get('neuralGenome') or {}
    ng2 = genome2.get('neuralGenome') or {}

    total_squared_diff = 0.0
    total_weights = 0

    # Compare all weight matrices
    for key in ['weights_ih', 'weights_ho', 'biases_h', 'biases_o']:
        w1 = _flatten(ng1.get(key, []))
        w2 = _flatten(ng2.get(key, []))

        min_len = min(len(w1), len(w2))
        for i in range(min_len):
            diff = w1[i] - w2[i]
            total_squared_diff += diff * diff
            total_weights += 1

        # Penalize size mismatch (topology difference)
        size_diff = abs(len(w1) - len(w2))
        total_squared_diff += size_diff * 4.0  # max diff squared
        total_weights += size_diff

    if total_weights == 0:
        return 0.0

    # Root mean squared distance
    return math.sqrt(total_squared_diff / total_weights)
```
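
With a distance function in hand, applying the sharing formula is a small loop (a sketch built from the equations above, not the repo's code):

```python
def shared_fitness(population, raw_fitness, distance, sigma_share=2.0, alpha=1.0):
    """Divide each creature's fitness by its niche count (Goldberg & Richardson, 1987)."""
    adjusted = []
    for i, a in enumerate(population):
        niche = 0.0
        for j, b in enumerate(population):
            if i == j:
                continue
            d = distance(a, b)                      # e.g. neural_genome_distance above
            if d < sigma_share:
                niche += 1.0 - (d / sigma_share) ** alpha
        adjusted.append(raw_fitness[i] / (1.0 + niche))
    return adjusted
```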

This didn't really help, but I didn't spend enough time debugging to find out why.

Instead I decided to implement a paper in which these things had already been solved and work. NEAT (NeuroEvolution of Augmenting Topologies).

Part 3: NEAT

NEAT asks 'are we limiting evolution by fixing the network structure?'

Every creature had the same architecture: 7 inputs, one hidden layer, N outputs. But some tasks may need more hidden neurons. And some connections could be useless.

Why am I still hand designing the topology of the network like a troglodyte instead of letting evolution figure it out? I should be evolution maxxing.

NEAT can mutate everything about the network topology:

| Mutation | What it does | Effect |
|---|---|---|
| Add connection | Creates a new connection between two unconnected nodes | Increases network connectivity |
| Add node | Splits an existing connection by inserting a node in the middle | Increases network depth/complexity |
| Mutate weight | Perturb (90%) or replace (10%) connection weight | Fine-tunes or escapes local optima |
| Enable connection | Re-enables a disabled connection | Can reactivate old genes |
| Disable connection | Disables an existing connection | Prunes connections without deleting them |

From these mutations, networks can start with zero connections and zero hidden nodes and grow to be as complex as needed.

These mutations mean every creature can have a different network structure.

But that creates a problem: how do you do crossover between two networks with different topologies?

Crossover with variable topology

NEAT's solution: every time a new connection or node is added anywhere in the population, it gets a globally unique ID called an innovation number. This lets you align genes from two parents by their historical origin, not their position in the genome.

Innovation numbers solve crossover alignment. When two parents have genes with the same innovation number, those genes came from the same ancestral mutation. They're homologous.

Genes that don't match are either disjoint (in the middle) or excess (at the end). The offspring inherits matching genes from either parent randomly, plus all disjoint/excess genes from the fitter parent.
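
A sketch of that alignment, assuming connection genes are stored in dicts keyed by innovation number (not the repo's exact data structures):

```python
import random

def neat_crossover(fitter_genes, weaker_genes):
    """Matching genes are inherited randomly; disjoint/excess genes come from the fitter parent."""
    child = {}
    for innovation, gene in fitter_genes.items():
        if innovation in weaker_genes:
            child[innovation] = random.choice([gene, weaker_genes[innovation]])  # homologous
        else:
            child[innovation] = gene          # disjoint or excess: keep the fitter parent's gene
    return child
```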

Speciation

NEAT uses these same concepts (matching, disjoint, excess genes) to measure how different two genomes are.

Instead of following our neanderthal truncation rules where the bottom 50% of creatures are vaporized into context, we can use speciation to protect new structures.

This is useful for a mutation that adds a node that hurts fitness initially. With speciation, it competes only against similar genomes, giving it time to optimize.

NEAT introduces a compatibility distance that determines whether two creatures belong to the same species:

Compatibility Distance
$$\delta = \frac{c_1 E + c_2 D}{N} + c_3 \bar{W}$$

Think of δ as "genome distance": a single number measuring how different two creatures are. More mismatched genes (E excess, D disjoint) and bigger average weight differences on matching genes (W̄) mean a higher distance; N normalizes by genome size.

You pick a threshold δ_t. If two creatures have δ = 2.3 and your threshold is δ_t = 3.0, they're in the same species (2.3 < 3.0). If instead δ = 4.1, they're different species (4.1 > 3.0).

To assign species I iterate through creatures in order and compare each creature to existing species representatives. If δ < δ_t, the creature joins that species. If no match, we start a new species with this creature as the representative.

Species are rebuilt from scratch each generation with the first creature assigned becoming the representative. This is simple, and we end up with however many species clusters naturally form in genome space.
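
The assignment loop is short (a sketch; `compatibility` stands in for whatever computes δ):

```python
def assign_species(population, compatibility, threshold=3.0):
    """Rebuild species each generation; the first member of each cluster is its representative."""
    representatives, species = [], []
    for creature in population:
        for rep, members in zip(representatives, species):
            if compatibility(creature, rep) < threshold:   # delta < delta_t: same species
                members.append(creature)
                break
        else:
            representatives.append(creature)               # no match: start a new species
            species.append([creature])
    return species
```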

If this still feels confusing, this video is what I watched to get a base-level understanding of NEAT.

To add more complexity, I had to solve the problem that standard NEAT assumes fixed input/output counts.

Our creatures can mutate their bodies by adding or removing muscles, i.e. output neurons, so creatures can have different output counts.

I added a term: $c_4 \, |O_1 - O_2|$, where $O$ is the number of output neurons (one per muscle).

This output count penalty is a pragmatic fix. A more principled approach would bind actuators to structure, as in Karl Sims' tree-structured genomes, where body parts and controllers are inherited together. HyperNEAT achieves a related effect by generating connections as a function of geometry, sidestepping explicit output alignment entirely. Future work!

When a muscle is added, I create a new output neuron with sparse random connections. When removed, I delete that output neuron and its connections.

Why? Imagine two creatures with identical hidden layers, same connections, same weights. Standard NEAT would say δ = 0, they're twins. But one has 3 muscles and the other has 5. They're solving completely different control problems, so they should be in different species. The output count term ensures this.

In speciation, each species runs its own selection proportionally. With a 50% survival rate, a species of 10 keeps 5, a species of 50 keeps 25. There's no cap on species size. The compatibility threshold controls how many species form, and selection is proportional within each.

Bugs everywhere

Canonical NEAT actually allows recurrent connections; cycles, self-loops, arbitrary directed graphs. I disabled recurrence for simpler debugging and because I didn't think memory was necessary for this task. A future direction would be to test with recurrence enabled.

| Bug | What happened | Commit |
|---|---|---|
| Cycles forming | Network execution hangs or loops forever | e28f706 |
| Invalid crossover | Output neurons used as connection sources | 9b5ff50 |
| Wrong output removed | Deleting a muscle removed the wrong neuron | 9a28945 |
| Hidden nodes at wrong depth | Hidden neurons overlapping inputs in the visualizer | c93b8b1 |
| Clones not mutating | 50% of population frozen (not evolving) | 849cb4e |
| Rates 10x too low | Using 5%/3% instead of NEAT-standard 50%/20% | 43e02d3 |

Etc.

Most of these bugs came from letting Claude have its way without providing specific enough instructions.

If you're curious about the specifics, read the original paper, which covers details I skipped. For example, an input bias node. Crazy.

So how do we perform? Empirically good.

NEAT created the most "creature like" behaviors I could get. The two above are clearly able to walk and have a solid sense of direction.

But objectively bad.

I couldn't get NEAT runs to pick up more than 2 pellets, and the average rarely crossed 10 points per creature.

Time to pull out the BIG GUNS.

Neural architecture search

At this point I had 20+ hyperparameters and no idea which ones mattered. Mutation rates, crossover rates, network topology settings, and speciation thresholds were all being hand-tuned by my god-given intuition.

Neural Architecture Search (NAS) is supposed to automate my flawed intuition into raw confidence intervals by running hundreds of trials with different parameter combinations and seeing what actually works.

I used Optuna for Bayesian optimization (8807da4). I tested three hardware configurations:

If you're a compute nerd, this is for you

GPU was slower mainly due to granularity, not raw compute. I was evaluating trials one-at-a-time on a single GPU, which meant lots of tiny kernels and frequent CPU-GPU transfers (physics/control loop ping-pong). Transfer latency added ~0.8ms per step, so total runtime was ~14 min vs ~11 min on CPU.

A GPU only beats a CPU when three conditions hold simultaneously: (1) you can batch hundreds+ of creatures per step, (2) the entire inner loop stays on GPU (state, physics, NN, reward - no ping-pong), and (3) kernel launch overhead gets amortized by large batches. My workload violated all three: sequential rollouts, physics on CPU with NN on GPU, and tiny per-step compute that overhead dominated.

Evolutionary algorithms are often CPU-native anyway. CPUs excel at irregular control flow, branching, and many independent long-running tasks. GPUs excel at dense math with regular structure. Most NEAT implementations run on CPU; GPU evo papers almost always massively batch environments or learn policies rather than rollouts. With larger populations (1000+) and a GPU-resident simulation loop, GPU could win. At current population sizes with sequential rollouts, CPUs were the right tool.

CPU parallelization initially failed for two reasons. Optuna/joblib sometimes degraded to near-sequential scheduling for long trials, so throughput was far below expected. Separately, PyTorch oversubscribed cores: each worker process spawned ~64 OpenMP/MKL threads, so multiple workers fought over the same 128 cores, causing heavy context switching (48,000/sec). Fix: OMP_NUM_THREADS=1 (and similar thread limits) inside each worker before importing PyTorch.
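
The fix is just environment variables set early in each worker, something like this (exact variables depend on your BLAS backend):

```python
import os

# must happen before torch / numpy pull in their threaded math libraries
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

import torch
torch.set_num_threads(1)   # keep each worker to one intra-op thread
```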

| Hardware | Configuration | Result |
|---|---|---|
| M3 Max (local) | 12 cores, sequential | ~11 min/trial, reliable |
| T4 GPU (Azure) | CUDA, batched physics | Slower than CPU |
| Azure D128as_v7 | 128 vCPUs, parallel | Failed initially |

Final runs used a CLI I built for the search.

The local NEAT run used 3 seeds per trial for variance estimation (each configuration tested with seeds 42, 123, 456). The VM runs used 1 seed per trial to maximize trial throughput, which means we're more susceptible to lucky seeds (as the reproduction results later show).
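
The Optuna side is a standard objective loop. A sketch with hypothetical parameter names and a placeholder `run_evolution` (the real search space lives in the repo's CLI):

```python
import optuna

def objective(trial):
    params = {
        "mutation_sigma": trial.suggest_float("mutation_sigma", 0.01, 0.5, log=True),
        "use_crossover": trial.suggest_categorical("use_crossover", [True, False]),
        "hidden_size": trial.suggest_int("hidden_size", 8, 32),
        "time_encoding": trial.suggest_categorical("time_encoding", ["none", "sin"]),
    }
    # run_evolution is a stand-in for one full GA run returning best fitness
    return run_evolution(params, generations=200, population_size=200, seed=42)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```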

| Mode | Best fitness | Trials | Seeds | Time |
|---|---|---|---|---|
| Pure NN (VM) | 798.6 | 200 | 1 | ~12 hrs |
| NEAT (VM) | ~400 | 137 | 1 | ~13 hrs |
| NEAT (local) | 441.2 | 100 | 3 | ~48 hrs |

Pure neural networks nearly doubled NEAT's performance on this task. The simple fixed topology beat variable topology. I didn't expect this (more on this later).

I tried to reproduce the top results by running the best configurations again while capturing the full activations and physics frames. 13 reproduction runs (3 Pure, 10 NEAT) using the exact parameters from the top NAS trials:

```bash
# Reproduction run - load params from NAS trial, run in frontend
python cli.py reproduce neat-full 68 \
  --generations 200 \
  --population-size 200

# This loads trial_68.json params and runs the full evolution
# in the web UI, storing results to PostgreSQL for analysis
```

| Trial | NAS best | NAS avg | Repro best | Repro avg |
|---|---|---|---|---|
| Pure #42 (top best) | 798.6 | 81.9 | 420.7 | 58.9 |
| Pure #178 (top avg) | 587.5 | 118.3 | 129.6 | 24.2 |
| NEAT #68 (top best) | 441.2 | 27.1 | 312.9 | 32.6 |
| NEAT #96 (top avg) | 218.2 | 41.7 | | |
| NEAT #57 | 439.5 | 27.6 | 609.5 | 34.2 |
NEAT #57 actually exceeded its NAS result (609.5 vs 439), a lucky seed. But Pure #42 and NEAT #68 fell far short. Pure #178's reproduction was especially disappointing - from 118.3 average down to 24.2. Across all 13 reproduction runs, the best performers were:

| Metric | 1st | 2nd | 3rd |
|---|---|---|---|
| Best fitness | NEAT #57 (609.5) | Pure #165 (330.5) | NEAT #94 (313.8) |
| Best average | NEAT #106 (45.0) | Pure #165 (44.3) | Pure #43 (36.4) |

Genetic algorithms are stochastic. The same hyperparameters with different random seeds produce wildly different results. The NAS found configurations that can achieve high fitness, not configurations that reliably achieve it.

The best creatures collected 8 pellets, but the population mean hovered around 0.3 pellets. Most creatures just flailed in place or crawled in the wrong direction.

I had a SINGLE run where the average creature was able to pick up a single pellet. And it didn't reproduce.

The winners were outliers, not the norm. Best fitness varies wildly with luck, but average fitness never exceeded 100 (one pellet) across all 100 NAS trials.

More counterintuitive results:

  • Pure NN beat NEAT by nearly 2x. Fixed topology outperformed variable topology. Why? Hard to say. Could be compute constraints (NEAT needs more generations to converge). Could be my speciation tuning (threshold too tight or too loose). Could be that topology search is wasted effort when a fixed 7 to 8 to N network is already expressive enough for pellet chasing. The NEAT paper's benchmarks (XOR, pole balancing) are topology sensitive problems where minimal structure matters. Pellet collection might just not be one of those. Or I have bugs. Honestly unclear.
  • Crossover hurts in this search (r = -0.47, p<0.001). The strongest single correlation. Best trials all had use_crossover: False. Mutation-only won. Caveat: this could be confounded with other hyperparameters. The standard explanation is that crossover destroys coordinated weight patterns. Parent A learned one strategy, parent B learned another, and mixing them scrambles both.
  • Time encoding hurts peak fitness (ANOVA F=6.3, p=0.002). Mode 'none': mean 333.8. Mode 'sin': mean 249.3. The network figures out timing on its own. But there's a tradeoff: time_encoding=sin produced better population learning (19% ratio) but lower peak (213 best), while time_encoding=none produced extreme elite dominance (4-7% ratio) but higher peak (441 best). If I wanted whole population learning, I'd use sin encoding and accept lower peak performance.
  • Proprioception hurts (p=0.12, trending). More inputs = higher dimensional search space = harder to optimize.
  • Full initial connectivity dominates (p=0.011). All top 5 trials used initial_connectivity: full. Mean 331.0 vs 272-296 for others.

More raw analysis in the NAS postmortem notebook.

Why genetic algorithms aren't state of the art and this project has little utility

For supervised learning with a differentiable loss function, gradient descent is provably more sample-efficient than evolution. Backprop solves MNIST in minutes with 99%+ accuracy. Deep GA would need 1000s of workers and hours to match. This is worth stating clearly: genetic algorithms are not SOTA for tasks where gradients exist.

So when should you use them?

| Method | When to use |
|---|---|
| Gradient descent | Differentiable loss, supervised learning, sample efficiency matters |
| GA / Evolution strategies | Non-differentiable fitness, black-box optimization, massive parallelism available |
| NEAT | Small networks where topology matters, want to see structure emerge |

Evolution Lab uses GA because the fitness function is effectively a black box. Physics simulation involves discontinuities (contacts, friction regimes), long rollouts, and chaotic dynamics where small parameter changes lead to large outcome differences. Even with simulator internals, differentiating through thousands of unstable timesteps would yield noisy, high-variance gradients. Evolution is simpler and more robust for this regime.

Uber AI's Deep Neuroevolution paper (2017) showed GAs can train networks with millions of parameters. They matched DQN and A3C on Atari in wall-clock time, despite using far more environment samples. The trick: GA is embarrassingly parallel across rollouts (each genome evaluation is independent, no replay buffers or gradient sync), so 1000 workers can compensate for low sample efficiency. Note that Atari doesn't have clean gradients either: DQN uses noisy, bootstrapped estimates, not true reward gradients. GA was competing with noisy RL, not backprop.

The real tradeoff is sample-efficient but complex (RL) vs compute-hungry but simple (GA). DQN extracts learning signal from every timestep and assigns credit to individual actions. GA only sees episode-level return and treats the policy as an indivisible blob. For most control problems, RL wins asymptotically. But for black-box, structure-evolving problems like Evolution Lab, GA trades sample efficiency for robustness and simplicity.

What I learned

Confounding variables are a pain

So many things that should work in theory don't work in practice, and I didn't have time to explore everything. Fitness sharing, speciation, NEAT, different selection strategies... the literature says these help, but I couldn't get consistent improvements. Maybe my implementations were buggy. Maybe the hyperparameters were wrong. Maybe the task is just different enough that the standard advice doesn't apply.

Theoretical details matter

Claude Code is great at writing code. It's not great at telling you when you're implementing an algorithm wrong. The NEAT bugs (wrong mutation rates, wrong crossover alignment, etc) all came from not reading the paper carefully enough.

The best workflow: understand the theory first, then use Claude to implement it. Not the other way around.

Integration testing is gold

One tool that helped was the /integration-stress-test command I built for Claude. When I found a bug, Claude would first reproduce it via a test before attempting a fix.

This makes the entire codebase much more reliable. AI is not good at writing unit tests because it just tests the functionality it wrote with the same cognition as the code it wrote. So it'll often create tests with the same bugs it introduced.

Your environment is your constraint

Instead of hoping evolution learns smooth movement, make smooth movement the only option. This mirrors real biology: joints have limits, tendons only stretch so far. Evolution operates within constraints, it doesn't learn them. The fitness landscape is shaped as much by what's physically impossible as by what's rewarded.

Every time I added a physics constraint, creatures got better. Zero-length muscles led to vibration; add minimum lengths and they started walking. Per-frame output updates caused jitter; add smoothing and they moved deliberately. Each constraint removed a failure mode from the search space. The tradeoff is you might eliminate novel solutions (no catapult mechanics if muscles can't overextend), but removing degenerate solutions is usually worth it.

What's next

There's still a lot I don't understand. Why does crossover hurt? Why does proprioception hurt when it should help?

A great next goal would be to find a configuration that consistently generates populations of creatures that can pick up at least 1 pellet within 150 generations.

I have more experiments I want to try: energy systems (metabolic cost for muscle activation), multi-layer hidden networks, better NEAT crossover alignment by matching muscle innovation IDs, recurrent connections (memory), HyperNEAT (indirect encoding via CPPNs), novelty search, coevolution, interspecies mating, actually figuring out why crossover hurts, and gaining more statistical significance on the best runs.

I could keep pushing, but to be frank I need to free up the few hours a day I was spending on this to work on my other projects.

Maybe someone else will pick up where I left off and make something great (it's open source).

For now, the creatures walk. And exhibit creature-like behaviors. That's something.

Two weeks of staring at blobs. I learned more about genetic algorithms by building this than I would have just reading the papers. Though reading the papers first would have helped a lot.

Code is on GitHub. I'm @silennai on Twitter and my website is silennai.com.

References

  1. Stanley, K.O., & Miikkulainen, R. (2002). Evolving Neural Networks through Augmenting Topologies. Evolutionary Computation, 10(2), 99-127. PDF
  2. Such, F.P., et al. (2017). Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning. Uber AI Labs. arXiv
  3. Stanley, K.O., D'Ambrosio, D.B., & Gauci, J. (2009). A Hypercube-Based Encoding for Evolving Large-Scale Neural Networks. DOI
  4. Lehman, J., & Stanley, K.O. (2011). Abandoning Objectives: Evolution Through the Search for Novelty Alone. Evolutionary Computation, 19(2), 189-223. DOI
  5. Sims, K. (1994). Evolving Virtual Creatures. SIGGRAPH '94. PDF
  6. Goldberg, D.E., & Richardson, J. (1987). Genetic algorithms with sharing for multimodal function optimization. Genetic Algorithms and their Applications.
  7. Akiba, T., et al. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD '19. arXiv
