Building an AI to navigate a maze

Magnus Engström
12 min readMar 8, 2019

A couple of weeks ago I set out to do a hobby project. The ambition is that I would like to create some kind of digital representation of a city block, in which I am able to simulate citizens with different preferences and motives (like going to work, shopping or finding a restaurant). This, of course, is not a type of project that I ever see myself (or anyone else for that matter) completely finishing, but I will try to iterate and increase the scope until complexity becomes to much to handle. Before I reach that point I’m hoping for an interesting and humbling journey, and I will do my best to document any progress.

This project came to be out of three reasons:

  1. I wanted some kind of visual project that would help me sharpen my understanding of neural networks, preferably through direct feedback.
  2. From time to time I like to dabble with basic game development (try Toggle, you can play it in your browser), and it was time to scratch that itch.
  3. I like the concept of genetic programming and evolutionary algorithms, and wanted a sandbox for trying it out.

I figured that the first step would be to do a pathfinding experiment, and I will go through the result of that work in this blog post. I posted a video showcasing the outcome, and before we dive into the various parts that make up this experiment I strongly recommend that you watch it.

Using AI to navigate a maze

The concept of this first pathfinding experiment is quite simple.

  1. A swarm of agents move randomly in a maze, starting from a predefined starting point.
  2. A soon as an agent hits a wall that agent is deactivated.
  3. When there are only a few active agents left the simulation stops.
  4. When the simulation stops, data from the agents with the shortest path left to the target area is used to train a neural network model.
  5. When the simulation restarts the agents will tell the model about what they “see”, and the model will give instructions about how to move in the environment.
  6. The longer time an agent lives, the higher probability of a mutation. This means that the agent will make some random movements.
  7. Sometimes a random movement places an agent closer to the target area. The machine learning model will learn from this and get better at navigating the maze (like evolution).
  8. When the AI is able to go from start to finish the starting point is moved to somewhere else in the maze. If the AI have truly learned how to make decisions based on the environment, rather than just memorizing when to do turns, a new starting position should not be a problem.

Building the environment (the maze)

For the frontend parts (the maze and the agents) I used JavaScript.

For this first project I figured that using a maze-like environment would be a good start. The final version was a simple maze, but it still had enough complexity to ensure that the neural network would have to learn about the environment before managing to navigate all the way through.

While working with the design of the maze I soon came to realize that making the maze progressively harder closer to the target area had a bad effect on the learning curve of the machine learning model. Placing the tricky parts closer to the starting point clearly generated better results.

Also, I made sure that to be able to complete the maze the AI would need to navigate in all directions.

The grid as an array and the rendered maze. 0=empty area, 1=wall, 2=spawnpoint, 3=target

The obvious go-to solution for building a simple maze is to use a grid system. The very simple concept for this is that a function goes through a table, row for row and column for column, of integers and convert every entry to a object in code.

Generating the agents

A short note on terminology: What is a good name to call a citizens of the maze world? Since it’s a maze, “player” would work. Since we are dealing with a swarm we could also go with “swarm member”. When talking about machine learning the context in which a model operates is often called an “environment”, and the representation of the model in this environment is called an “agent”. I decided to go with the later.

When doing rendering and calculations every update/cycle/iteration can be referred to as a frame. The world in which the agents live runs at 60 frames per second. Since I wrote the frontend parts in JavaScript there is a nice browser function called requestAnimationFrame() that can be used for running the cycles. The best part of using this function is that it is only active when the browser window is open, which means that I don’t have to worry about accidentally forgetting to shut down the simulation and draining the memory of my laptop.

doCycle = function() {

Every time the simulation restarts 50 agents are rendered at the starting position. The default state of every agent is to move at a random direction, but after the first restart of the simulation the instructions from the AI often takes priority. If the AI is unable to do a reasonable prediction of where to move, or if the agent is getting old, the agent will fall back on doing random movements. This is how mutations happens, and by studying the cases when a lucky mutation makes the agent perform better the AI is able to learn new things.

When an agent hits a wall that agent is deactivated. When this happens something called the A* algorithm is used for evaluating the performance. This is a very useful type of algorithm that works especially well for measuring distances in simple grid-like environments When the simulation stops the agents with the best score from using the A* algorithm are marked as the agents who will be used as training data for the machine learning model.

In this figure the best performing agent is 28 steps away from the target area.

Collecting data

Every agent logs all individual movements as long as it is active. All logs from well performing agents are used as training data for the machine learning model before the start of a new simulation cycle. A log is made up of rows of data that describes the life of an agent. There are two parts of every row: sensor data that describes the environment as the agent experienced it and a directional value that tells in what direction the agent was moving.

Collecting sensor data and detecting collisions

Data is collected through sensor data. 60 times a second (once per frame) every agent cast rays in eight directions (up, down, left, right and all diagonals). Only the closest ray intersection is stored in the agent’s log, meaning that the agent is not able to see through walls.

In this log example the agent is moving in the left direction (0). The sensor data indicates that the agent is getting closer to a wall to the left, and further away from a wall to the right.

To add some detail into this: The solution I choose for getting the correct distance measurement on each ray is by checking for line intersections. The maze is made up of a grid, and every tile of the grid is represented by a square. Every square have for sides, and every ray from every agent is checked for intersections with those lines on every frame cycle. After iterating through all possible intersections the intersection closest to the agent is used to set the distance for each ray.

rayIntersection = function(x1, y1, x2, y2, x3, y3, x4, y4) {
if ((x1 === x2 && y1 === y2) || (x3 === x4 && y3 === y4)) return false
denominator = ((y4 - y3) * (x2 - x1) - (x4 - x3) * (y2 - y1))
if (denominator === 0) return false
let ua = ((x4 - x3) * (y1 - y3) - (y4 - y3) * (x1 - x3)) / denominator
let ub = ((x2 - x1) * (y1 - y3) - (y2 - y1) * (x1 - x3)) / denominator
if (ua < 0 || ua > 1 || ub < 0 || ub > 1) return false
let x = x1 + ua * (x2 - x1)
let y = y1 + ua * (y2 - y1)
return {x, y}

This is how the JavaScript code to find a ray intersection looks. The ray starts at x1,y1 (the position of the agent) and ends at x2,y2 (a position far outside of the maze, just for safe margins). The line the ray is being checked against starts at x3,y3 and ends at x4,y4 (a side of a square). This is more or less a copy-paste from Stack Overflow

The reason for doing these compute heavy calculations instead of a simpler approach is that I can see the project in time evolving to use more complex polygonal shapes than only perfect squares.

Sensor data. The agents are only able to see and record the environment through eight directions, which gives a very low granularity (but good enough for simple navigation).

Collision detection (thankfully) comes as bonus from how the data is collected. To detect if an agent is colliding with a wall the only thing needed is to check if any sensor is returning a value close to zero. If a sensor is expected to reach zero within a few frames the agent is deactivated.

Because of the low data resolution (eight data points) it is entirely possible that some obstacles stays undetected in a dead angle until the agent is fairly close to it. To avoid this in the future I’m experimenting with a radar approach for the next iteration, in which measuring is done by using a sweeping ray rather than the static directions I use now. This will however impact performance (measuring 360 directions instead of 8), and to accomplice this I might decide to implement some kind of frame skipping. The effect a radar approach would be fewer samples (rows in the agent’s log) but higher granularity per sample.

Directional sensors are able to see five blocks in this figure. A radar would be able to see three additional blocks.

Machine learning

For the backend parts (the machine learning and the web app to serve the frontend parts) I used JavaScript. The neural network is designed using Tensorflow and Keras.

Everything regarding setting up the actual simulation was pretty straight forward. It does take some planning of the system design to make everything in its right position (and in a timely manner) and to handle game specific things like collision detection. But in the end it just comes down to tweaking all parameters until a good enough-state is reached.

To actually choose the most appropriate ML-model and then set it up can be somewhat tedious. Approaching this from the wrong direction can take much time and still yield no good results. Initially I was thinking about using a SVM (Support Vector Machine) model. But the complexity of the dataset, desired output and having the model learn over time (so called online learning, also possible with SVM but harder to evaluate) led me towards using a neural network. I will talk more about the output data later in this article.

Before doing this project I didn’t have much experience with artificial neural networks, and this field holds a broad range of different types of architectures.


Choosing the right approach took a lot of reasoning. To give an example: It can be argued that this type of training data can be seen as time series data. Let’s say that an agent is created at time zero, and every time a new row is added to the data log the aggregated number of milliseconds since the time of creation is added in a column on that row. By doing this a neural network model well suited for time series data (like LSTM) can learn the exact timing for changing direction. “When 4000 milliseconds have passed I will turn right.”. This would probably result in good performance when starting at the initial spawn point, but as soon as I change either the starting point or the position of the goal the timing will be off, and whatever the modell learned about 4000 milliseconds will no longer hold true. I’m over simplifying a lot here (a LSTM could for example learn in discrete steps instead of a continues flow), but in any case I decided to go with a simple approach.

The model I ended up with is a feedforward neural network that is presented with sensor data from an agent and tries to predict the best direction for that agent in its current environment. In short, every time the model is asked to decide on a direction for an agent the data sent to the model only tells what the agent have “seen”, nothing about how it have been moving. Things like speed and acceleration is left out of the data provided to the model. This works surprisingly well however, with one caveat: since the data doesn’t tell anything about the speed of the agent the model sometimes predicts the completely opposite direction. To handle this I have added some friction into the agents movements controller so that the agents always prefer to continue forward or turn at maximum 90 degrees.

To get a better understanding of the data these plots are useful. The figure to the far right shows the different directions used to navigate the maze from start to finish. The other two figures shows how these directions correlate to sensor data values.

The first thing to note about designing neural networks is that if the data is linearly separable there is no need for any hidden layers. This means that if I’m able to plot my training data and separate the different directions (my labels) by straight lines a input layer (where the training data enters the model) and a output layer (where the predictions are outputted) would be enough. I feel that it is not unreasonable to think that data representing dots moving on a flat surface could be represented linearly by splitting the data in different sets and use multiple models for predicting different directions. But since I wanted to use only one model for predicting all possible directions (up, down, left right) at once it felt logical to include hidden layers instead of trying to split the dataset over several models.

I ended up using two hidden layers. The upside of having two hidden layers is that it can represent any function and it reduces the need for having many neurons in a single hidden layer. The downside is that it adds to the complexity, training takes longer time and there is a risk of overfitting.

The layers in the network, as they are configured in Tensorflow/Keras and as a simple graph. The first layer (input) takes eight data points (left, left-up, up, up-right, right, right-down, down, down-left) and the last layer (output) gives estimations for four directions (left, up, right, down).

The output

To decide which direction to travel is a classification problem. As mentioned earlier in this article the labels sent in with the training data (the features) are integers, where every number represents a direction. After experimenting a bit with the different types of classifiers I chose sparse categorical crossentropy as output. This is a multi class approach, and the predicted probabilities for every class are presented in an array. Say for example that the right direction for an agent is left, and that the class “left” is represented as the number zero (0) when presented as a label. In that case, to be accurate, the model should return the highest value for position zero in the output array.

On this figure: The input data shows that the agent have good margins to the left and up, and the neural network correctly outputs left as the best direction to go.

Batch sizes

What is a good batch size for this type of training? I did some experimentation with finding a dynamic value for batch sizes (like, for example, the number of rows in the training data divided by the number of agents), but in the end I just settled with a nice round number of 100 rows per batch, which seemed to give a reasonable result.


Initially I tried to keep the number of epochs down to avoid overfitting, bu

t after doing some research I instead applied an early stopping function as a callback in the fitting function of the model.

early_stoppage = EarlyStopping(
), y, batch_size=100, epochs=15, validation_split=0.2, callbacks=[early_stoppage])

By doing this, and setting the number of epochs to 15, the training stops as soon as the loss starts to increase or when the hard limit of 15 epochs is reached. In most cases when running the simulation the model does somewhere around 7–8 epochs per training set.

Multiple models

The risk of getting stuck at a local optima is very high in a context like this, and to get around this I have used several models working togheter.

Model 1 stops making progress, and model 2 takes over using some of the best training data used for model 1. When model 2 also stops making progress a new model (model 3) is generated and starts from scratch. After awhile the knowledge acquired from model 2 and model 3 are transmitted to model 1.

I’m truly impressed of you made it all the way down here, thank you for reading!