Backpropagating through a maze with candle and WASM
This demo uses gradient descent to solve a discrete maze.
Try playing with the hyperparameters to see how they affect the optimization process!
No neural network involved: logits are directly optimized, from a random initialization, for each maze.
This runs entirely on your local device, thanks to candle and Rust's support for WebAssembly. You can disconnect from the Internet after loading this demo and you will still be able to use it!
Appearances can be deceiving: on larger, harder grids, you might find that the optimization spends most of its time apparently "stuck", before escaping in a dramatic phase transition. Beware! And perhaps try increasing the step count.
Additional Details
This demo solves maze navigation by relaxing the discrete problem into a stochastic formulation that happens to be end-to-end differentiable.
Since every operation is differentiable, we use backpropagation with standard automatic differentiation (i.e. candle's autograd, which runs client-side) to directly optimize action logits, without relying on e.g. the REINFORCE algorithm, Q-learning, Monte Carlo rollouts, or any sort of neural network.
State Representation
There is no "current position" in a discrete sense.
Instead, at any given point, there is a probability distribution over possible positions.
The agent's state is a 2D probability grid \( s_t \in \mathbb{R}^{H \times W} \) where:
- \( s_t[r,c] \) is the probability of being at row \( r \), column \( c \)
- \( \sum_{r=0}^{H-1} \sum_{c=0}^{W-1} s_t[r,c] = 1 \)
This allows us to apply gradient-based optimization to this inherently discrete problem. By smoothly varying the action distribution, we can smoothly increase our probability of success.
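For the curious, here is a minimal candle sketch of what building \( s_0 \) might look like. The state is kept as a flattened \( 1 \times HW \) row vector so that a transition step later becomes a single matrix multiplication; the helper name and exact calls are illustrative, not the demo's verbatim source:

```rust
use candle_core::{Device, Result, Tensor};

/// Build s_0: all probability mass on the start cell (row 0, column 0).
/// The grid is flattened row-major into a 1 x (H*W) row vector.
fn initial_state(h: usize, w: usize, device: &Device) -> Result<Tensor> {
    let mut data = vec![0f32; h * w];
    data[0] = 1.0; // the top-left cell has index 0 in row-major order
    Tensor::from_vec(data, (1, h * w), device)
}
```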
Action Space
At each time step, there are five possible actions:
\[ \mathcal{A} = \{\text{up}, \text{right}, \text{down}, \text{left}, \text{noop}\}. \]
The \( \text{noop} \) action, which represents staying in place, is important in this simple demo.
Without it, the agent would be forced to move at every time step, even once the goal had been reached.
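In code, the action set is just a small enum with its movement offsets (a sketch, not the demo's exact definitions):

```rust
/// The five actions and their (row, column) offsets.
#[derive(Clone, Copy, Debug)]
enum Action {
    Up,
    Right,
    Down,
    Left,
    Noop,
}

impl Action {
    const ALL: [Action; 5] = [
        Action::Up,
        Action::Right,
        Action::Down,
        Action::Left,
        Action::Noop,
    ];

    /// (delta_row, delta_col) applied to the current cell.
    fn delta(self) -> (i64, i64) {
        match self {
            Action::Up => (-1, 0),
            Action::Right => (0, 1),
            Action::Down => (1, 0),
            Action::Left => (0, -1),
            Action::Noop => (0, 0),
        }
    }
}
```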
Transition Dynamics
For each action \( a \), we pre-compute transitions between grid positions: \[ T^{(a)}_{(r,c) \to (r',c')} = \begin{cases} 1 & \text{if action } a \text{ moves from } (r,c) \text{ to } (r',c') \\ 0 & \text{otherwise} \end{cases} \]
The actions map to movements:
- up: \( (r,c) \to (r-1,c) \)
- right: \( (r,c) \to (r,c+1) \)
- down: \( (r,c) \to (r+1,c) \)
- left: \( (r,c) \to (r,c-1) \)
- noop: \( (r,c) \to (r,c) \)
Actions blocked by a wall or the grid boundary loop back onto the same cell: \( T^{(a)}_{(r,c) \to (r,c)} = 1 \).
Yes, this matrix is enormous, and it's almost entirely zeros.
Why use it, then?
Because it allows us to apply actions using matrix multiplication, and thus to keep autograd happy with minimal hassle.
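Here is roughly how those per-action matrices could be assembled. The `Maze` type and its `is_open` helper are hypothetical stand-ins for however you represent walls; `is_open` is assumed to return false for moves that would leave the grid:

```rust
use candle_core::{Device, Result, Tensor};

/// Hypothetical maze representation: `is_open(r, c, a)` reports whether action
/// `a` from cell (r, c) is allowed (not a wall, not off the grid).
struct Maze {
    h: usize,
    w: usize,
    open: Vec<[bool; 5]>, // per cell, one flag per action; noop is always open
}

impl Maze {
    fn is_open(&self, r: usize, c: usize, a: usize) -> bool {
        self.open[r * self.w + c][a]
    }
}

/// One dense (H*W) x (H*W) matrix per action: row = source cell, column =
/// destination cell. Blocked actions keep the probability mass in place.
fn transition_matrices(maze: &Maze, device: &Device) -> Result<Vec<Tensor>> {
    let (h, w) = (maze.h, maze.w);
    let n = h * w;
    // (delta_row, delta_col) for up, right, down, left, noop.
    let deltas: [(i64, i64); 5] = [(-1, 0), (0, 1), (1, 0), (0, -1), (0, 0)];
    let mut out = Vec::with_capacity(5);
    for (a, (dr, dc)) in deltas.iter().copied().enumerate() {
        let mut m = vec![0f32; n * n];
        for r in 0..h {
            for c in 0..w {
                let src = r * w + c;
                let dst = if maze.is_open(r, c, a) {
                    ((r as i64 + dr) as usize) * w + (c as i64 + dc) as usize
                } else {
                    src // walls and boundaries loop back onto the same cell
                };
                m[src * n + dst] = 1.0;
            }
        }
        out.push(Tensor::from_vec(m, (n, n), device)?);
    }
    Ok(out)
}
```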
Direct Logit Parameterization
We use position-independent action probabilities that produce position-dependent behavior through masking.
Our parameters are time-dependent logits: \[ \theta_t \in \mathbb{R}^{|\mathcal{A}|} \]
This is a tensor of shape \( T \times |\mathcal{A}| = T \times 5 \), where \( T \) is the maximum number of time steps. It's represented by the grid that you see evolve on the right-hand side of the maze above.
The parameters \( \theta_t \) don't know anything about positions. They're the same for every cell in the maze at time \( t \).
Does this seem like it shouldn't work? I would agree! I definitely would not parameterize things this way in a serious setting.
For this demo, it is enough to show interesting behavior.
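For reference, initializing these logits with candle takes only a couple of lines; wrapping the tensor in a `Var` is what tells autograd to track gradients for it. The \( \mathcal{N}(0, 0.1) \) initialization matches the Algorithm section below; the helper itself is a sketch:

```rust
use candle_core::{Device, Result, Tensor, Var};

/// One row of 5 logits per time step, drawn from N(0, 0.1).
/// This `Var` is the only trainable tensor in the whole demo.
fn init_logits(t_max: usize, device: &Device) -> Result<Var> {
    let init = Tensor::randn(0f32, 0.1f32, (t_max, 5), device)?;
    Var::from_tensor(&init)
}
```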
State Evolution
At each timestep, probability mass flows according to: \[ s_{t+1}[r',c'] = \sum_{r=0}^{H-1} \sum_{c=0}^{W-1} \sum_{a \in \mathcal{A}} s_t[r,c] \cdot \pi_t(a|r,c) \cdot T^{(a)}_{(r,c) \to (r',c')} \]
In other words: the probability of being at position \( (r',c') \) next is the sum over all ways to get there. For each possible starting position \( (r,c) \), we take the probability \( s_t[r,c] \) of being there, multiply by the probability \( \pi_t(a|r,c) \) of taking action \( a \), and multiply by \( T^{(a)}_{(r,c) \to (r',c')} \) which is 1 if action \( a \) moves you from \( (r,c) \) to \( (r',c') \) and 0 otherwise.
Here, position dependence emerges (we do need it somewhere): we apply a masked softmax using the same \( \theta_t \) at every position, but only over valid actions: \[ \pi_t(a|r,c) = \begin{cases} \frac{\exp(\theta_t[a])}{\sum_{a' \in V_{r,c}} \exp(\theta_t[a'])} & \text{if } a \in V_{r,c} \\ 0 & \text{otherwise} \end{cases} \]
where \( V_{r,c} \) is the set of valid (non-wall-blocked) actions from position \( (r,c) \).
This creates an effective position-dependent policy \( \pi_t(a|r,c) \) from position-independent parameters \( \theta_t \).
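A single relaxed transition can be written with a handful of candle ops, so the whole rollout stays differentiable. In the sketch below (shapes and helper names are my own conventions, consistent with the snippets above), `state` is the flattened \( 1 \times HW \) distribution, `theta_t` holds the 5 logits for this time step, `mask` is an \( HW \times 5 \) matrix of valid-action indicators, and `trans` is the list of per-action transition matrices:

```rust
use candle_core::{Result, Tensor};

/// One step of the relaxed dynamics:
///   s_{t+1} = sum_a (s_t ⊙ pi_t(a|·)) · T^(a)
fn step(state: &Tensor, theta_t: &Tensor, mask: &Tensor, trans: &[Tensor]) -> Result<Tensor> {
    // Masked softmax: the same 5 logits are used in every cell, but invalid
    // actions are zeroed out before normalizing each cell's distribution.
    // (A hardened version would subtract the max logit before exponentiating.)
    let weights = theta_t.exp()?; // shape (5,)
    let masked = mask.broadcast_mul(&weights)?; // (H*W, 5)
    // Since noop is always valid, every row has a nonzero sum.
    let pi = masked.broadcast_div(&masked.sum_keepdim(1)?)?; // rows sum to 1

    let mut next = state.zeros_like()?;
    for (a, t_a) in trans.iter().enumerate() {
        // Probability of being in each cell AND picking action a there...
        let weighted = state.mul(&pi.narrow(1, a, 1)?.t()?)?; // (1, H*W)
        // ...pushed through that action's transition matrix.
        next = (next + weighted.matmul(t_a)?)?;
    }
    Ok(next)
}
```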
Learning Objective
We simply maximize the probability of reaching the goal (the bottom-right cell) after \( T \) steps: \[ \mathcal{L} = -s_T[H-1, W-1] \]
The time horizon adapts to maze difficulty: \( T = 2 \times \text{shortest\_path\_length} \).
Algorithm
- Initialize \( s_0[0,0] = 1 \) and \( s_0[r,c] = 0 \) elsewhere (start at top-left)
- Initialize parameters \( \theta_1, \ldots, \theta_T \sim \mathcal{N}(0, 0.1) \)
- For each gradient step:
  - Roll \( s_0 \) forward through the \( T \) transition steps described above to obtain \( s_T \)
  - Compute the loss \( \mathcal{L} = -s_T[H-1, W-1] \)
  - Update the parameters via the Adam optimizer
  - Stop if the probability of reaching the goal exceeds a predefined threshold.
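Put together with the helpers sketched above, the training loop is short. candle_nn provides AdamW, which with zero weight decay behaves like plain Adam; the learning rate, iteration cap, and 0.99 stopping threshold below are illustrative choices rather than the demo's actual hyperparameters:

```rust
use candle_core::{IndexOp, Result, Tensor, Var};
use candle_nn::{AdamW, Optimizer, ParamsAdamW};

/// Optimize the logits so that the goal-cell probability after T steps is high.
fn optimize(
    theta: &Var,      // (T, 5) logits, the only trainable tensor
    mask: &Tensor,    // (H*W, 5) valid-action mask
    trans: &[Tensor], // five (H*W, H*W) transition matrices
    s0: &Tensor,      // (1, H*W) initial state
    goal_idx: usize,  // flattened index of the bottom-right cell
    t_max: usize,
) -> Result<()> {
    // AdamW with zero weight decay is equivalent to Adam.
    let params = ParamsAdamW { lr: 0.1, weight_decay: 0.0, ..Default::default() };
    let mut opt = AdamW::new(vec![theta.clone()], params)?;
    for iter in 0..500 {
        // Forward pass: roll the distribution through all T relaxed steps.
        let mut state = s0.clone();
        for t in 0..t_max {
            let theta_t = theta.i(t)?; // the 5 logits for time step t
            state = step(&state, &theta_t, mask, trans)?;
        }
        // Loss: minus the probability mass sitting on the goal cell.
        let p_goal = state.i((0, goal_idx))?;
        let loss = p_goal.neg()?;
        opt.backward_step(&loss)?;

        let p = p_goal.to_scalar::<f32>()?;
        println!("step {iter}: P(goal) = {p:.3}");
        if p > 0.99 {
            break; // close enough: the relaxed agent almost surely reaches the goal
        }
    }
    Ok(())
}
```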
Just give me the code
This demo's code is available on GitHub, in case you, too, wish to do client-side backpropagation in Rust.