
Explaining RNNs without neural networks

Terence Parr
Terence is a tech lead at Google and a former professor of computer/data science in the University of San Francisco's MS in Data Science program. You might know him as the creator of the ANTLR parser generator.

Vanilla recurrent neural networks (RNNs) form the basis of more sophisticated models, such as LSTMs and GRUs. There are lots of great articles, books, and videos that describe the functionality, mathematics, and behavior of RNNs so, don't worry, this isn't yet another rehash. (See below for a list of resources.) My goal is to present an explanation that avoids the neural network metaphor, stripping it down to its essence—a series of vector transformations that result in embeddings for variable-length input vectors.

My learning style involves pounding away at something until I'm able to re-create it myself from fundamental components. This helps me to understand exactly what a model is doing and why it is doing it. You can ignore this article if you're familiar with standard neural network layers and are comfortable with RNN explanations that use them as building blocks. Since I'm still learning the details of neural networks, I wanted to (1) peer through those layers to the matrices and vectors beneath and (2) investigate the details of the training process. My starting point was Karpathy's RNN code snippet associated with The Unreasonable Effectiveness of Recurrent Neural Networks. I then absorbed details from Chapter 12 of Jeremy Howard and Sylvain Gugger's book Deep Learning for Coders with fastai and PyTorch and Chapter 12 of Andrew Trask's Grokking Deep Learning.

In this article, I hope to contribute a simple and visually-focused data-transformation perspective on RNNs using a trivial data set that maps words for "cat" to the associated natural language. The animation on the right was taken (and sped up) from a YouTube clip I made for this article. For my actual PyTorch-based implementations, I've provided notebooks that use a nontrivial data set mapping family names to natural languages. These links open my full implementation notebooks at Colab:

Table of contents

I've broken up this article into two main sections. The first section tries to identify how an RNN encodes a variable-length input record as a fixed-length vector by reinventing the mechanism in baby steps. The second section is all about minibatching details and vectorizing the gradient descent training loop.
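To make the "variable-length input to fixed-length vector" idea concrete before the details, here is a minimal sketch in plain Python rather than the PyTorch used in my notebooks. It assumes the standard vanilla RNN update h = tanh(W h + U x) with no bias term, one-hot character vectors, and random untrained weights. The names W, U, nhidden, and encode(), and the three sample words, are illustrative choices for this sketch only.

```python
import math
import random

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def encode(word, vocab, W, U):
    """Fold a variable-length word into a fixed-length vector h.

    Each step applies h = tanh(W h + U x), where x is the one-hot
    vector for the current character (assumed update, no bias term).
    """
    h = [0.0] * len(W)  # hidden state starts as the zero vector
    for ch in word:
        x = one_hot(vocab[ch], len(vocab))
        Wh = matvec(W, h)  # transform the previous state
        Ux = matvec(U, x)  # pick out the column of U for this character
        h = [math.tanh(a + b) for a, b in zip(Wh, Ux)]
    return h

# Hypothetical toy setup: random (untrained) weights, hidden size 4.
random.seed(0)
words = ["cat", "chat", "gato"]
vocab = {ch: i for i, ch in enumerate(sorted(set("".join(words))))}
nhidden = 4
W = [[random.gauss(0, 0.1) for _ in range(nhidden)] for _ in range(nhidden)]
U = [[random.gauss(0, 0.1) for _ in range(len(vocab))] for _ in range(nhidden)]

for w in words:
    h = encode(w, vocab, W, U)
    print(w, len(h))  # every word yields a vector of length nhidden
```

Whatever the word length, the result has nhidden elements, so a later stage can treat it like any other fixed-size feature vector. Training, which this sketch leaves out, is what makes those elements meaningful.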

Implementation details and concepts I learned

As I tried to learn RNNs, my brain kept wondering about the implementation details and key concepts, such as what exactly was contained in the hidden state vector. My brain appears to be so literal that it can't understand anything until it sees the entire picture in depth. For those in a hurry, let me summarize some of the key things I learned by implementing RNNs with nothing but matrices and vectors. The full table of contents for the article appears below.


First off, if you are new to deep learning, check out Jeremy Howard's full course (with video lectures) called Practical Deep Learning for Coders. As for recurrent neural networks in particular, here are a few resources that I found useful:


I'd like to thank Yannet Interian, also on the faculty of the University of San Francisco's MS in Data Science program, for acting as a resource and pointing me to relevant material. Andrew Shaw and Oliver Zeigermann also answered a lot of my questions and filled in lots of implementation details.