Native ML Module for CircuitPython: Running Models from Flash Memory Without Recompilation

In my previous article, I shared a workflow for running CNNs in CircuitPython by transpiling models to C and compiling them directly into the firmware. While that approach delivered high performance, it had one major drawback: a lack of flexibility. If you wanted to tweak a single layer or swap models, you had to rebuild and re-flash the entire firmware.

Today, I’m introducing a more dynamic approach: a native CircuitPython module that allows you to load and run ML models directly from the filesystem (flash memory). No recompilation, no flashing—just drop a .bin file onto your CIRCUITPY drive and run inference.

Why This is a Game Changer

This extension transitions CircuitPython from a “static” ML environment to a “dynamic” one.

  • Model Hot-Swapping: Switch between different digit classifiers or gesture recognizers on the fly.
  • Rapid Prototyping: Update your model weights via a simple USB drag-and-drop.
  • Memory Efficient: Models reside in flash, keeping the limited RAM available for your Python logic.

While it currently supports a tiny subset of operations, it is already powerful enough for Dense networks and small CNNs, opening the door for significant experimentation on the RP2040.

How to Get Started

The source code and detailed documentation are available on GitHub:
👉 code2k13/cp-cnn-extension

Quick Access: Pre-built Binaries

If you don’t want to set up the ARM GCC toolchain, I have provided pre-built CircuitPython binaries with this extension already baked in; see the GitHub repository above for the downloads.

Supported Operations

To keep the footprint small and the execution fast, the library currently focuses on the core building blocks of computer vision and classification:

  • Conv2D: Standard 2D convolution.
  • MaxPool2D: For spatial downsampling.
  • Flatten: Moving from spatial features to classification heads.
  • ReLU: The standard activation function.
  • Dense (Linear): Fully connected layers.
  • Softmax: For final probability distributions.

The Workflow: From PyTorch to Flash

To run a model, we need to convert it into a flat binary format (a “blob”) that the C-extension understands.

1. Defining the Model

The key is to keep the model small. In the generate_model_sm.py example, we define a simple network:

import torch.nn as nn
import torch.nn.functional as F

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        # 1 input channel -> 4 feature maps; padding=1 keeps the 30x30 spatial size
        self.conv1 = nn.Conv2d(1, 4, kernel_size=3, stride=1, padding=1)
        # Classification head: 4 x 30 x 30 flattened features -> 10 classes
        self.fc1 = nn.Linear(4 * 30 * 30, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = x.view(-1, 4 * 30 * 30)  # flatten spatial features for the dense layer
        x = self.fc1(x)
        return x
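
Before exporting, it helps to sanity-check the shapes with a dummy forward pass. This is a quick check of my own, continuing from the class above:

import torch

model = SimpleModel()
model.eval()
dummy = torch.randn(1, 1, 30, 30)  # NCHW: batch, channels, height, width
print(model(dummy).shape)          # expected: torch.Size([1, 10])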

2. Exporting to ONNX

When exporting, look for this specific output in the conversion logs:

model.onnx generated cleanly in native NCHW format. No Transpose nodes!

This is critical. Standard PyTorch models often introduce “Transpose” or “Reshape” nodes that add overhead. By ensuring a clean NCHW (Channels-First) export, the microcontroller can process data linearly without extra memory-shuffling.
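
For reference, a minimal export call looks something like this. It is a sketch using the standard torch.onnx.export API; the repo’s generate_model_sm.py remains the authoritative version, and its exact options may differ:

import torch

model = SimpleModel()
model.eval()
dummy = torch.randn(1, 1, 30, 30)  # NCHW dummy input, matching the model above
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)

Because PyTorch operates in NCHW natively, a direct export like this keeps the graph free of layout-conversion nodes.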

3. Generating the Blob

We use the provided onnx_to_blob.py tool to convert the .onnx file into a .bin file. This script also saves a sample input, allowing you to verify the model on the device later.
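
The invocation is along these lines (the arguments shown are hypothetical; check the repository README for the exact command):

# Hypothetical arguments; see the repository README for the exact command.
python onnx_to_blob.py model.onnx model.bin

Copy the resulting model.bin, along with the saved sample input, onto the CIRCUITPY drive.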

Full-Fledged Example: MNIST Digit Recognition

To see this in action, follow the sequence in the repository:

  1. Train: Use train_mnist.py to train a digit classifier.
  2. Convert: Run the onnx_to_blob.py script.
  3. Deploy: Use mnist.py on your Pico to load the blob and classify digits (a rough sketch of this step follows below).
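
To give a feel for the device side, here is a rough sketch of step 3. The module and function names (ml_infer, load_model, infer, unload_model) and the sample-input filename are hypothetical placeholders; the real API is in mnist.py and the repository docs.

import ml_infer  # hypothetical module name; see mnist.py for the real import

model = ml_infer.load_model("/model.bin")    # blob read straight from flash
with open("/sample_input.bin", "rb") as f:   # sample input saved by onnx_to_blob.py (filename illustrative)
    sample = f.read()

scores = model.infer(sample)                 # hypothetical call returning class scores
ml_infer.unload_model(model)                 # free the model's RAM when done
print("Predicted digit:", max(range(len(scores)), key=lambda i: scores[i]))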

Lessons Learned: The Battle with Memory Corruption

Developing this was not without its challenges. While working on the display example, I encountered strange memory corruption issues.

The inference would return garbage results as soon as I added complex UI code to the main loop. I discovered that I had to carefully isolate UI modifications into separate functions. Simply having the code inline seemed to interfere with the memory alignment or the stack used by the native library. If you are seeing inconsistent output, try simplifying your Python-side logic and keeping the ML invocation as “isolated” as possible.

Figure: Pico W per-sample loop flow. It shows the RAM-aware loop: load the .rgb file into a bitmap, show the image on the LCD, free the bitmap and tile, call gc.collect() to reclaim RAM, run inference (load model → infer → unload), then recreate the bitmap, update the labels, sleep, and repeat for the next sample.
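
In code, that loop looks roughly like the following. This is a sketch only: every helper is a hypothetical placeholder for the repository’s display example, and each UI step is kept in its own function per the isolation lesson above.

import gc
import time

# All helpers below are hypothetical stand-ins for the repo's display example;
# they only mark where the real work happens.

def show_image(path):
    """Load the .rgb file into a bitmap and show it on the LCD (placeholder)."""
    print("showing", path)

def free_display_resources():
    """Drop bitmap/tile references so gc.collect() can reclaim the RAM (placeholder)."""
    pass

def run_inference(path):
    """Load the model blob, infer, unload the model, return the result (placeholder)."""
    return 0

def show_result(label):
    """Recreate the bitmap and update the result labels (placeholder)."""
    print("predicted:", label)

while True:
    for sample in ("/samples/0.rgb", "/samples/1.rgb"):
        show_image(sample)              # load .rgb -> bitmap, show on LCD
        free_display_resources()        # free bitmap + tile
        gc.collect()                    # reclaim RAM before inference
        result = run_inference(sample)  # load model -> infer -> unload
        show_result(result)             # recreate bitmap, update labels
        time.sleep(1.0)                 # pause, then repeat with the next sample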

Next Steps

The next logical step is real-time vision. I attempted to integrate an OV7670 camera, but unfortunately, my unit failed to power on during testing. Once I secure a replacement, I will post an update with live camera inference.

I hope the team at CircuitPython sees the potential here. Having a standardized circuitpython_ml module would be a massive leap forward for the community.

Until then, happy coding!