
Dogberry – A Micro Language Model for ESP32


We have been playing with the ESP32 a bit lately, specifically the LilyGo T-Display-S3R8, and it has been great for a number of projects, but we wanted to see how far it could be pushed. So after a few long nights trying to hit the right balance of training data, Dogberry AI was created: a micro language model capable of running on an ESP32. It is trained on Shakespeare’s comedies with an emphasis on the voice of Constable Dogberry from Much Ado About Nothing. His character tends to mess up words, is a bit overconfident, and leans heavily on malapropisms. The hope was that these traits would suit a small language model like this, so that the errors and foibles of the model feel more like a character quirk than a misunderstanding or limitation of the AI.

To add a bit more fun, we created a Bluesky account for Constable Dogberry so we can all chat with him: just make a post that includes “@constabledogberry.bsky.social” and you will get a reply within a couple of minutes, run entirely on the ESP32.

The code is available on GitHub, and the full write-up follows below.

Hardware: What We’re Working With

ESP32-S3 (LilyGo T-Display-S3)

  • 240MHz dual-core processor
  • 8MB PSRAM (our primary constraint)
  • 16MB Flash storage
  • 320KB SRAM

For perspective: your phone has ~6GB RAM. We have 8MB. The model needs to fit entirely in memory.

The Architecture

We built a word-level LSTM language model—think GPT’s distant cousin who lives in a tiny apartment.

40-word context window
    ↓
Embedding (4000 vocab → 64 dimensions)
    ↓
LSTM (256 memory cells)
    ↓
Dense layer (256 → 4000 word probabilities)
    ↓
Sample next word with temperature=0.8

1.6 million parameters, 6.15MB—small enough to fit with room to spare.
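
For the curious, those two numbers fall straight out of the layer shapes. A quick back-of-the-envelope check in C++ (our own reading of the diagram above, not code from the repo):

#include <cstdio>

int main() {
    // Layer shapes from the diagram above.
    const long vocab = 4000, embed = 64, units = 256;

    const long embedding = vocab * embed;                     // 256,000
    const long lstm = 4 * ((embed + units) * units + units);  // 4 gates: 328,704
    const long dense = units * vocab + vocab;                 // weights + bias: 1,028,000

    const long params = embedding + lstm + dense;             // 1,612,704, i.e. ~1.6M
    printf("params: %ld, float32 size: %.2f MB\n",
           params, params * 4.0 / (1024 * 1024));             // ~6.15 MB
    return 0;
}

The dense layer alone is about two-thirds of the total, which is why vocabulary size ends up being the knob that matters (more on that below).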

Why Word-Level Instead of Character-Level?

We tried character-level first (predicting letters). Result: "goode mowrrre frieeend theeee watchhh" — technically Shakespearean, technically gibberish.

Word-level means the model predicts complete words: "Good morrow, friend! the watch must be vigilant..."

The tradeoff? Character-level needs only 68 vocab items. Word-level needs 4,000. That Dense layer (vocab × LSTM units) explodes the parameter count.

The Training Data Recipe

Shakespeare Comedies (71%): 1.45M characters from 12 plays

  • Much Ado About Nothing, Twelfth Night, Midsummer Night’s Dream, etc.
  • Stripped stage directions, character labels, etc.

Dogberry Dialogue (29%): 145K characters

  • Original lines from Much Ado About Nothing
  • Oversampled 4× to prevent Shakespeare from drowning out his voice

Total: 2M characters, ~400K words

Why comedies? Dogberry is comedic. Feeding the model tragedies would give us an existentially depressed constable.

The Vocabulary Dilemma

We wanted 8,000 words. The math said no:

Dense layer = vocab_size × lstm_units × 4 bytes
            = 8000 × 256 × 4
            = 8.2MB

Problem: We only have 8MB PSRAM total.

Dropping to 4,000 words brings the whole model down to 6.15MB. Fits with breathing room.

Tokenization: r"\w+|[.,!?;:]" — words and punctuation as separate tokens, all lowercase.
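
On the ESP32 there is no regex engine, so the same rule has to be reproduced by hand (more on that below). A minimal C++ sketch of what that looks like, under the assumption that the firmware’s tokenizer behaves like the Python pattern on ASCII text:

#include <cctype>
#include <string>
#include <vector>

// Mirrors r"\w+|[.,!?;:]" for ASCII text: runs of word characters become one
// lowercased token, listed punctuation becomes a single-character token,
// and everything else (whitespace, quotes, dashes) is skipped.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    const std::string punct = ".,!?;:";
    size_t i = 0;
    while (i < text.size()) {
        unsigned char c = text[i];
        if (std::isalnum(c) || c == '_') {
            std::string word;
            while (i < text.size() &&
                   (std::isalnum((unsigned char)text[i]) || text[i] == '_')) {
                word += (char)std::tolower((unsigned char)text[i]);
                ++i;
            }
            tokens.push_back(word);
        } else if (punct.find((char)c) != std::string::npos) {
            tokens.push_back(std::string(1, (char)c));
            ++i;
        } else {
            ++i;
        }
    }
    return tokens;
}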

Training: Two Phases

Proof of Concept (20 epochs, 90 minutes)

First, prove word-level even works on this corpus.

  • Started at 3.75 validation loss
  • Ended at 1.32 validation loss
  • 50.78% accuracy predicting the next word

For context: random guessing = 0.025% (1 in 4000). The model learned something.

Production (60 epochs, 4.5 hours)

Same architecture, trained longer with early stopping and learning rate scheduling.

  • Final validation loss: 1.02 (down from the PoC’s 1.32, roughly a 23% improvement)
  • ~60% accuracy on validation set
  • Model saved every time validation improved

We used the Adam optimizer (lr=0.002), a batch size of 128, and dropout (0.3) to prevent overfitting on our relatively small corpus.

Getting It On The ESP32

Challenge 1: The Firmware Was Too Big

The 7.39MB firmware (code + model) won’t fit in the ESP32’s default 6.5MB app partition.

Solution: A custom partition table allocating 10MB to the app (the T-Display-S3 has 16MB of flash total). The firmware now uses roughly 74% of that partition.
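
The actual table lives in the repo; a layout along these lines would do it (the partition names and the size of the data partition here are our assumptions, not the repo’s):

# Name,     Type, SubType, Offset,   Size
nvs,        data, nvs,     0x9000,   0x6000
phy_init,   data, phy,     0xf000,   0x1000
factory,    app,  factory, 0x10000,  10M
storage,    data, spiffs,  ,         5M

App partitions must start on a 64KB boundary, which is why the app begins at 0x10000 and the remainder of the 16MB flash is left for data.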

Challenge 2: Implementing LSTM Inference in C++

No TensorFlow Lite. No PyTorch Mobile. Just raw C++ matrix math.

We hand-coded the LSTM gates:

  • Input gate (what to remember)
  • Forget gate (what to discard)
  • Cell gate (candidate values)
  • Output gate (what to output)

Plus softmax, temperature sampling, and a tokenizer that matches Python’s regex without actually having regex.
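
To make the gate list above concrete, here is roughly what one LSTM time step looks like in plain C++. This is our own sketch, with our own names and weight layout, not the firmware’s actual code:

#include <math.h>

static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

// x: [input_dim]   h_prev, c_prev, h_out, c_out: [units]
// W: [4*units x input_dim]   U: [4*units x units]   b: [4*units]
// Row-major, with the four gates packed in i, f, g, o order.
void lstm_step(const float* x, const float* h_prev, const float* c_prev,
               float* h_out, float* c_out,
               const float* W, const float* U, const float* b,
               int input_dim, int units) {
    for (int j = 0; j < units; ++j) {
        float pre[4];                               // i, f, g, o pre-activations
        for (int gi = 0; gi < 4; ++gi) {
            int row = gi * units + j;
            float acc = b[row];
            for (int k = 0; k < input_dim; ++k) acc += W[row * input_dim + k] * x[k];
            for (int k = 0; k < units; ++k)     acc += U[row * units + k] * h_prev[k];
            pre[gi] = acc;
        }
        float in_g  = sigmoidf(pre[0]);             // input gate: what to remember
        float fg_g  = sigmoidf(pre[1]);             // forget gate: what to discard
        float cand  = tanhf(pre[2]);                // cell gate: candidate values
        float out_g = sigmoidf(pre[3]);             // output gate: what to expose
        c_out[j] = fg_g * c_prev[j] + in_g * cand;
        h_out[j] = out_g * tanhf(c_out[j]);
    }
}

With 256 units and a 4,000-word dense layer on top, it is these inner loops, plus the dense matrix multiply after them, that set the 1-2 words/second pace reported below.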

Challenge 3: Memory Management

Model weights live in Flash. LSTM state (hidden and cell) allocated in PSRAM using ps_malloc(). Temporary buffers (embeddings, logits, probabilities) are malloc’d and freed every generation to avoid fragmentation.
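
A sketch of that split, using the Arduino-ESP32 ps_malloc() call for the PSRAM allocations (the names and structure here are ours, not the repo’s):

#include <Arduino.h>
#include <string.h>

static const int UNITS = 256;    // LSTM memory cells
static const int VOCAB = 4000;   // output vocabulary

static float* h_state = nullptr; // hidden state, persistent, lives in PSRAM
static float* c_state = nullptr; // cell state, persistent, lives in PSRAM

void setupModelState() {
    h_state = (float*)ps_malloc(UNITS * sizeof(float));
    c_state = (float*)ps_malloc(UNITS * sizeof(float));
    memset(h_state, 0, UNITS * sizeof(float));
    memset(c_state, 0, UNITS * sizeof(float));
}

void generateReply() {
    // Short-lived working buffers: allocated fresh and freed after every
    // generation so the heap does not fragment over days of uptime.
    float* logits = (float*)malloc(VOCAB * sizeof(float));
    float* probs  = (float*)malloc(VOCAB * sizeof(float));

    // ... embedding lookup, lstm_step() over the 40-word context,
    //     dense layer into logits, softmax + sampling into the reply ...

    free(probs);
    free(logits);
}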

What It Actually Generates

Input: “Good morrow, friend!”
Output: “Good morrow, friend! i am glad to see thee, and ile be gone, and i will heare…”

Input: “The watch must be vigilant”
Output: “The watch must be vigilant in the world! see some vagrom fellows might question whether i am an honest man…”

Not perfect grammar—but Dogberry himself mangles language. The quirks add character.

Performance Stats

  • Generation speed: 1-2 words/second (CPU-bound on dense layer multiplication)
  • Response time: ~20-30 seconds for a 40-word reply
  • Power: Always-on via USB, ~200mA average
  • Polling: checks Bluesky for new mentions every 60 seconds

What We Learned

  1. Vocabulary is the bottleneck. That dense layer dominates memory usage. Cutting vocab in half saved 5MB.
  2. Train longer. The jump from 20 to 60 epochs was worth it—validation loss dropped from 1.32 to 1.02 with the same architecture.
  3. Corpus balance matters. Without oversampling Dogberry 4×, he’d sound like generic Shakespeare. The personality comes from that 29% minority representation.
  4. Word-level is mandatory for coherent output. Character-level might work for larger models, but not at this scale.
  5. Temperature sampling is magic. Temperature=0.8 gives creative variety without descending into chaos. Lower = repetitive, higher = nonsense.
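
Point 5 is easy to see in code. A sketch of softmax-with-temperature sampling (again our own code, not the repo’s): dividing the logits by a temperature below 1 sharpens the distribution toward the most likely word, while values above 1 flatten it toward noise.

#include <math.h>
#include <stdlib.h>

// Returns the index of the sampled word. probs is a caller-provided scratch
// buffer of length vocab; on the ESP32, esp_random() could replace rand().
int sampleWithTemperature(const float* logits, float* probs,
                          int vocab, float temperature) {
    // Softmax over logits / T, subtracting the max logit for numerical stability.
    float maxLogit = logits[0];
    for (int i = 1; i < vocab; ++i) if (logits[i] > maxLogit) maxLogit = logits[i];

    float sum = 0.0f;
    for (int i = 0; i < vocab; ++i) {
        probs[i] = expf((logits[i] - maxLogit) / temperature);
        sum += probs[i];
    }

    // Draw a uniform number in [0, sum) and walk the cumulative distribution.
    float r = (float)rand() / ((float)RAND_MAX + 1.0f) * sum;
    float cum = 0.0f;
    for (int i = 0; i < vocab; ++i) {
        cum += probs[i];
        if (r < cum) return i;
    }
    return vocab - 1; // fallback for floating-point rounding
}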

Future Improvements

If we wanted to take this further:

  • Quantize to int8: Model would shrink to ~1.5MB, freeing memory for bigger vocab
  • Beam search: Generate multiple candidates, pick the best (slower but higher quality)
  • Attention mechanism: Would blow the memory budget, but worth exploring on ESP32-S3 with 16MB PSRAM variants
  • Dynamic vocabulary: Keep only Dogberry-specific words + top 2K common words

The Bottom Line

You can run real language models on microcontrollers. The constraints force creativity: we can’t throw compute at the problem, so we curate data, tune architectures, and accept quirky outputs as features.

The bot is live on Bluesky, generating Shakespearean speech entirely on-device, powered by 8 megabytes of RAM and stubbornness.