Artificial Intelligence is no longer limited to cloud servers, GPUs, or high-performance Linux systems. Modern microcontrollers are now capable of running optimized machine learning inference directly at the edge — with surprisingly small memory and power budgets.
This shift has created one of the fastest-growing domains in embedded systems engineering: TinyML — Machine Learning on Microcontrollers.
Today, embedded systems can perform gesture recognition, anomaly detection, predictive maintenance, audio classification, sensor fusion, keyword spotting, and environmental monitoring using only a low-power STM32 microcontroller.
In this article, we will build a complete TinyML workflow using STM32, Edge Impulse, TensorFlow Lite for Microcontrollers, and STM32CubeIDE.
Unlike most beginner-level tutorials, this guide focuses heavily on embedded system constraints, memory optimization, RTOS integration, real-time architecture, inference performance, and production-oriented design considerations.
If you already understand STM32 and FreeRTOS fundamentals, this article will help you enter the rapidly growing world of embedded AI systems.
What is TinyML?
TinyML refers to running machine learning models directly on resource-constrained embedded hardware such as ARM Cortex-M microcontrollers, battery-powered IoT devices, and low-power edge sensors. Instead of sending raw sensor data to the cloud, inference happens locally on the device itself.
Traditional AI workflow:
Sensor → Cloud → AI Processing → Response
TinyML workflow:
Sensor → Local MCU Inference → Immediate Action
This architecture dramatically reduces latency, bandwidth usage, cloud dependency, and power consumption — while improving privacy, responsiveness, and offline capability.
Why TinyML Matters in Embedded Systems
The embedded industry is rapidly moving toward intelligent edge devices, local decision making, predictive analytics, and autonomous sensing systems. Modern IoT systems increasingly require faster response time, lower power consumption, real-time processing, and offline operation. TinyML solves these problems effectively.
Typical applications include industrial vibration analysis, wearable fitness devices, gesture-controlled interfaces, predictive maintenance systems, smart agriculture, and battery-powered monitoring systems.
Why STM32 is Excellent for TinyML
STM32 microcontrollers are well suited to TinyML. The Cortex-M4 and Cortex-M7 families provide DSP instructions and a hardware floating-point unit, and the wider ecosystem adds low-power modes, the CMSIS-DSP library, CMSIS-NN kernels, and mature tooling.
Popular STM32 families for TinyML:
- STM32L4 — ultra-low power, ideal for battery-powered applications
- STM32F4 — Cortex-M4 with FPU, excellent price/performance
- STM32F7 — Cortex-M7, higher performance workloads
- STM32H7 — ideal for heavier inference tasks
For practical TinyML workloads such as the gesture classifier in this article, Cortex-M4 and Cortex-M7 parts provide ample performance.
Why Use Edge Impulse?
Edge Impulse dramatically simplifies the TinyML development workflow. Instead of manually building datasets, preprocessing pipelines, TensorFlow training scripts, quantization logic, and deployment integration, you let Edge Impulse automate most of the pipeline.
Key features include browser-based ML workflow, sensor data collection, automatic feature extraction, model training, quantized TensorFlow Lite export, and embedded deployment packages. This significantly reduces development complexity for embedded engineers.
TinyML System Architecture
A real TinyML embedded system contains several stages:
Sensor →
Signal Acquisition →
Window Buffering →
Feature Extraction →
ML Inference →
Decision Logic →
Application Response
Example for gesture recognition:
Accelerometer →
FFT Processing →
Neural Network →
Gesture Classification →
LED Action / MQTT Event
Important: TinyML is much more than simply running a neural network. The surrounding embedded system design matters equally.
Project Overview
In this project, we will build a Gesture Recognition system using STM32 and Edge Impulse. The system will classify idle, left movement, right movement, and upward motion using accelerometer sensor data — without requiring cloud connectivity.
Hardware Requirements
STM32 Development Boards
- STM32 Nucleo-F446RE
- STM32 Nucleo-H743ZI
- STM32L476RG
Sensors
- MPU6050
- LIS3DH
- BMI160
Software Requirements
Install the following tools:
- STM32CubeIDE
- STM32CubeMX
- Python 3.x
- Node.js
- Edge Impulse CLI
Install Edge Impulse CLI:
npm install -g edge-impulse-cli
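Verify the installation by starting the daemon, which prompts for your Edge Impulse login before it looks for a connected device: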
edge-impulse-daemon
Understanding the TinyML Workflow
Step 1 — Sensor Data Acquisition
Sensor acquisition is the foundation of any TinyML system. The quality of collected data directly impacts model accuracy, robustness, and production reliability.
Example accelerometer read:
int16_t accel_x = read_accel_x();
int16_t accel_y = read_accel_y();
int16_t accel_z = read_accel_z();
Important Dataset Collection Advice
One of the biggest beginner mistakes is collecting overly clean laboratory data. Real embedded systems operate under vibration, electrical noise, different orientations, varying temperatures, and inconsistent user behavior.
Always collect data from multiple users, different movement speeds, noisy conditions, and orientation variations. This dramatically improves inference reliability.
Sampling Considerations
| Parameter | Typical Value |
|---|---|
| Sampling Rate | 50–200 Hz |
| Window Size | 1–2 seconds |
| Window Overlap | 25–50% |
| Quantization | INT8 |
Higher sampling rates improve signal quality but increase RAM usage, preprocessing cost, model input size, and inference latency. Example calculation:
100 Hz × 3 axes × 2 bytes × 1 second = 600 bytes of raw sensor data
This is only the raw window. Converting samples to float32, keeping overlapping windows, and holding the extracted features all add to the total, so buffer sizes need to be budgeted early on a microcontroller.
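The window arithmetic is easy to keep honest in code. A minimal sketch, assuming the classifier consumes samples as float32 (Edge Impulse's signal callback hands data over as float values). `window_bytes` is a hypothetical helper introduced for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to hold one window of samples.
constexpr std::size_t window_bytes(std::size_t sample_rate_hz,
                                   std::size_t axes,
                                   std::size_t bytes_per_value,
                                   std::size_t window_ms) {
    return sample_rate_hz * window_ms / 1000 * axes * bytes_per_value;
}

// Raw int16 window: 100 Hz, 3 axes, 1 second.
constexpr std::size_t kRawBytes = window_bytes(100, 3, sizeof(int16_t), 1000);
// The same window once converted to float32 for the classifier.
constexpr std::size_t kFloatBytes = window_bytes(100, 3, sizeof(float), 1000);

static_assert(kRawBytes == 600, "matches the hand calculation");
static_assert(kFloatBytes == 1200, "float32 doubles the buffer");
```

Putting the numbers in `static_assert` means a later change to the sampling rate or window length shows up at build time rather than as a silent RAM overrun.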
Step 2 — Designing the Edge Impulse Pipeline
Inside Edge Impulse: create a new project, upload collected sensor data, and configure the impulse.
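A typical configuration is shown below. Note that a 200 ms increase on a 1000 ms window makes consecutive windows overlap by 80%, which is denser than the 25 to 50% listed in the sampling table and produces more training windows per recording: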
Input Window: 1000ms
Window Increase: 200ms
Processing Block: Spectral Features
Learning Block: Classification
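On the device, the same window and increase can be reproduced with a sliding buffer. The sketch below assumes 100 Hz sampling and three axes, so a 1000 ms window is 100 samples and a 200 ms increase is 20 samples. `SlidingWindow` is a hypothetical class, not an Edge Impulse type:

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Assumed values: 100 Hz sampling, 3 axes, 1000 ms window, 200 ms increase.
constexpr std::size_t kAxes = 3;
constexpr std::size_t kWindowSamples = 100;   // 1000 ms at 100 Hz
constexpr std::size_t kIncreaseSamples = 20;  // 200 ms at 100 Hz

class SlidingWindow {
public:
    // Append one 3-axis sample. Returns true when a full window is ready.
    bool push(float x, float y, float z) {
        if (count_ == kWindowSamples) {
            // Drop the oldest kIncreaseSamples samples and keep the rest.
            std::memmove(buf_.data(),
                         buf_.data() + kIncreaseSamples * kAxes,
                         (kWindowSamples - kIncreaseSamples) * kAxes * sizeof(float));
            count_ = kWindowSamples - kIncreaseSamples;
        }
        float* dst = buf_.data() + count_ * kAxes;
        dst[0] = x;
        dst[1] = y;
        dst[2] = z;
        ++count_;
        return count_ == kWindowSamples;
    }

    // Interleaved x, y, z values of the current window, oldest first.
    const float* data() const { return buf_.data(); }

private:
    std::array<float, kWindowSamples * kAxes> buf_{};
    std::size_t count_ = 0;
};
```

With these numbers the first window is ready after 100 samples and a new one every 20 samples after that. The `memmove` keeps the data contiguous, which is simple but costs a copy per window. A ring buffer avoids the copy at the price of a more involved read path.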
Why Feature Extraction Matters
TinyML performance depends heavily on preprocessing quality. Instead of feeding raw accelerometer samples to the network, the spectral features block uses an FFT to move the signal into the frequency domain, where the dominant patterns of each gesture are easier to separate. That usually lets a smaller network do the job. On resource-constrained MCUs, preprocessing often matters more than the neural network itself.
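To make the idea concrete, here is a small sketch of two frequency-aware features computed on one axis: the RMS energy and the amplitude at a single DFT bin. This is not the Edge Impulse Spectral Features block, only an illustration of the kind of values such a block produces. Both function names are hypothetical:

```cpp
#include <cmath>
#include <cstddef>

// Root-mean-square of one axis: a simple energy feature.
float rms(const float* x, std::size_t n) {
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum_sq += x[i] * x[i];
    return std::sqrt(sum_sq / static_cast<float>(n));
}

// Amplitude at DFT bin k, computed directly for that one bin.
// For a pure sinusoid that falls exactly on bin k (0 < k < n/2),
// this returns the sinusoid's amplitude.
float bin_amplitude(const float* x, std::size_t n, std::size_t k) {
    const float kTwoPi = 6.28318530718f;
    float re = 0.0f;
    float im = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // Reduce k*i modulo n so the phase stays small and accurate.
        const float phase =
            kTwoPi * static_cast<float>((k * i) % n) / static_cast<float>(n);
        re += x[i] * std::cos(phase);
        im -= x[i] * std::sin(phase);
    }
    return 2.0f * std::sqrt(re * re + im * im) / static_cast<float>(n);
}
```

With 100 samples at 100 Hz, bin k corresponds to k Hz, so a 5 Hz hand movement shows up as a peak at bin 5. A real implementation would compute all bins at once with an FFT (for example from CMSIS-DSP) rather than one bin at a time.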
Step 3 — Model Training
Edge Impulse automatically creates a TensorFlow model, quantized inference model, and embedded deployment package.
Typical network architecture:
Input Layer →
Dense Layer →
ReLU →
Dense Layer →
Softmax Output
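The layers in this architecture are small enough to write out by hand. The following float sketch shows what each stage computes. It is for understanding only, since the exported model runs quantized kernels instead, and the function names are hypothetical:

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Fully connected layer: out = W * in + b, with W stored row-major (Out x In).
template <std::size_t In, std::size_t Out>
std::array<float, Out> dense(const std::array<float, In>& in,
                             const std::array<float, Out * In>& w,
                             const std::array<float, Out>& b) {
    std::array<float, Out> out{};
    for (std::size_t o = 0; o < Out; ++o) {
        float acc = b[o];
        for (std::size_t i = 0; i < In; ++i) acc += w[o * In + i] * in[i];
        out[o] = acc;
    }
    return out;
}

// ReLU: clamp negative activations to zero.
template <std::size_t N>
void relu(std::array<float, N>& v) {
    for (float& x : v) x = (x > 0.0f) ? x : 0.0f;
}

// Numerically stable softmax: subtract the max before exponentiating.
template <std::size_t N>
void softmax(std::array<float, N>& v) {
    float max_v = v[0];
    for (float x : v) max_v = (x > max_v) ? x : max_v;
    float sum = 0.0f;
    for (float& x : v) {
        x = std::exp(x - max_v);
        sum += x;
    }
    for (float& x : v) x /= sum;
}
```

The dense layers dominate both flash (the weight matrix) and compute (one multiply-accumulate per weight), which is why layer width is the first thing to reduce when a model does not fit.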
Understanding Quantization
Quantization converts the model's float32 weights and activations to int8:
| Data Type | Size |
|---|---|
| float32 | 4 bytes |
| int8 | 1 byte |
This provides a 4x memory reduction — critical for STM32 systems. Benefits include lower RAM usage, smaller flash footprint, faster inference, and reduced power consumption.
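TensorFlow Lite's int8 scheme maps real values to integers with a scale and a zero point, so that real = scale × (q − zero_point). A minimal sketch of that mapping follows. The struct and function names are hypothetical, and in practice the converter picks the scale and zero point for each tensor:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine quantization parameters: real = scale * (q - zero_point).
struct QuantParams {
    float scale;
    int32_t zero_point;
};

// Map a real value to int8, rounding to nearest and saturating at the limits.
int8_t quantize(float real, QuantParams p) {
    const long q = std::lround(real / p.scale) + p.zero_point;
    return static_cast<int8_t>(std::min(127L, std::max(-128L, q)));
}

// Map an int8 value back to the real value it represents.
float dequantize(int8_t q, QuantParams p) {
    return p.scale * static_cast<float>(static_cast<int32_t>(q) - p.zero_point);
}
```

The round trip loses at most half a scale step for values inside the representable range, while values outside it are clipped. That clipping is why a representative dataset matters when the scale is chosen.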
Model Accuracy vs Embedded Constraints
A common misconception is that larger neural networks always perform better. In embedded systems, larger models increase RAM usage, inference latency, and power consumption, while RTOS responsiveness suffers. TinyML engineering is fundamentally about balancing accuracy against embedded constraints.
Step 4 — Exporting the Model
Edge Impulse can export a TensorFlow Lite Micro library, C++ inference package, and preprocessing code. Select C++ Library as the export format — this package integrates directly into STM32CubeIDE.
STM32 Project Structure
Typical project structure:
/Core
/Drivers
/EdgeImpulse
/TFLite
/Application
Separating ML components cleanly is important for maintainability, scalability, and firmware updates.
Step 5 — Running Inference on STM32
Core inference flow:
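#include "edge-impulse-sdk/classifier/ei_run_classifier.h"

// Assumption: features[] holds one window of EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE float samples.
signal_t signal;
numpy::signal_from_buffer(features, EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE, &signal);

ei_impulse_result_t result;
EI_IMPULSE_ERROR res = run_classifier(&signal, &result, false);  // false = no debug output
The classifier reads the input window through the signal, runs feature extraction and then neural network inference, and fills result with one probability per class. It returns EI_IMPULSE_OK on success, so check res before using the result.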
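Once the probabilities are available, the application usually wants a single answer: the most likely class, or nothing if the model is unsure. A minimal sketch, using a stand-in struct that mirrors the label and value fields of the SDK's `result.classification[]` entries so that it compiles on its own. `ClassScore` and `top_class` are hypothetical names:

```cpp
#include <cstddef>

// Stand-in for one entry of result.classification[] (label + probability).
// The real type comes from the Edge Impulse SDK; this mirror is an assumption.
struct ClassScore {
    const char* label;
    float value;
};

// Returns the index of the most probable class, or -1 if no class
// reaches min_confidence.
int top_class(const ClassScore* scores, std::size_t count, float min_confidence) {
    int best = -1;
    float best_value = min_confidence;
    for (std::size_t i = 0; i < count; ++i) {
        if (scores[i].value >= best_value) {
            best = static_cast<int>(i);
            best_value = scores[i].value;
        }
    }
    return best;
}
```

Returning -1 for low-confidence windows gives the decision logic an explicit "unknown" case, which tends to be safer than always acting on the largest probability.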
Understanding Tensor Arena
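TensorFlow Lite Micro avoids dynamic heap allocation: all tensors live in a single buffer called the tensor arena. When you use TensorFlow Lite Micro directly, you allocate this buffer yourself (the Edge Impulse C++ library normally manages its arena internally, but it still consumes RAM):
constexpr size_t tensor_arena_size = 60 * 1024;
// 16-byte alignment, as used in the TensorFlow Lite Micro examples
alignas(16) static uint8_t tensor_arena[tensor_arena_size];
If the arena is too small, tensor allocation fails and inference cannot start. If it is too large, it takes RAM that task stacks, queues, and sample buffers need, which can surface later as stack overflows and hard faults. A practical approach is to start generously, read back the actual usage (TensorFlow Lite Micro's MicroInterpreter exposes arena_used_bytes()), and then trim the arena to that figure plus a small margin. Production embedded systems require careful RAM budgeting.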
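One way to make the RAM budget explicit is to write it down as compile-time constants and let the compiler check it. The sketch below assumes an STM32F446RE with 128 KB of SRAM. Every figure other than the 60 KB arena is a placeholder that should be replaced with numbers from your own linker map:

```cpp
#include <cstddef>

// Assumed budget for an STM32F446RE (128 KB of SRAM). All entries except
// the arena size are placeholders, not measured values.
constexpr std::size_t kTotalRam        = 128 * 1024;
constexpr std::size_t kTensorArena     = 60 * 1024;
constexpr std::size_t kSampleBuffers   = 2 * (100 * 3 * sizeof(float));  // two float32 windows
constexpr std::size_t kRtosHeap        = 24 * 1024;  // task stacks, queues
constexpr std::size_t kMainStackAndBss = 16 * 1024;  // main stack, globals

constexpr std::size_t kRamUsed =
    kTensorArena + kSampleBuffers + kRtosHeap + kMainStackAndBss;
constexpr std::size_t kRamHeadroom = kTotalRam - kRamUsed;

// Fail the build, not the device, when the budget is exceeded.
static_assert(kRamUsed <= kTotalRam, "RAM budget exceeded");
static_assert(kRamHeadroom >= 8 * 1024, "keep at least 8 KB of headroom");
```

The check is only as good as the numbers fed into it, so it complements the linker map and runtime stack watermarking rather than replacing them.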
Measuring TinyML Performance
| Metric | Importance |
|---|---|
| Inference Time | Real-time response |
| RAM Usage | System stability |
| Flash Usage | Firmware size |
| CPU Load | RTOS responsiveness |
| Power Consumption | Battery life |
Example Benchmark — STM32F446RE
- Model Size: 45 KB
- Tensor Arena: 60 KB
- Inference Time: ~18 ms
- CPU Clock: 180 MHz
This performance is sufficient for gesture recognition, vibration analysis, and anomaly detection.
Integrating TinyML with FreeRTOS
TinyML works extremely well alongside FreeRTOS. Recommended architecture:
Sensor ISR →
Queue →
Inference Task →
Decision Task →
Cloud / UI / Logging
This architecture maintains deterministic sampling, RTOS responsiveness, and modularity.
FreeRTOS Task Priorities
| Task | Priority |
|---|---|
| Sensor Sampling | High |
| Inference Task | Medium |
| MQTT Publish | Medium |
| Logging | Low |
Inference should never block critical interrupts, watchdog servicing, or real-time control loops.
Example Inference Task
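/* Needs FreeRTOS.h, queue.h, string.h and the Edge Impulse ei_run_classifier.h header. */
void vInferenceTask(void *pvParameters)
{
    (void)pvParameters;
    /* Assumption: each queue item is one reading of EI_CLASSIFIER_RAW_SAMPLES_PER_FRAME floats. */
    float sample[EI_CLASSIFIER_RAW_SAMPLES_PER_FRAME];
    static float features[EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE];
    size_t filled = 0;

    while (1)
    {
        if (xQueueReceive(xSensorQueue, sample, portMAX_DELAY) != pdTRUE)
            continue;

        /* Collect readings until one full window is available. */
        memcpy(&features[filled], sample, sizeof(sample));
        filled += EI_CLASSIFIER_RAW_SAMPLES_PER_FRAME;
        if (filled < EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE)
            continue;
        filled = 0;

        signal_t signal;
        ei_impulse_result_t result;
        numpy::signal_from_buffer(features, EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE, &signal);
        if (run_classifier(&signal, &result, false) != EI_IMPULSE_OK)
            continue;

        /* Index 0 is the first label in the exported model; use the index of the gesture you want. */
        if (result.classification[0].value > 0.8f)
        {
            trigger_event();
        }
    }
}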
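Acting on a single window can cause false triggers, especially when windows overlap and one noisy window crosses the threshold. One common remedy is to require the same class in several consecutive windows before firing. A minimal sketch of such a filter for the decision task follows. `GestureDebouncer` is a hypothetical class and the streak length is a tuning assumption:

```cpp
// Fires once when the same class has been seen in `required` consecutive
// windows. `required` must be at least 1.
class GestureDebouncer {
public:
    explicit GestureDebouncer(int required) : required_(required) {}

    // class_index: winning class for this window, or -1 for "no confident class".
    // Returns true exactly once per stable detection.
    bool update(int class_index) {
        if (class_index < 0 || class_index != last_) {
            // New class or no class: restart the streak.
            last_ = class_index;
            streak_ = (class_index >= 0) ? 1 : 0;
        } else if (streak_ <= required_) {
            // Same class again; stop counting once past the trigger point.
            ++streak_;
        }
        return streak_ == required_;
    }

private:
    int required_;
    int last_ = -1;
    int streak_ = 0;
};
```

The trade-off is latency. With a 200 ms window increase, requiring three consecutive windows delays the event by roughly 400 ms after the first detection.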
TinyML Optimization Techniques
1. CMSIS-NN Acceleration
ARM CMSIS-NN provides optimized neural network kernels with SIMD acceleration, lower latency, and improved efficiency. This significantly improves Cortex-M performance.
2. INT8 Quantization
Always prefer INT8 inference unless model accuracy becomes unacceptable.
3. DSP Optimization
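Fixed-point DSP can outperform floating point on Cortex-M systems: cores without an FPU emulate floating point in software, and on Cortex-M4 and M7 the SIMD instructions can process two 16-bit values per operation. Key optimization areas include FFT, filtering, normalization, and feature extraction.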
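For readers new to fixed point, the sketch below shows the Q15 format that many 16-bit DSP routines use: a signed 16-bit integer with 15 fractional bits. It illustrates the arithmetic only. In a real project the optimized CMSIS-DSP routines would be used instead, and the function names here are hypothetical:

```cpp
#include <cstdint>

// Q15: a 16-bit signed value representing [-1, 1) with 15 fractional bits.
using q15 = int16_t;

// Convert a float to Q15, saturating inputs outside [-1, 1).
q15 float_to_q15(float x) {
    const float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) return 32767;
    if (scaled <= -32768.0f) return -32768;
    return static_cast<q15>(scaled);  // truncates toward zero
}

// Multiply two Q15 values: 32-bit product, shift back by 15, saturate.
// Note: right-shifting a negative value is arithmetic on ARM GCC, but
// implementation-defined in C++ before C++20.
q15 q15_mul(q15 a, q15 b) {
    int32_t p = (static_cast<int32_t>(a) * b) >> 15;
    if (p > 32767) p = 32767;  // only -1 * -1 can overflow
    return static_cast<q15>(p);
}
```

For example, 0.5 is stored as 16384, and 0.5 × 0.5 yields 8192, which is 0.25 in Q15.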
Power Optimization for TinyML
TinyML workloads can increase CPU activity, sensor usage, and battery drain. Optimization techniques include reducing sampling frequency, running inference periodically, using interrupt-based wakeup, dynamic clock scaling, and burst processing.
Example Low-Power Architecture:
Sleep →
Motion Interrupt →
Wake MCU →
Capture Samples →
Run Inference →
Return to Sleep
This is ideal for wearables, battery-powered sensors, and smart trackers.
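The sleep and wake cycle maps naturally onto a small state machine. The sketch below shows only the control flow, assuming a 100-sample window. The hardware-specific parts (entering a low-power mode, configuring the motion interrupt, running the classifier) are left out, and all names are hypothetical:

```cpp
#include <cstddef>

enum class PowerState { Sleep, Capture, Infer };

// Assumed window length: 1 second at 100 Hz.
constexpr std::size_t kSamplesPerWindow = 100;

struct LowPowerController {
    PowerState state = PowerState::Sleep;
    std::size_t samples = 0;

    // Called when the sensor's motion interrupt wakes the MCU.
    void on_motion_interrupt() {
        if (state == PowerState::Sleep) {
            state = PowerState::Capture;
            samples = 0;
        }
    }

    // Called once per acquired sample while awake.
    void on_sample() {
        if (state == PowerState::Capture && ++samples == kSamplesPerWindow) {
            state = PowerState::Infer;
        }
    }

    // Called after the classifier has run on the captured window.
    void on_inference_done() {
        if (state == PowerState::Infer) state = PowerState::Sleep;
    }
};
```

Keeping the transitions in one place makes it easier to verify that every path leads back to sleep, which matters more for battery life than shaving milliseconds off inference.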
Common Beginner Mistakes
- Oversized Models — Large models often exceed MCU memory limits
- Ignoring RAM Usage — RAM exhaustion causes hard faults, RTOS crashes, and stack corruption
- Poor Dataset Diversity — Real-world performance becomes unreliable
- Blocking Inference — Inference must never run inside ISRs or high-priority timing loops
- Ignoring Quantization — Float models are usually impractical on MCUs
Real Production Challenges
Production TinyML systems must handle sensor drift, electrical noise, corrupted samples, watchdog recovery, firmware updates, and field calibration. TinyML is not only “training a model” — it requires robust embedded systems engineering.
Real-World TinyML Applications
- Industrial: predictive maintenance, motor vibration analysis, anomaly detection
- Consumer Electronics: gesture recognition, smart wearables, voice interfaces
- Automotive: sensor fusion, driver monitoring, cabin intelligence
- Medical Devices: portable diagnostics, biosignal analysis
Future of TinyML on STM32
The industry trend is clear — more AI at the edge, less cloud dependency, lower power intelligent systems, and local real-time inference. STM32 combined with TensorFlow Lite Micro, CMSIS-NN, Edge Impulse, and FreeRTOS creates an extremely powerful embedded AI platform.
Engineers who understand TinyML, RTOS, DSP, low-power optimization, and embedded architecture will be highly valuable in the coming years.
Final Thoughts
TinyML is rapidly becoming a core part of modern embedded systems engineering. Today’s STM32 microcontrollers are fully capable of neural network inference, sensor classification, and intelligent decision making. However, the real challenge is not running the neural network itself. It is managing RAM, minimizing latency, maintaining RTOS responsiveness, optimizing power, and building reliable production systems.
That is where embedded engineering expertise becomes critical.