AI/ML · IoT · Edge Computing

Edge AI: Running ML Models on IoT Devices

Learn how to deploy machine learning models on edge and IoT devices, covering model optimization, hardware selection, frameworks like TensorFlow Lite, and real-world use cases.

August 29, 2024 · 6 min read


Sending every sensor reading to the cloud for processing is slow, expensive, and often impractical. A security camera generating 30 frames per second produces roughly 1.5 GB of data per hour. Multiply that by hundreds of cameras across a facility and the bandwidth costs alone become prohibitive, to say nothing of the latency. Edge AI solves this by running machine learning models directly on the device or on a nearby edge gateway, processing data where it is generated.

The applications are compelling: real-time quality inspection on the factory floor, predictive maintenance on remote wind turbines, autonomous navigation for drones, and intelligent traffic management at intersections. This post covers the practical challenges of deploying ML models on resource-constrained edge devices and the techniques that make it work.

The Edge AI Stack

Running ML inference on an edge device involves several layers, each with its own constraints:

Hardware ranges from microcontrollers with kilobytes of RAM (Arduino, ESP32) to edge GPUs with significant compute power (NVIDIA Jetson series, Google Coral). Your model architecture and latency requirements determine which hardware tier you need.

Runtime/framework provides the inference engine optimized for the target hardware. Common options include:

  • TensorFlow Lite for microcontrollers and mobile devices
  • ONNX Runtime for cross-platform deployment
  • TensorRT for NVIDIA edge GPUs
  • Apache TVM for compiler-based optimization across hardware targets
  • OpenVINO for Intel hardware (CPUs, VPUs, FPGAs)

Model must be optimized to fit within the device's memory and compute budget while maintaining acceptable accuracy. This is where most of the engineering effort goes.

Application logic handles pre-processing, inference orchestration, post-processing, and communication with cloud services for aggregated analytics and model updates.

Model Optimization Techniques

A model trained on a GPU server with 80 GB of VRAM will not run on a device with 512 MB of RAM and no GPU. Several optimization techniques bridge this gap, each trading some accuracy for significant reductions in model size and inference latency.

Quantization reduces the numerical precision of model weights and activations from 32-bit floating point to 8-bit integers or even lower. Post-training quantization requires no additional training and typically reduces model size by 4x with less than 1% accuracy loss. Quantization-aware training inserts simulated quantization during training, achieving better accuracy at low bit-widths.

import numpy as np
import tensorflow as tf
 
# Post-training quantization with TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
 
# Full integer quantization requires a representative dataset:
# a few hundred samples drawn from your training or validation data
def representative_dataset():
    for data in calibration_data:  # calibration_data: your iterable of input arrays
        yield [data.astype(np.float32)]
 
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
 
tflite_model = converter.convert()
 
# Save: typically 4x smaller than the float32 original
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)

Pruning removes weights that contribute least to the model's output, creating a sparse network that requires fewer computations. Structured pruning removes entire filters or attention heads, which maps directly to faster execution on standard hardware. Unstructured pruning removes individual weights and requires sparse matrix support in the runtime to realize speed benefits.
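The core idea of magnitude pruning can be sketched in a few lines of NumPy. This is the unstructured variant, applied to a single weight matrix; a real deployment would use a framework tool such as the TensorFlow Model Optimization Toolkit, which prunes gradually during training.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = magnitude of the k-th smallest weight (by absolute value)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.8)
print(f"sparsity: {np.mean(w_pruned == 0):.2%}")  # ~80% of weights are now zero
```

Note that zeroed weights only translate into speedups if the runtime exploits sparsity, which is exactly why structured pruning is often preferred on edge hardware.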

Knowledge distillation trains a small "student" model to mimic the outputs of a large "teacher" model. The student learns not just from the hard labels in the dataset but from the teacher's soft probability distributions, which contain more information about the relationships between classes. A distilled model can achieve 90-95% of the teacher's accuracy at a fraction of the size.
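The loss that drives distillation is worth seeing concretely. The sketch below uses hypothetical logits and NumPy only; the temperature `T` and mixing weight `alpha` are tuning knobs, following Hinton et al.'s formulation.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.7):
    """Blend soft-target cross-entropy (vs. the teacher) with hard-label cross-entropy."""
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    # Soft loss: cross-entropy against the teacher's softened distribution,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures
    soft_loss = -np.sum(soft_teacher * np.log(soft_student + 1e-12)) * T * T
    # Hard loss: ordinary cross-entropy against the ground-truth label
    hard_loss = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Illustrative logits for a 3-class problem
student = np.array([2.0, 1.0, 0.5])
teacher = np.array([5.0, 2.5, 1.0])
loss = distillation_loss(student, teacher, hard_label=0)
print(f"combined loss: {loss:.4f}")
```

A high temperature softens the teacher's distribution, exposing the inter-class relationships (the "dark knowledge") that hard labels discard.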

Architecture design starts with models built for efficiency rather than trying to shrink large models. MobileNet, EfficientNet-Lite, and SqueezeNet use depthwise separable convolutions and inverted residuals to achieve strong accuracy with far fewer parameters and FLOPs than standard architectures.

Hardware Selection Guide

Choosing the right edge hardware depends on your model complexity, latency requirements, power budget, and cost constraints.

Microcontrollers (Cortex-M, ESP32, RP2040):

  • Memory: 256KB - 2MB RAM
  • Suitable for: keyword spotting, simple anomaly detection, gesture recognition
  • Power: milliwatts, can run on batteries for years
  • Cost: $2-10 per unit
  • Framework: TensorFlow Lite Micro, Edge Impulse

Application processors (Raspberry Pi, i.MX8):

  • Memory: 1-8GB RAM
  • Suitable for: image classification, object detection (single camera), NLP on small models
  • Power: 2-15 watts
  • Cost: $15-75 per unit
  • Framework: TFLite, ONNX Runtime, PyTorch Mobile

Edge AI accelerators (Jetson Orin Nano, Google Coral, Hailo-8):

  • Memory: 4-16GB RAM (Jetson) or dedicated accelerator memory
  • Suitable for: real-time multi-camera video analytics, complex object detection, segmentation
  • Power: 5-25 watts
  • Cost: $100-500 per unit
  • Framework: TensorRT (Jetson), Edge TPU (Coral), Hailo SDK

Here is a practical example of running object detection on an NVIDIA Jetson device with TensorRT optimization:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import cv2
 
class EdgeDetector:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, "rb") as f:
            self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
 
        # Allocate page-locked host and device memory for each I/O tensor
        self.inputs, self.outputs, self.bindings = [], [], []
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            size = trt.volume(self.engine.get_tensor_shape(name))
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs.append({"host": host_mem, "device": device_mem})
            else:
                self.outputs.append({"host": host_mem, "device": device_mem})
 
    def detect(self, frame):
        # Preprocess: resize, HWC -> CHW, scale to [0, 1]
        input_image = cv2.resize(frame, (640, 640))
        input_image = input_image.transpose(2, 0, 1).astype(np.float32) / 255.0
 
        # Copy to device and run inference
        np.copyto(self.inputs[0]["host"], input_image.ravel())
        cuda.memcpy_htod(self.inputs[0]["device"], self.inputs[0]["host"])
        self.context.execute_v2(bindings=self.bindings)
        cuda.memcpy_dtoh(self.outputs[0]["host"], self.outputs[0]["device"])
 
        # Postprocessing (box decoding, NMS) is model-specific and omitted here
        return self.postprocess(self.outputs[0]["host"])

Over-the-Air Model Updates

Edge devices need to be updated without physical access. A robust OTA update system for ML models includes:

  • Model versioning with rollback capability. If a new model performs poorly, the device should automatically revert to the previous version.
  • Incremental updates that transmit only the changed weights rather than the full model, reducing bandwidth requirements for devices on cellular or satellite connections.
  • A/B testing at the edge where a subset of devices runs the new model while the rest stay on the current version. Compare performance metrics before rolling out fleet-wide.
  • Health checks that validate the new model can load and produce reasonable outputs before activating it for production inference.

These policies can be captured in a device-side configuration, for example:

# Edge device model management configuration
model_management:
  current_model: "detector_v3.2.tflite"
  update_channel: "stable"
  check_interval_hours: 6
  rollback_policy:
    trigger: "accuracy_drop > 5% OR latency_increase > 20%"
    fallback_model: "detector_v3.1.tflite"
  telemetry:
    report_interval_minutes: 15
    metrics: ["inference_latency_p99", "detection_count", "confidence_distribution"]
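The health-check step can be sketched as a small gate that runs before a downloaded model is promoted. Everything below is illustrative: `load_model` and the sample inputs stand in for whatever runtime and known-good data your device actually uses.

```python
import numpy as np

def validate_candidate(load_model, sample_inputs, expected_classes, min_confidence=0.2):
    """Gate a newly downloaded model before activating it for production inference.

    load_model: callable returning a model with predict(x) -> class-probability array
    sample_inputs: a handful of known-good inputs stored on the device
    expected_classes: the class index each sample input should produce
    """
    try:
        model = load_model()          # 1. the model must load at all
    except Exception:
        return False
    for x, expected in zip(sample_inputs, expected_classes):
        probs = model.predict(x)      # 2. outputs must be well-formed probabilities
        if not np.isfinite(probs).all() or abs(probs.sum() - 1.0) > 1e-3:
            return False
        # 3. ...and sane on known inputs before the model sees live traffic
        if probs.argmax() != expected or probs.max() < min_confidence:
            return False
    return True
```

If validation fails, the rollback policy keeps the previous model version active and the device reports the failure via telemetry.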

Real-World Deployment Considerations

Thermal management is a practical concern that is easy to overlook. Running continuous inference generates heat, and edge devices in enclosed housings or outdoor environments can throttle performance or fail prematurely. Thermal profiling under sustained load is essential during hardware selection.

Power management for battery-powered devices requires careful orchestration. Run inference only when triggered by a low-power sensor (motion detector, sound level threshold) rather than continuously. Duty-cycle the AI accelerator to balance responsiveness with battery life.
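The duty-cycling trade-off can be estimated with simple arithmetic. The figures below (accelerator draw, sleep current, battery capacity) are illustrative, not measurements from any particular board:

```python
def battery_life_hours(capacity_mah, active_ma, sleep_ma, duty_cycle):
    """Estimated runtime given the fraction of time the accelerator is active."""
    avg_ma = duty_cycle * active_ma + (1 - duty_cycle) * sleep_ma
    return capacity_mah / avg_ma

# Illustrative numbers: 2000 mAh battery, 400 mA during inference, 2 mA asleep
for duty in (1.0, 0.1, 0.01):
    hours = battery_life_hours(2000, 400, 2, duty)
    print(f"duty cycle {duty:>5.0%}: {hours / 24:6.1f} days")
```

Continuous inference drains the battery in hours; triggering inference from a low-power sensor so the accelerator is active 1% of the time stretches the same battery to roughly two weeks.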

Environmental robustness matters for industrial and outdoor deployments. Devices must handle temperature extremes, humidity, vibration, and dust. Industrial-rated enclosures (IP65 or higher) and wide-temperature-range components are not optional for these environments.

Connectivity handling must account for intermittent or absent network connections. Edge devices should function fully offline, queuing telemetry and results for upload when connectivity is restored. Critical alerts may need redundant communication paths (cellular fallback, local mesh networking).
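A store-and-forward telemetry queue is the usual pattern here. A minimal in-memory sketch with a bounded buffer; a real device would persist the queue to flash, and `send` stands in for your actual uplink:

```python
from collections import deque

class TelemetryQueue:
    """Buffer readings while offline; flush oldest-first when the link returns."""

    def __init__(self, maxlen=10_000):
        # Bounded buffer: when full, the oldest readings are dropped first
        self.buffer = deque(maxlen=maxlen)

    def record(self, reading):
        self.buffer.append(reading)

    def flush(self, send, is_online):
        """Attempt upload; anything unsent stays queued for the next attempt."""
        sent = 0
        while self.buffer and is_online():
            send(self.buffer[0])
            self.buffer.popleft()  # remove only after a successful send
            sent += 1
        return sent
```

Dropping oldest-first is a policy choice: for most telemetry, recent readings matter more than stale ones, but critical alerts should bypass the queue entirely via a redundant path.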

Need help building this?

Our team specializes in turning these ideas into production systems. Let's talk.