AI/ML · IoT · Edge Computing

Edge AI: Running ML Models on IoT Devices

Learn how to deploy machine learning models on edge and IoT devices, covering model optimization, hardware selection, frameworks like TensorFlow Lite, and real-world use cases.

August 29, 2024 · 6 min read


Sending every sensor reading to the cloud for processing is slow, expensive, and often impractical. A security camera generating 30 frames per second produces roughly 1.5 GB of data per hour. Multiply that by hundreds of cameras across a facility and the bandwidth costs alone become prohibitive, to say nothing of the latency. Edge AI solves this by running machine learning models directly on the device or on a nearby edge gateway, processing data where it is generated.

The applications are compelling: real-time quality inspection on the factory floor, predictive maintenance on remote wind turbines, autonomous navigation for drones, and intelligent traffic management at intersections. This post covers the practical challenges of deploying ML models on resource-constrained edge devices and the techniques that make it work.

The Edge AI Stack

Running ML inference on an edge device involves several layers, each with its own constraints:

Hardware ranges from microcontrollers with kilobytes of RAM (Arduino, ESP32) to edge GPUs with significant compute power (NVIDIA Jetson series, Google Coral). Your model architecture and latency requirements determine which hardware tier you need.

Runtime/framework provides the inference engine optimized for the target hardware. Common options include:

  • TensorFlow Lite for microcontrollers and mobile devices
  • ONNX Runtime for cross-platform deployment
  • TensorRT for NVIDIA edge GPUs
  • Apache TVM for compiler-based optimization across hardware targets
  • OpenVINO for Intel hardware (CPUs, VPUs, FPGAs)

Model must be optimized to fit within the device's memory and compute budget while maintaining acceptable accuracy. This is where most of the engineering effort goes.

Application logic handles pre-processing, inference orchestration, post-processing, and communication with cloud services for aggregated analytics and model updates.

Model Optimization Techniques

A model trained on a GPU server with 80 GB of VRAM will not run on a device with 512 MB of RAM and no GPU. Several optimization techniques bridge this gap, each trading some accuracy for significant reductions in model size and inference latency.

Quantization reduces the numerical precision of model weights and activations from 32-bit floating point to 8-bit integers or even lower. Post-training quantization requires no additional training and typically reduces model size by 4x with less than 1% accuracy loss. Quantization-aware training inserts simulated quantization during training, achieving better accuracy at low bit-widths.

import numpy as np
import tensorflow as tf
 
# Post-training quantization with TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
 
# Full integer quantization requires a representative dataset:
# a few hundred samples drawn from your training or validation data
def representative_dataset():
    for data in calibration_data:  # calibration_data: your iterable of input arrays
        yield [data.astype(np.float32)]
 
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
 
tflite_model = converter.convert()
 
# Save: typically 4x smaller than the float32 original
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)

Pruning removes weights that contribute least to the model's output, creating a sparse network that requires fewer computations. Structured pruning removes entire filters or attention heads, which maps directly to faster execution on standard hardware. Unstructured pruning removes individual weights and requires sparse matrix support in the runtime to realize speed benefits.
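The core idea of magnitude pruning can be sketched in a few lines of NumPy. This is the unstructured variant, applied to a single weight matrix; a real deployment would use a framework tool such as the TensorFlow Model Optimization Toolkit, which prunes gradually during training.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = magnitude of the k-th smallest weight (by absolute value)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.8)
print(f"sparsity: {np.mean(w_pruned == 0):.2%}")  # ~80% of weights are now zero
```

Note that zeroed weights only translate into speedups if the runtime exploits sparsity, which is exactly why structured pruning is often preferred on edge hardware.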

Knowledge distillation trains a small "student" model to mimic the outputs of a large "teacher" model. The student learns not just from the hard labels in the dataset but from the teacher's soft probability distributions, which contain more information about the relationships between classes. A distilled model can achieve 90-95% of the teacher's accuracy at a fraction of the size.
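The loss that drives distillation is worth seeing concretely. The sketch below uses hypothetical logits and NumPy only; the temperature `T` and mixing weight `alpha` are tuning knobs, following Hinton et al.'s formulation.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.7):
    """Blend soft-target cross-entropy (vs. the teacher) with hard-label cross-entropy."""
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    # Soft loss: cross-entropy against the teacher's softened distribution,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures
    soft_loss = -np.sum(soft_teacher * np.log(soft_student + 1e-12)) * T * T
    # Hard loss: ordinary cross-entropy against the ground-truth label
    hard_loss = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Illustrative logits for a 3-class problem
student = np.array([2.0, 1.0, 0.5])
teacher = np.array([5.0, 2.5, 1.0])
loss = distillation_loss(student, teacher, hard_label=0)
print(f"combined loss: {loss:.4f}")
```

A high temperature softens the teacher's distribution, exposing the inter-class relationships (the "dark knowledge") that hard labels discard.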

Architecture design starts with models built for efficiency rather than trying to shrink large models. MobileNet, EfficientNet-Lite, and SqueezeNet use depthwise separable convolutions and inverted residuals to achieve strong accuracy with far fewer parameters and FLOPs than standard architectures.

Hardware Selection Guide

Choosing the right edge hardware depends on your model complexity, latency requirements, power budget, and cost constraints.

Microcontrollers (Cortex-M, ESP32, RP2040):

  • Memory: 256KB - 2MB RAM
  • Suitable for: keyword spotting, simple anomaly detection, gesture recognition
  • Power: milliwatts, can run on batteries for years
  • Cost: $2-10 per unit
  • Framework: TensorFlow Lite Micro, Edge Impulse

Application processors (Raspberry Pi, i.MX8):

  • Memory: 1-8GB RAM
  • Suitable for: image classification, object detection (single camera), NLP on small models
  • Power: 2-15 watts
  • Cost: $15-75 per unit
  • Framework: TFLite, ONNX Runtime, PyTorch Mobile

Edge AI accelerators (Jetson Orin Nano, Google Coral, Hailo-8):

  • Memory: 4-16GB RAM (Jetson) or dedicated accelerator memory
  • Suitable for: real-time multi-camera video analytics, complex object detection, segmentation
  • Power: 5-25 watts
  • Cost: $100-500 per unit
  • Framework: TensorRT (Jetson), Edge TPU (Coral), Hailo SDK

Here is a practical example of running object detection on an NVIDIA Jetson device with TensorRT optimization:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import cv2
 
class EdgeDetector:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, "rb") as f:
            self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
 
        # Allocate page-locked host and device memory for each I/O tensor
        self.inputs, self.outputs, self.bindings = [], [], []
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            size = trt.volume(self.engine.get_tensor_shape(name))
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs.append({"host": host_mem, "device": device_mem})
            else:
                self.outputs.append({"host": host_mem, "device": device_mem})
 
    def detect(self, frame):
        # Preprocess: resize, HWC -> CHW, scale to [0, 1]
        input_image = cv2.resize(frame, (640, 640))
        input_image = input_image.transpose(2, 0, 1).astype(np.float32) / 255.0
 
        # Copy to device and run inference
        np.copyto(self.inputs[0]["host"], input_image.ravel())
        cuda.memcpy_htod(self.inputs[0]["device"], self.inputs[0]["host"])
        self.context.execute_v2(bindings=self.bindings)
        cuda.memcpy_dtoh(self.outputs[0]["host"], self.outputs[0]["device"])
 
        # Postprocessing (box decoding, NMS) is model-specific and omitted here
        return self.postprocess(self.outputs[0]["host"])

Over-the-Air Model Updates

Edge devices need to be updated without physical access. A robust OTA update system for ML models includes:

  • Model versioning with rollback capability. If a new model performs poorly, the device should automatically revert to the previous version.
  • Incremental updates that transmit only the changed weights rather than the full model, reducing bandwidth requirements for devices on cellular or satellite connections.
  • A/B testing at the edge where a subset of devices runs the new model while the rest stay on the current version. Compare performance metrics before rolling out fleet-wide.
  • Health checks that validate the new model can load and produce reasonable outputs before activating it for production inference.

These policies can be captured in a device-side configuration, for example:

# Edge device model management configuration
model_management:
  current_model: "detector_v3.2.tflite"
  update_channel: "stable"
  check_interval_hours: 6
  rollback_policy:
    trigger: "accuracy_drop > 5% OR latency_increase > 20%"
    fallback_model: "detector_v3.1.tflite"
  telemetry:
    report_interval_minutes: 15
    metrics: ["inference_latency_p99", "detection_count", "confidence_distribution"]
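The health-check step can be sketched as a small gate that runs before a downloaded model is promoted. Everything below is illustrative: `load_model` and the sample inputs stand in for whatever runtime and known-good data your device actually uses.

```python
import numpy as np

def validate_candidate(load_model, sample_inputs, expected_classes, min_confidence=0.2):
    """Gate a newly downloaded model before activating it for production inference.

    load_model: callable returning a model with predict(x) -> class-probability array
    sample_inputs: a handful of known-good inputs stored on the device
    expected_classes: the class index each sample input should produce
    """
    try:
        model = load_model()          # 1. the model must load at all
    except Exception:
        return False
    for x, expected in zip(sample_inputs, expected_classes):
        probs = model.predict(x)      # 2. outputs must be well-formed probabilities
        if not np.isfinite(probs).all() or abs(probs.sum() - 1.0) > 1e-3:
            return False
        # 3. ...and sane on known inputs before the model sees live traffic
        if probs.argmax() != expected or probs.max() < min_confidence:
            return False
    return True
```

If validation fails, the rollback policy keeps the previous model version active and the device reports the failure via telemetry.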

Real-World Deployment Considerations

Thermal management is a practical concern that is easy to overlook. Running continuous inference generates heat, and edge devices in enclosed housings or outdoor environments can throttle performance or fail prematurely. Thermal profiling under sustained load is essential during hardware selection.

Power management for battery-powered devices requires careful orchestration. Run inference only when triggered by a low-power sensor (motion detector, sound level threshold) rather than continuously. Duty-cycle the AI accelerator to balance responsiveness with battery life.
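The duty-cycling trade-off can be estimated with simple arithmetic. The figures below (accelerator draw, sleep current, battery capacity) are illustrative, not measurements from any particular board:

```python
def battery_life_hours(capacity_mah, active_ma, sleep_ma, duty_cycle):
    """Estimated runtime given the fraction of time the accelerator is active."""
    avg_ma = duty_cycle * active_ma + (1 - duty_cycle) * sleep_ma
    return capacity_mah / avg_ma

# Illustrative numbers: 2000 mAh battery, 400 mA during inference, 2 mA asleep
for duty in (1.0, 0.1, 0.01):
    hours = battery_life_hours(2000, 400, 2, duty)
    print(f"duty cycle {duty:>5.0%}: {hours / 24:6.1f} days")
```

Continuous inference drains the battery in hours; triggering inference from a low-power sensor so the accelerator is active 1% of the time stretches the same battery to roughly two weeks.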

Environmental robustness matters for industrial and outdoor deployments. Devices must handle temperature extremes, humidity, vibration, and dust. Industrial-rated enclosures (IP65 or higher) and wide-temperature-range components are not optional for these environments.

Connectivity handling must account for intermittent or absent network connections. Edge devices should function fully offline, queuing telemetry and results for upload when connectivity is restored. Critical alerts may need redundant communication paths (cellular fallback, local mesh networking).
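A store-and-forward telemetry queue is the usual pattern here. A minimal in-memory sketch with a bounded buffer; a real device would persist the queue to flash, and `send` stands in for your actual uplink:

```python
from collections import deque

class TelemetryQueue:
    """Buffer readings while offline; flush oldest-first when the link returns."""

    def __init__(self, maxlen=10_000):
        # Bounded buffer: when full, the oldest readings are dropped first
        self.buffer = deque(maxlen=maxlen)

    def record(self, reading):
        self.buffer.append(reading)

    def flush(self, send, is_online):
        """Attempt upload; anything unsent stays queued for the next attempt."""
        sent = 0
        while self.buffer and is_online():
            send(self.buffer[0])
            self.buffer.popleft()  # remove only after a successful send
            sent += 1
        return sent
```

Dropping oldest-first is a policy choice: for most telemetry, recent readings matter more than stale ones, but critical alerts should bypass the queue entirely via a redundant path.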

Need help building this?

Our team specializes in turning these ideas into production systems. Let's talk.