June 18, 2026

Gesture-based touchless infotainment using Arm Ethos-U NPU

Learn how an open-source project uses Arm Ethos-U65 to deliver real-time, offline gesture recognition for automotive infotainment with on-device AI that protects privacy end-to-end

By Fidel Makatia

The problem: interaction complexity in modern infotainment systems  

Modern vehicles increasingly rely on touchscreen infotainment systems to provide navigation, communication, and media controls. While these interfaces offer flexibility and rich functionality, they can require drivers to shift visual attention and manually interact with controls.  

The National Highway Traffic Safety Administration (NHTSA) reports that distracted driving remains a significant safety concern and contributes to thousands of fatalities each year. As vehicle interfaces continue to evolve, improving how drivers interact with these systems remains an important area of focus.  

Voice assistants help reduce manual interaction, but their performance can vary in noisy environments. Some implementations also depend on cloud connectivity, which may introduce latency or limit availability in low-connectivity scenarios.  

These considerations highlight an opportunity to explore complementary interaction methods that operate in real time, preserve user privacy, and function reliably without continuous connectivity, while remaining usable, safe, and cost-effective.

The challenge: reducing driver distraction in modern infotainment systems  

Modern in-vehicle infotainment systems increasingly rely on touch-based interfaces, which in some scenarios may require drivers to shift their attention away from the road. The NHTSA reported that distracted driving contributed to 3,308 fatalities in 2022, highlighting the broader need for safer and more intuitive interaction models inside the vehicle.

Existing interaction methods, such as voice assistants, can help reduce manual input. However, they can face limitations in certain environments such as in noisy cabins, when connectivity is limited, or where latency impacts responsiveness. Cloud-based gesture recognition approaches can introduce additional considerations, including network latency, typically 100–500 ms, and data privacy implications when processing video streams remotely.  

This creates an opportunity to explore alternative interaction approaches. One such approach is on-device gesture-based control, which can support real-time responsiveness, operate without cloud dependency, and keep data processing local. This project is not intended to replace existing systems. Instead, it demonstrates what is possible when edge AI is combined with efficient hardware to support new, complementary user interaction experiences.  

Why it matters: privacy, speed, and always-on control  

Touchless gesture control is more than a convenience feature; it can also serve as a safety-relevant interface for next-generation vehicles. Drivers can change tracks, adjust volume, and accept or reject phone calls without looking away from the road.

The Arm Ethos-U NPU enables complex hand landmark models to execute in milliseconds, fundamentally changing what is possible for in-cabin gesture recognition at the edge.  

Comparison of cloud-based systems vs. edge AI solution  

| Challenge | Cloud-Based Systems | Our Edge AI Solution |
|---|---|---|
| Privacy | Cabin video streamed to remote servers | All processing on-device |
| Latency | 100–500 ms network round-trip | <10 ms NPU inference |
| Reliability | Fails without connectivity | Works 100% offline |
| Cost | Ongoing cloud compute charges | One-time hardware cost |
| Data Security | Vulnerable to interception in transit | No network exposure |
| Power | Requires cellular modem active | <1 W NPU consumption |

The solution: Real-time gesture recognition powered by Arm Ethos-U NPU  

This open-source project implements a touchless infotainment HMI running on the NXP i.MX93 FRDM board with the Arm Ethos-U65 NPU. A USB camera captures hand gestures, which are classified into 8 distinct commands that control media playback, phone calls, and volume through a web-based dashboard.

Touchless infotainment system recognizing an open-palm gesture to control media playback.

Key implementation highlights  

  • Ethos-U NPU acceleration: A 2-stage pipeline runs entirely on the Ethos-U65 NPU using INT8 quantization and Arm's Vela compiler. The pipeline performs palm detection followed by 21-point landmark extraction. Combined inference completes in about 6 ms, which supports real-time processing at more than 30 FPS.
  • 8-Gesture vocabulary with temporal filtering: The system recognizes 8 distinct gestures through geometric analysis of hand landmarks:  
    • Open Palm → Play or Pause  
    • Fist → Mute or Unmute  
    • Swipe Left or Right → Previous or Next Track  
    • Thumbs Up or Down → Accept or Reject Call  
    • Pinch In or Out → Volume Down or Up  
  • Privacy-focused architecture: All video processing occurs on-device. The system does not transmit video frames externally. A Gaussian blur is applied to the entire camera frame except the detected hand region. Faces and bodies are never visible, including on the live dashboard.  
  • Multiple output modes: A responsive web dashboard with live camera feed, media player, phone interface, and NPU performance metrics, all accessible from any browser on the local network.  
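
The gesture vocabulary listed above can be expressed as a simple lookup table. The following is a minimal sketch, not the project's actual code: the `Gesture` enum, the command strings, and `command_for` are hypothetical names chosen for illustration, while the gesture-to-action pairs come from the list above.

```python
from enum import Enum


class Gesture(Enum):
    """The 8 gestures named in the article (identifiers are illustrative)."""
    OPEN_PALM = "open_palm"
    FIST = "fist"
    SWIPE_LEFT = "swipe_left"
    SWIPE_RIGHT = "swipe_right"
    THUMBS_UP = "thumbs_up"
    THUMBS_DOWN = "thumbs_down"
    PINCH_IN = "pinch_in"
    PINCH_OUT = "pinch_out"


# Gesture-to-action pairs follow the article; the command strings
# themselves are hypothetical labels.
GESTURE_COMMANDS = {
    Gesture.OPEN_PALM: "play_pause",
    Gesture.FIST: "mute_toggle",
    Gesture.SWIPE_LEFT: "previous_track",
    Gesture.SWIPE_RIGHT: "next_track",
    Gesture.THUMBS_UP: "accept_call",
    Gesture.THUMBS_DOWN: "reject_call",
    Gesture.PINCH_IN: "volume_down",
    Gesture.PINCH_OUT: "volume_up",
}


def command_for(gesture):
    """Return the command for a gesture, or None if it is not mapped."""
    return GESTURE_COMMANDS.get(gesture)
```

Keeping the mapping in one table makes it easy to remap gestures without touching the recognition logic.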

Hardware setup  

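The Ethos-U65 NPU delivers up to 1 TOPS (tera operations per second) of neural network inference performance in its largest 512-MAC configuration; the 256-MAC configuration used on this board provides roughly half of that at the same clock. Its low power consumption makes it well suited to always-on in-cabin gesture monitoring.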

Hardware specifications

| Component | Specification |
|---|---|
| NPU | Arm Ethos-U65 (256 MAC units) |
| Processor | Arm Cortex-A55 + Cortex-M33 |
| Board | NXP i.MX93 FRDM |
| Camera | USB Webcam (640x480) |
| Display | HDMI Monitor (optional) |
| Memory | 2 GB RAM |
| Storage | MicroSD or eMMC |

Arm Ethos-U NPU: The key enabler  

The Ethos-U65 NPU is designed for efficient edge neural network inference. Key advantages include:  

  • Optimized operator support: The Arm Vela compiler maps TensorFlow Lite operations directly to NPU hardware. Depthwise convolutions, standard convolutions, fully connected layers, pooling operations, and activation functions all execute without CPU fallback.
  • Memory efficiency: The Ethos-U architecture minimizes data movement between memory and compute units. The combined gesture models require about 3.5 MB of memory and fit within the NPU's memory constraints.  
  • Power efficiency: Dedicated NPU hardware consumes less than 1W. This is important for always-on automotive systems that must operate within thermal and power constraints.  

Model optimization with Vela  

The hand detection models are prepared for Ethos-U execution using Arm's Vela compiler:  

vela hand_landmark_int8.tflite \
    --accelerator-config ethos-u65-256 \
    --system-config Ethos_U65_High_End \
    --memory-mode Dedicated_Sram \
    --optimise Performance

Vela analyzes the model graph, fuses operations where possible, and generates optimized command streams for the NPU. For these models, compilation achieves 100% NPU operator coverage, so no operators fall back to the CPU.
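
Because the pipeline has two models, the same Vela invocation has to be repeated for each one. The sketch below builds that command line in Python using only the flags shown above. The helper name `build_vela_command` and the palm detection filename are assumptions made for illustration; the repository's own `compile_models.sh` may do this differently.

```python
import shlex

# Flags copied from the Vela command shown in the article.
VELA_FLAGS = [
    "--accelerator-config", "ethos-u65-256",
    "--system-config", "Ethos_U65_High_End",
    "--memory-mode", "Dedicated_Sram",
    "--optimise", "Performance",
]


def build_vela_command(model_path):
    """Return the Vela argument list for one INT8 .tflite model."""
    return ["vela", model_path, *VELA_FLAGS]


if __name__ == "__main__":
    # "palm_detection_int8.tflite" is a hypothetical filename.
    for model in ("palm_detection_int8.tflite", "hand_landmark_int8.tflite"):
        # Print the command; pass the list to subprocess.run() to execute it.
        print(shlex.join(build_vela_command(model)))
```

Building the command as a list rather than a string avoids shell quoting problems if a model path contains spaces.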

Performance metrics  

| Metric | Value | Notes |
|---|---|---|
| Palm Detection Inference | ≈3 ms | INT8 on Ethos-U65-256 |
| Hand Landmark Inference | ≈3 ms | INT8 on Ethos-U65-256 |
| Total Pipeline | ≈6 ms | Both models combined |
| End-to-End Latency | <50 ms | Camera to gesture action |
| Frame Rate | 30+ FPS | Real-time processing |
| Combined Model Size | ≈3.5 MB | Vela-optimized INT8 |
| NPU Power Consumption | <1 W | Typical operation |
| Gesture Accuracy | >90% | Controlled conditions |
| False Trigger Rate | <5% | With 300 ms temporal filter |
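
To put the inference figures in context, the short calculation below compares the approximate 6 ms pipeline time with the time available per frame at 30 FPS. It is a back-of-the-envelope sketch that uses the approximate values from the table; the function names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def npu_share(inference_ms, fps):
    """Fraction of the per-frame budget spent on NPU inference."""
    return inference_ms / frame_budget_ms(fps)


PALM_MS = 3.0      # approximate palm detection time from the table
LANDMARK_MS = 3.0  # approximate hand landmark time from the table

pipeline_ms = PALM_MS + LANDMARK_MS
budget_ms = frame_budget_ms(30)

print(f"Pipeline: {pipeline_ms:.1f} ms of a {budget_ms:.1f} ms frame budget "
      f"({npu_share(pipeline_ms, 30):.0%})")
```

At 30 FPS each frame has about 33 ms, so NPU inference uses under a fifth of the budget and leaves roughly 27 ms for capture, pre-processing, gesture logic, and rendering.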

Gesture recognition algorithm  

The algorithm analyzes 21 hand landmarks to classify gestures using a combination of geometric analysis and temporal filtering. Gestures fall into 2 categories:  

Static gestures are classified from finger extension states in a single frame. Dynamic gestures track landmark motion across multiple frames.  

The system triggers when either condition is met:  

  • Instant gestures (swipes, pinches): The system confirms these gestures immediately because they are inherently temporal and self-filtering  
  • Hold gestures (palm, fist, thumbs): Users must hold these gestures continuously for 300ms before the system triggers them 
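
The 300 ms hold requirement can be sketched as a small state machine. This is an illustrative implementation, not the project's code: the class name `HoldFilter`, the use of timestamps in seconds, and the choice to fire only once per continuous hold are all assumptions.

```python
class HoldFilter:
    """Confirm a static gesture only after it is held continuously."""

    def __init__(self, hold_seconds=0.3):
        self.hold_seconds = hold_seconds  # 300 ms, per the article
        self._current = None              # gesture seen on the last frame
        self._since = None                # time the current gesture started
        self._fired = False               # whether this hold already fired

    def update(self, gesture, now):
        """Feed one per-frame label (or None) and a timestamp in seconds.

        Returns the gesture once when the hold time is reached, else None.
        """
        if gesture != self._current:
            # The gesture changed, so restart the hold timer.
            self._current = gesture
            self._since = now
            self._fired = False
            return None
        if gesture is None or self._fired:
            return None
        if now - self._since >= self.hold_seconds:
            self._fired = True  # assumption: trigger once per hold
            return gesture
        return None
```

For example, feeding "fist" at t = 0.0, 0.1, and 0.35 seconds returns None, None, and then "fist"; any change in the detected gesture resets the timer.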

Detection logic thresholds  

| Gesture | Detection Logic | Threshold |
|---|---|---|
| Open Palm | Extended finger count | >= 4 fingers |
| Fist | Extended finger count | 0 fingers |
| Thumbs Up | Thumb only extended + Y position | Tip above MCP joint |
| Thumbs Down | Thumb only extended + Y position | Tip below MCP joint |
| Swipe Left/Right | Wrist horizontal displacement | > 0.15 normalized, velocity > 0.02/frame |
| Pinch In | Thumb-index distance decreasing | Crosses below 0.06 |
| Pinch Out | Thumb-index distance increasing | Crosses above 0.12 |
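
The two pinch thresholds can be read as threshold crossings on the thumb-to-index distance. The sketch below is one possible interpretation under that assumption: it compares each frame's distance with the previous frame's and reports a pinch only on the frame where a threshold is crossed. The names `PinchDetector` and `thumb_index_distance` are hypothetical.

```python
import math

PINCH_IN_THRESHOLD = 0.06   # from the table: crosses below 0.06
PINCH_OUT_THRESHOLD = 0.12  # from the table: crosses above 0.12


def thumb_index_distance(thumb_tip, index_tip):
    """Euclidean distance between two normalized (x, y) points."""
    return math.hypot(thumb_tip[0] - index_tip[0], thumb_tip[1] - index_tip[1])


class PinchDetector:
    """Report a pinch on the frame where the distance crosses a threshold."""

    def __init__(self):
        self._prev = None  # distance on the previous frame

    def update(self, distance):
        prev, self._prev = self._prev, distance
        if prev is None:
            return None
        if prev >= PINCH_IN_THRESHOLD and distance < PINCH_IN_THRESHOLD:
            return "pinch_in"
        if prev <= PINCH_OUT_THRESHOLD and distance > PINCH_OUT_THRESHOLD:
            return "pinch_out"
        return None
```

Using separate thresholds for the two directions leaves a dead band between 0.06 and 0.12, so small jitter in the landmark positions does not toggle between volume up and volume down.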

Finger extension detection  

Extended: Finger tip Y < Finger PIP Y - margin  
Curled:   Finger tip Y >= Finger PIP Y - margin  
Thumb:    Tip-to-wrist distance > MCP-to-wrist distance  
  
Open Palm = [T:ext, I:ext, M:ext, R:ext, P:ext]  
Fist      = [T:curl, I:curl, M:curl, R:curl, P:curl]  
Thumbs Up = [T:ext, I:curl, M:curl, R:curl, P:curl] + thumb.tip.y < thumb.mcp.y 
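
The rules above can be turned into a small static classifier. The sketch below is illustrative only. It assumes the 21 landmarks follow the common MediaPipe Hands index layout (wrist 0, thumb MCP 2, thumb tip 4, finger PIP/tip pairs 6/8, 10/12, 14/16, 18/20), that coordinates are normalized image coordinates with y increasing downward, and that the margin is 0.02; none of these values are confirmed by the article.

```python
import math

# Assumed landmark indices (MediaPipe Hands convention).
WRIST, THUMB_MCP, THUMB_TIP = 0, 2, 4
FINGER_TIP_PIP = {"index": (8, 6), "middle": (12, 10),
                  "ring": (16, 14), "pinky": (20, 18)}
MARGIN = 0.02  # assumed value; the article does not state the margin


def _dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def finger_states(lm, margin=MARGIN):
    """Map each finger name to True (extended) or False (curled)."""
    # Thumb: tip farther from the wrist than the MCP joint.
    states = {"thumb": _dist(lm[THUMB_TIP], lm[WRIST])
              > _dist(lm[THUMB_MCP], lm[WRIST])}
    # Other fingers: tip above the PIP joint by at least the margin.
    for name, (tip, pip) in FINGER_TIP_PIP.items():
        states[name] = lm[tip][1] < lm[pip][1] - margin
    return states


def classify_static(lm, margin=MARGIN):
    """Return 'open_palm', 'fist', 'thumbs_up', 'thumbs_down', or None."""
    states = finger_states(lm, margin)
    others = [states[name] for name in FINGER_TIP_PIP]
    if states["thumb"] and not any(others):
        if lm[THUMB_TIP][1] < lm[THUMB_MCP][1]:
            return "thumbs_up"    # tip above the MCP joint
        if lm[THUMB_TIP][1] > lm[THUMB_MCP][1]:
            return "thumbs_down"  # tip below the MCP joint
        return None
    count = sum(states.values())
    if count >= 4:
        return "open_palm"
    if count == 0:
        return "fist"
    return None
```

Here `lm` is a list of 21 `(x, y)` tuples. The thumbs check runs before the finger count so that a lone extended thumb is not dropped as an unrecognized one-finger pose.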

Swipe detection  

Conditions: 

  • Horizontal wrist displacement > 0.15 (normalized)
  • Vertical displacement < 50% of horizontal (directional constraint)  
  • Average velocity > 0.02 per frame
  • Measured over 8-15 frame window  
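
The four swipe conditions can be combined into a sliding-window check on the wrist position. The following is a minimal sketch under stated assumptions: the `SwipeDetector` class is hypothetical, velocity is taken as horizontal displacement divided by the number of frame intervals, and increasing x is treated as a swipe to the right (a mirrored camera feed would reverse this).

```python
from collections import deque

MIN_FRAMES, MAX_FRAMES = 8, 15  # window length from the article
MIN_DISPLACEMENT = 0.15         # normalized horizontal displacement
MIN_VELOCITY = 0.02             # normalized units per frame
MAX_VERTICAL_RATIO = 0.5        # vertical < 50% of horizontal


class SwipeDetector:
    """Detect horizontal swipes from per-frame wrist positions."""

    def __init__(self):
        self._history = deque(maxlen=MAX_FRAMES)

    def update(self, wrist_x, wrist_y):
        """Add one frame; return 'swipe_left', 'swipe_right', or None."""
        self._history.append((wrist_x, wrist_y))
        points = list(self._history)
        # Try every window length from 8 frames up to what is buffered.
        for n in range(MIN_FRAMES, len(points) + 1):
            x0, y0 = points[-n]
            dx, dy = wrist_x - x0, wrist_y - y0
            if (abs(dx) > MIN_DISPLACEMENT
                    and abs(dy) < MAX_VERTICAL_RATIO * abs(dx)
                    and abs(dx) / (n - 1) > MIN_VELOCITY):
                self._history.clear()  # avoid re-triggering on the same motion
                # Assumption: increasing x means the hand moved right.
                return "swipe_right" if dx > 0 else "swipe_left"
        return None
```

Checking several window lengths lets a quick swipe register even when it follows a period in which the hand was still.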

Get started  

Repository: https://github.com/fidel-makatia/Gesture_detection_on_Ethos_NPU 

Quick start  

  1. Clone the repository to your Ethos-U enabled board.
  2. Install dependencies: pip3 install numpy opencv-python flask
  3. Verify NPU is available: ls /dev/ethosu0
  4. Download models: ./scripts/download_models.sh
  5. Compile for NPU: ./scripts/compile_models.sh
  6. Run the system: cd deploy && python3 app.py
  7. Open dashboard: http://<board-ip>:5000
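
The NPU check in step 3 can also be done from Python before starting the application. This is a small illustrative helper, not part of the repository; the function name `npu_available` is hypothetical, and the device path is the one used in the steps above.

```python
import os


def npu_available(device_path="/dev/ethosu0"):
    """Return True if the Ethos-U device node exists on this system."""
    return os.path.exists(device_path)


if __name__ == "__main__":
    if npu_available():
        print("Ethos-U device found at /dev/ethosu0")
    else:
        print("Ethos-U device not found; check the board image and driver")
```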

About the author  

Fidel Makatia, PhD, Texas A&M University Distinguished Arm Ambassador, specializes in integrated circuit design and optimized machine learning model deployment on Arm-based hardware. He is a former Autodesk software engineer with deep expertise in hardware.


Re-use is only permitted for informational and non-commercial or personal use only.