Gesture-based touchless infotainment using Arm Ethos-U NPU
Learn how an open-source project uses Arm Ethos-U65 to deliver real-time, offline gesture recognition for automotive infotainment with on-device AI that protects privacy end-to-end.

The problem: interaction complexity in modern infotainment systems
Modern vehicles increasingly rely on touchscreen infotainment systems to provide navigation, communication, and media controls. While these interfaces offer flexibility and rich functionality, they can require drivers to shift visual attention and manually interact with controls.
The National Highway Traffic Safety Administration (NHTSA) reports that distracted driving remains a significant safety concern and contributes to thousands of fatalities each year. As vehicle interfaces continue to evolve, improving how drivers interact with these systems remains an important area of focus.
Voice assistants help reduce manual interaction, but their performance can vary in noisy environments. Some implementations also depend on cloud connectivity, which may introduce latency or limit availability in low-connectivity scenarios.
These considerations highlight an opportunity to explore complementary interaction methods. These methods can operate in real time, preserve user privacy, and function reliably without continuous connectivity whilst also maintaining a focus on usability, safety, and cost-effectiveness.
The challenge: reducing driver distraction in modern infotainment systems
Modern in-vehicle infotainment systems increasingly rely on touch-based interfaces, which in some scenarios may require drivers to shift their attention away from the road. The NHTSA reported that distracted driving contributed to 3,308 fatalities in 2022, highlighting the broader need for safer and more intuitive interaction models inside the vehicle.
Existing interaction methods, such as voice assistants, can help reduce manual input. However, they can face limitations in certain environments such as in noisy cabins, when connectivity is limited, or where latency impacts responsiveness. Cloud-based gesture recognition approaches can introduce additional considerations, including network latency, typically 100–500 ms, and data privacy implications when processing video streams remotely.
This creates an opportunity to explore alternative interaction approaches. One such approach is on-device gesture-based control, which can support real-time responsiveness, operate without cloud dependency, and keep data processing local. This project is not intended to replace existing systems. Instead, it demonstrates what is possible when edge AI is combined with efficient hardware to support new, complementary user interaction experiences.
Why it matters: privacy, speed, and always-on control
Touchless gesture control is not just a convenience feature. It is a safety-critical interface for next-generation vehicles. Drivers can change tracks, adjust volume, and accept or reject phone calls without looking away from the road.
The Arm Ethos-U NPU enables complex hand landmark models to execute in milliseconds, fundamentally changing what is possible for in-cabin gesture recognition at the edge.
Comparison of cloud-based systems vs. edge AI solution
| Challenge | Cloud-Based Systems | Our Edge AI Solution |
| --- | --- | --- |
| Privacy | Cabin video streamed to remote servers | All processing on-device |
| Latency | 100-500ms network round-trip | <10ms NPU inference |
| Reliability | Fails without connectivity | Works 100% offline |
| Cost | Ongoing cloud compute charges | One-time hardware cost |
| Data Security | Vulnerable to interception in transit | No network exposure |
| Power | Requires cellular modem active | <1W NPU consumption |
The solution: Real-time gesture recognition powered by Arm Ethos-U NPU
This open-source project implements a touchless infotainment HMI running on the NXP i.MX93 FRDM board with the Arm Ethos-U65 NPU. A USB camera captures hand gestures, which are classified into 8 distinct commands that control media playback, phone calls, and volume through a web-based dashboard.

Key implementation highlights
- Ethos-U NPU acceleration: A 2-stage pipeline runs entirely on the Ethos-U65 NPU using INT8 quantization and Arm's Vela compiler. The pipeline performs palm detection followed by 21-point landmark extraction. Combined inference completes in about 6 ms, which supports real-time processing at more than 30 FPS.
- 8-gesture vocabulary with temporal filtering: The system recognizes 8 distinct gestures through geometric analysis of hand landmarks:
  - Open Palm → Play or Pause
  - Fist → Mute or Unmute
  - Swipe Left or Right → Previous or Next Track
  - Thumbs Up or Down → Accept or Reject Call
  - Pinch In or Out → Volume Down or Up
- Privacy-focused architecture: All video processing occurs on-device. The system does not transmit video frames externally. A Gaussian blur is applied to the entire camera frame except the detected hand region. Faces and bodies are never visible, including on the live dashboard.
- Multiple output modes: A responsive web dashboard with live camera feed, media player, phone interface, and NPU performance metrics, all accessible from any browser on the local network.
Hardware setup
The Ethos-U65 NPU delivers up to 1 TOPS (tera operations per second) of neural network inference performance with minimal power consumption, making it ideal for always-on in-cabin gesture monitoring.
Hardware specifications
| Component | Specification |
| --- | --- |
| NPU | Arm Ethos-U65 (256 MAC units) |
| Processor | Arm Cortex-A55 + Cortex-M33 |
| Board | NXP i.MX93 FRDM |
| Camera | USB Webcam (640x480) |
| Display | HDMI Monitor (optional) |
| Memory | 2GB RAM |
| Storage | MicroSD or eMMC |
Arm Ethos-U NPU: The key enabler
The Ethos-U65 NPU is designed for efficient edge neural network inference. Key advantages include:
- Optimized operator support: The Arm Vela compiler maps TensorFlow Lite operations directly to NPU hardware. Depthwise convolutions, standard convolutions, fully connected layers, pooling operations, and activation functions all execute without CPU fallback.
- Memory efficiency: The Ethos-U architecture minimizes data movement between memory and compute units. The combined gesture models require about 3.5 MB of memory and fit within the NPU's memory constraints.
- Power efficiency: Dedicated NPU hardware consumes less than 1W. This is important for always-on automotive systems that must operate within thermal and power constraints.
Model optimization with Vela
The hand detection models are prepared for Ethos-U execution using Arm's Vela compiler:
vela hand_landmark_int8.tflite \
--accelerator-config ethos-u65-256 \
--system-config Ethos_U65_High_End \
--memory-mode Dedicated_Sram \
--optimise Performance
Vela analyzes the model graph and fuses operations where possible, generating optimized command streams for the NPU. For these models, the compilation achieves 100% NPU operator coverage and requires no CPU fallback operations.
Performance metrics
| Metric | Value | Notes |
| --- | --- | --- |
| Palm Detection Inference | ≈3ms | INT8 on Ethos-U65-256 |
| Hand Landmark Inference | ≈3ms | INT8 on Ethos-U65-256 |
| Total Pipeline | ≈6ms | Both models combined |
| End-to-End Latency | <50ms | Camera to gesture action |
| Frame Rate | 30+ FPS | Real-time processing |
| Combined Model Size | ≈3.5MB | Vela-optimized INT8 |
| NPU Power Consumption | <1W | Typical operation |
| Gesture Accuracy | >90% | Controlled conditions |
| False Trigger Rate | <5% | With 300ms temporal filter |
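As a rough illustration of what these figures imply, the short calculation below compares the quoted inference times with the time available per frame. It assumes a 30 FPS camera and that the two models run back to back; the function names are hypothetical and are not taken from the project.

```python
# Figures quoted in the performance table above.
PALM_MS = 3.0       # palm detection inference
LANDMARK_MS = 3.0   # hand landmark inference
TARGET_FPS = 30     # assumed camera frame rate


def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame at the given frame rate."""
    return 1000.0 / fps


def npu_share(inference_ms: float, fps: float) -> float:
    """Fraction of the per-frame budget spent on NPU inference."""
    return inference_ms / frame_budget_ms(fps)


pipeline_ms = PALM_MS + LANDMARK_MS
budget_ms = frame_budget_ms(TARGET_FPS)
print(f"Frame budget: {budget_ms:.1f} ms")
print(f"NPU inference: {pipeline_ms:.0f} ms ({npu_share(pipeline_ms, TARGET_FPS):.0%} of budget)")
```

Under these assumptions, inference uses less than a fifth of each frame period, which leaves the remainder for capture, pre-processing, gesture logic, and dashboard updates.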
Gesture recognition algorithm
The algorithm analyzes 21 hand landmarks to classify gestures using a combination of geometric analysis and temporal filtering. Gestures fall into 2 categories:
Static gestures are classified from finger extension states in a single frame. Dynamic gestures track landmark motion across multiple frames.
The system triggers when either condition is met:
- Instant gestures (swipes, pinches): The system confirms these gestures immediately because they are inherently temporal and self-filtering
- Hold gestures (palm, fist, thumbs): Users must hold these gestures continuously for 300ms before the system triggers them
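The two trigger rules can be sketched as a small filter. This is a minimal illustration, not the project's code: the class name, the gesture labels, and the choice to fire a held gesture only once per hold are all assumptions. It also assumes the upstream detector reports an instant gesture once, on the frame where it completes.

```python
from typing import Optional

HOLD_MS = 300  # hold duration quoted in the text
INSTANT = {"swipe_left", "swipe_right", "pinch_in", "pinch_out"}  # hypothetical labels


class TemporalFilter:
    """Pass instant gestures through; require hold gestures to persist."""

    def __init__(self, hold_ms: float = HOLD_MS) -> None:
        self.hold_ms = hold_ms
        self._candidate: Optional[str] = None
        self._since_ms = 0.0
        self._fired = False

    def update(self, gesture: Optional[str], now_ms: float) -> Optional[str]:
        """Return the gesture to act on for this frame, or None."""
        if gesture in INSTANT:
            # Dynamic gestures are confirmed immediately.
            self._candidate = None
            self._fired = False
            return gesture
        if gesture != self._candidate:
            # A new candidate (or no hand): restart the hold timer.
            self._candidate = gesture
            self._since_ms = now_ms
            self._fired = False
            return None
        held_long_enough = now_ms - self._since_ms >= self.hold_ms
        if gesture is not None and not self._fired and held_long_enough:
            self._fired = True  # assumed: fire once per continuous hold
            return gesture
        return None
```

Calling `update()` once per frame with the raw classification and a millisecond timestamp yields at most one action per held pose, which is one way to keep a briefly glimpsed fist from muting playback.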
Detection logic thresholds
| Gesture | Detection Logic | Threshold |
| --- | --- | --- |
| Open Palm | Extended finger count | >= 4 fingers |
| Fist | Extended finger count | 0 fingers |
| Thumbs Up | Thumb only extended + Y position | Tip above MCP joint |
| Thumbs Down | Thumb only extended + Y position | Tip below MCP joint |
| Swipe Left/Right | Wrist horizontal displacement | > 0.15 normalized, velocity > 0.02/frame |
| Pinch In | Thumb-index distance decreasing | Crosses below 0.06 |
| Pinch Out | Thumb-index distance increasing | Crosses above 0.12 |
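The pinch rows describe threshold crossings rather than absolute positions. One way to express that is to compare the thumb-index distance in consecutive frames, as in the sketch below. The class and label names are hypothetical, and landmarks are assumed to be normalized (x, y) pairs.

```python
import math
from typing import Optional, Tuple

PINCH_IN_BELOW = 0.06   # threshold from the table above
PINCH_OUT_ABOVE = 0.12  # threshold from the table above

Point = Tuple[float, float]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two normalized 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


class PinchDetector:
    """Report a pinch when the thumb-index distance crosses a threshold."""

    def __init__(self) -> None:
        self._prev: Optional[float] = None

    def update(self, thumb_tip: Point, index_tip: Point) -> Optional[str]:
        d = distance(thumb_tip, index_tip)
        prev, self._prev = self._prev, d
        if prev is None:
            return None  # need two frames to see a crossing
        if prev >= PINCH_IN_BELOW > d:
            return "pinch_in"   # distance fell through the lower threshold
        if prev <= PINCH_OUT_ABOVE < d:
            return "pinch_out"  # distance rose through the upper threshold
        return None
```

Because the two thresholds are separated, small jitter around either value is less likely to bounce between pinch-in and pinch-out.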
Finger extension detection
Extended: Finger tip Y < Finger PIP Y - margin
Curled: Finger tip Y >= Finger PIP Y - margin
Thumb: Tip-to-wrist distance > MCP-to-wrist distance
Open Palm = [T:ext, I:ext, M:ext, R:ext, P:ext]
Fist = [T:curl, I:curl, M:curl, R:curl, P:curl]
Thumbs Up = [T:ext, I:curl, M:curl, R:curl, P:curl] + thumb.tip.y < thumb.mcp.y
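The rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation. The landmark indices follow the common 21-point hand layout (wrist 0, thumb MCP 2, thumb tip 4, and so on), the `MARGIN` value is a placeholder, and image coordinates are assumed to have Y increasing downward. Open Palm uses the ">= 4 fingers" threshold from the detection table.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]

# Assumed landmark indices (common 21-point hand layout).
WRIST, THUMB_MCP, THUMB_TIP = 0, 2, 4
FINGERS = {"index": (6, 8), "middle": (10, 12), "ring": (14, 16), "pinky": (18, 20)}  # (PIP, TIP)
MARGIN = 0.02  # assumed value; the source does not state the margin


def finger_states(lm: Sequence[Point], margin: float = MARGIN) -> Dict[str, bool]:
    """Return True for each extended digit, following the rules above."""
    def dist(i: int, j: int) -> float:
        return math.hypot(lm[i][0] - lm[j][0], lm[i][1] - lm[j][1])

    # Thumb: extended when the tip is farther from the wrist than the MCP joint.
    states = {"thumb": dist(THUMB_TIP, WRIST) > dist(THUMB_MCP, WRIST)}
    for name, (pip, tip) in FINGERS.items():
        # Other fingers: extended when the tip is above the PIP joint by a margin.
        states[name] = lm[tip][1] < lm[pip][1] - margin
    return states


def classify_static(lm: Sequence[Point]) -> Optional[str]:
    """Map finger states to a static gesture label, or None if no match."""
    s = finger_states(lm)
    count = sum(s.values())
    if count >= 4:
        return "open_palm"
    if count == 0:
        return "fist"
    if s["thumb"] and not any(s[name] for name in FINGERS):
        tip_above_mcp = lm[THUMB_TIP][1] < lm[THUMB_MCP][1]
        return "thumbs_up" if tip_above_mcp else "thumbs_down"
    return None
```

The tip-versus-PIP test assumes a roughly upright hand. A rotated hand would need the comparison made along the finger's own axis.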
Swipe detection
Conditions:
- Horizontal wrist displacement > 0.15 (normalized)
- Vertical displacement < 50% of horizontal (directional constraint)
- Average velocity > 0.02 per frame
- Measured over 8-15 frame window
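A minimal sketch of these four conditions, using a sliding window of wrist positions, might look like the following. The class name and labels are hypothetical. Mapping positive X displacement to "swipe right" is also an assumption, since a mirrored camera feed would reverse it.

```python
from collections import deque
from typing import Deque, Optional, Tuple

MIN_FRAMES, MAX_FRAMES = 8, 15  # window size from the conditions above
MIN_DX = 0.15                   # minimum horizontal displacement (normalized)
MAX_DY_RATIO = 0.5              # vertical motion must stay under 50% of horizontal
MIN_VELOCITY = 0.02             # minimum average horizontal velocity per frame


class SwipeDetector:
    """Detect horizontal swipes from a sliding window of wrist positions."""

    def __init__(self) -> None:
        self._track: Deque[Tuple[float, float]] = deque(maxlen=MAX_FRAMES)

    def update(self, wrist_x: float, wrist_y: float) -> Optional[str]:
        self._track.append((wrist_x, wrist_y))
        n = len(self._track)
        if n < MIN_FRAMES:
            return None  # not enough history yet
        x0, y0 = self._track[0]
        dx = wrist_x - x0
        dy = wrist_y - y0
        velocity = abs(dx) / (n - 1)  # average displacement per frame
        if abs(dx) > MIN_DX and abs(dy) < MAX_DY_RATIO * abs(dx) and velocity > MIN_VELOCITY:
            self._track.clear()  # avoid re-triggering on the same motion
            return "swipe_right" if dx > 0 else "swipe_left"
        return None
```

Clearing the window after a detection is a design choice made here so that one physical swipe produces one track change. The source does not say how the project handles repeats.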
Get started
Repository: https://github.com/fidel-makatia/Gesture_detection_on_Ethos_NPU
Quick start
- Clone the repository to your Ethos-U enabled board.
- Install dependencies: pip3 install numpy opencv-python flask
- Verify NPU is available: ls /dev/ethosu0
- Download models: ./scripts/download_models.sh
- Compile for NPU: ./scripts/compile_models.sh
- Run the system: cd deploy && python3 app.py
- Open dashboard: http://<board-ip>:5000
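The NPU check in the steps above can also be done from Python before the application starts. The helper below is a hypothetical pre-flight check, not part of the repository. It only tests that the device node named in the quick start exists.

```python
import os

NPU_DEVICE = "/dev/ethosu0"  # device node from the quick-start steps


def npu_available(device_path: str = NPU_DEVICE) -> bool:
    """Return True if the Ethos-U device node is present on this system."""
    return os.path.exists(device_path)


if __name__ == "__main__":
    if npu_available():
        print(f"Found {NPU_DEVICE}")
    else:
        print(f"{NPU_DEVICE} not found; check that the Ethos-U driver is loaded")
```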
Resources
- Arm Ethos-U NPU Documentation
- Vela Compiler for Ethos-U
- ExecuTorch: On-Device AI Framework
- NXP i.MX93 Applications Processor
- Arm NN SDK
- ML Inference Advisor
About the author
Fidel Makatia, PhD, Texas A&M University Distinguished Arm Ambassador, specializes in integrated circuits design and optimized machine learning model deployment on Arm-based hardware. Former Autodesk Software Engineer with deep expertise in hardware.
Re-use is permitted for informational and non-commercial or personal use only.
