May 28, 2026

ASL sign language to text using Arm Ethos-U NPU

Edge AI ASL recognition on Arm Ethos-U65 delivers low-latency, offline sign language translation with on-device privacy.

By Fidel Makatia

The problem

The World Health Organization estimates that over 70 million deaf people worldwide use sign language as their primary communication method. However, most hearing people do not understand sign language, which creates a persistent communication barrier in healthcare, education, public services, and daily interactions. Existing solutions do not meet these needs. Human interpreters are expensive and scarce: the Registry of Interpreters for the Deaf reports severe shortages across the United States. Cloud-based sign language recognition apps require a constant internet connection, can add 200 ms to 2 s of latency that breaks conversational flow, and raise privacy concerns because they stream video of private conversations to remote servers. This challenge calls for a sign language recognition system that operates in real time and works fully offline. The system must preserve the privacy of both the signer and the viewer, all within the power and cost constraints of an embedded device.

Why it matters

Accessibility should not depend on the cloud. Sign language recognition is not a novelty feature. It is a critical accessibility tool. A deaf person at a hospital reception, a student in a classroom, or a customer at a service counter should be able to communicate without relying on interpreter availability, internet connectivity, or cloud infrastructure. The Arm Ethos-U NPU enables a full CNN classification model to execute ASL alphabet recognition in under 4 ms. This is fast enough to classify each video frame in real time while consuming minimal power.

Comparison: Cloud-based systems vs. Edge AI solution  

| Challenge | Cloud-Based Systems | Edge AI Solution |
| --- | --- | --- |
| Privacy | Video of sign language streamed to remote servers | All processing on-device |
| Latency | 200 ms–2 s network round-trip | ~3.5 ms NPU inference |
| Reliability | Fails without connectivity | Works 100% offline |
| Cost | Ongoing API/subscription fees | One-time hardware cost |
| Conversational Flow | Latency breaks natural pace | Real-time response |
| Data Security | Sensitive video stored on third-party servers | Never leaves the device |

The solution

Real-time ASL recognition powered by Arm Ethos-U NPU. This open-source project implements real-time American Sign Language alphabet recognition on the NXP i.MX93 FRDM board with the Arm Ethos-U65 NPU. A USB camera captures hand signs, which are classified into 29 distinct classes: the 26 letters plus space, delete, and nothing. Detected letters accumulate into readable text on a live web dashboard.

Live ASL recognition dashboard with blurred background, hand detection overlay, detected letter output, and real-time inference metrics.

Key implementation highlights 

  • Ethos-U NPU acceleration: A CNN classification model runs on the Ethos-U65 NPU via the Ethos-U delegate with INT8 quantization and the Arm Vela compiler. Inference completes in about 3.5 ms per frame, which enables real-time classification at more than 20 FPS.
  • 29-character ASL alphabet with temporal filtering: The system recognizes all 26 ASL alphabet letters and three control classes through CNN-based image classification. A–Z appends a letter to the text output, space inserts a space, delete removes the last character, and nothing indicates that no hand is present, so no action is taken.
  • Privacy-first architecture: All video processing occurs on-device. No frames are transmitted externally. The system applies a Gaussian blur to the full camera frame except the hand region of interest (ROI). Faces and surroundings are never visible, including on the live dashboard.
  • Live video dashboard: The responsive web interface includes an MJPEG video feed with a blurred background and a sharp ROI box with a detection overlay, real-time text accumulation, an alphabet grid that highlights the active letter, and emoji-annotated terminal output. You can access the dashboard from any browser on the local network.

Terminal output showing real-time ASL inference logs with confidence scores, response times, and API status messages from the embedded system.

Hardware setup

The Ethos-U65 NPU delivers up to 1 TOPS (Tera Operations Per Second) of neural network inference performance with minimal power consumption. This makes it suitable for always-on accessibility devices that must respond instantly to sign language input.

Hardware specifications

| Component | Specification |
| --- | --- |
| NPU | Arm Ethos-U65 (256 MAC units) |
| Processor | Arm Cortex-A55 + Cortex-M33 |
| Board | NXP i.MX93 FRDM |
| Camera | USB Webcam (640x480) |
| Display | HDMI Monitor (optional) |
| Memory | 2 GB RAM |
| Storage | MicroSD or eMMC |

Arm Ethos-U NPU: the key enabler  

The Ethos-U65 NPU is designed for efficient edge neural network inference. For sign language recognition, its advantages are particularly compelling:  

  • Optimized operator support: The Arm Vela compiler maps TensorFlow Lite operations directly to NPU hardware. Standard Convolutions, Batch Normalization, MaxPooling, and activation functions all execute on the NPU without CPU fallback. The Conv2D and pooling layers that perform most of the computation run entirely on dedicated NPU silicon.  
  • Memory efficiency: The Ethos-U architecture minimizes data movement between memory and compute units. During inference, the ASL model uses just 158 KiB of SRAM and 359 KiB of DRAM. This fits within the NPU memory constraints.  
  • Power efficiency: Dedicated NPU hardware consumes under 1 W. This is critical for portable accessibility devices and embedded systems that operate within strict thermal and power budgets.  

Model optimization with Vela

The ASL classification model is prepared for Ethos-U execution using Arm's Vela compiler:  

vela asl_model_int8.tflite \
    --accelerator-config ethos-u65-256 \
    --system-config Ethos_U65_High_End \
    --memory-mode Dedicated_Sram \
    --config vela.ini \
    --output-dir /tmp/

 

Vela analyzes the model graph and fuses operations where possible, generating optimized command streams for the NPU. The compilation report confirms high NPU utilization:  

Network summary for asl_model_int8  
  Total SRAM used           158.55 KiB  
  Total DRAM used           359.20 KiB  
  NPU operators              43 (89.6%)  
  CPU operators                5 (10.4%)  
  Batch Inference time       0.29 ms  
 

The 5 CPU operators are Flatten reshape operations and Fully Connected layers. The reshapes only rearrange data and involve no arithmetic, while the Fully Connected matrix multiplications run on the Cortex-A55 CPU. All convolution, batch normalization, and pooling layers run on the NPU.
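
The operator coverage figure can be reproduced from the two operator counts in the summary. The sketch below parses the summary lines exactly as they are printed above and recomputes the percentage. The helper name `operator_counts` is hypothetical, and the report text is pasted in as a string rather than read from a Vela output file.

```python
import re

# Summary lines copied from the compilation report shown above.
REPORT = """\
Network summary for asl_model_int8
  Total SRAM used           158.55 KiB
  Total DRAM used           359.20 KiB
  NPU operators              43 (89.6%)
  CPU operators                5 (10.4%)
"""


def operator_counts(report):
    """Extract the NPU and CPU operator counts from a Vela summary string."""
    pattern = re.compile(r"^\s*(NPU|CPU) operators\s+(\d+)", re.MULTILINE)
    return {kind: int(count) for kind, count in pattern.findall(report)}


counts = operator_counts(REPORT)
total = counts["NPU"] + counts["CPU"]
npu_coverage = 100.0 * counts["NPU"] / total
print(f"{counts['NPU']}/{total} operators on the NPU ({npu_coverage:.1f}%)")
# prints: 43/48 operators on the NPU (89.6%)
```

A check like this could sit in a build script to flag a model change that pushes more operators back onto the CPU.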

The --memory-mode Dedicated_Sram flag is required for correct NPU results on the i.MX93 platform. Using Shared_Sram produces models that load and execute without errors but generate incorrect inference output. This failure mode can be difficult to diagnose. 

Performance metrics

| Metric | Value | Notes |
| --- | --- | --- |
| NPU Inference Time | ~3.5 ms | INT8 on Ethos-U65-256 |
| NPU Operator Coverage | 89.6% | 43/48 ops on NPU |
| CPU Operators | 5 | Flatten reshape + Dense layers |
| End-to-End Latency | <50 ms | Camera to text update |
| Frame Rate | 20+ FPS | Real-time processing |
| Model Size | 4.5 MB | INT8 quantized |
| NPU Power Consumption | <1 W | Typical operation |
| Supported Characters | 29 | A–Z + space, delete, nothing |
| Confidence Threshold | 70% | Configurable per deployment |
| Temporal Filter | 3 frames | Consecutive same-letter requirement |

Sign language recognition pipeline  

The system follows a multi-stage pipeline optimized for real-time operation on resource-constrained hardware:  

Image preprocessing 

Diagram of the ASL image preprocessing pipeline showing USB camera input, ROI extraction with Gaussian blur, image resizing to 32 × 32 BGR, and INT8 quantization before inference.

The model was trained with OpenCV cv2.imread(), which reads images in BGR format. The inference pipeline preserves this format. Frames from cv2.VideoCapture are used directly without converting the image to RGB. Consistent image formatting between training and inference is required for correct results. Converting to RGB before inference produces incorrect classifications.  
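
The preprocessing steps can be sketched as below. This is an illustration under stated assumptions, not the project's code: the ROI coordinates, the [0, 1] pixel scaling, and the `scale` and `zero_point` values are all assumed. In a real deployment the quantization parameters would be read from the model's input tensor details. The sketch uses `cv2.resize` when OpenCV is installed and otherwise falls back to a plain nearest-neighbour resize so that it still runs.

```python
import numpy as np

try:
    import cv2  # used for resizing when OpenCV is installed
except ImportError:  # fallback keeps the sketch runnable without OpenCV
    cv2 = None

INPUT_SIZE = 32  # model input is 32 x 32 BGR


def preprocess(frame_bgr, roi, scale, zero_point):
    """Crop the ROI, resize to 32 x 32, and quantize to INT8.

    The channel order is left as BGR on purpose: there is no RGB conversion.
    """
    x, y, w, h = roi
    hand = frame_bgr[y:y + h, x:x + w]
    if cv2 is not None:
        small = cv2.resize(hand, (INPUT_SIZE, INPUT_SIZE),
                           interpolation=cv2.INTER_AREA)
    else:
        # Nearest-neighbour stand-in for cv2.resize.
        rows = np.arange(INPUT_SIZE) * h // INPUT_SIZE
        cols = np.arange(INPUT_SIZE) * w // INPUT_SIZE
        small = hand[rows[:, None], cols]
    # Assumption: the model was trained on pixel values scaled to [0, 1].
    real = small.astype(np.float32) / 255.0
    quantized = np.round(real / scale + zero_point)
    quantized = np.clip(quantized, -128, 127).astype(np.int8)
    return quantized[np.newaxis, ...]  # add batch dimension: (1, 32, 32, 3)


# A plain white frame stands in for a camera capture.
frame = np.full((480, 640, 3), 255, dtype=np.uint8)
tensor = preprocess(frame, roi=(220, 140, 200, 200),
                    scale=1 / 255, zero_point=-128)
print(tensor.shape, tensor.dtype)  # (1, 32, 32, 3) int8
```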

Temporal filtering  

Raw frame-by-frame classification can be noisy. Hand transitions, motion blur, and ambiguous poses can produce flickering predictions. The system applies temporal filtering to improve detection stability and reduce unintended inputs:  

  • A letter is only registered when it appears with at least 70% confidence for 3 consecutive frames.  
  • The nothing class is never registered because it indicates that no hand is present.  
  • The space class inserts a space character. The del class removes the last character.  
  • This filtering reduces false triggers while maintaining responsive detection. 
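
The rules above can be sketched as a small state machine. The class name `TemporalFilter` and its interface are hypothetical. The sketch also assumes one behaviour the rules leave open: a sign that is held for longer than three frames is registered only once, so repeating a letter requires breaking the streak first.

```python
class TemporalFilter:
    """Register a class only after N consecutive confident predictions."""

    def __init__(self, threshold=0.70, required_frames=3):
        self.threshold = threshold
        self.required_frames = required_frames
        self.text = ""
        self._last = None
        self._streak = 0

    def update(self, label, confidence):
        """Feed one frame's top prediction; return the label if registered."""
        # Low confidence or an empty ROI breaks any streak in progress.
        if confidence < self.threshold or label == "nothing":
            self._last, self._streak = None, 0
            return None
        if label == self._last:
            self._streak += 1
        else:
            self._last, self._streak = label, 1
        # Fire exactly once, on the frame that completes the streak.
        if self._streak != self.required_frames:
            return None
        if label == "space":
            self.text += " "
        elif label == "del":
            self.text = self.text[:-1]
        else:
            self.text += label
        return label


f = TemporalFilter()
for label, conf in [("A", 0.9), ("A", 0.8), ("A", 0.95), ("A", 0.9),
                    ("B", 0.9), ("B", 0.5), ("B", 0.9), ("B", 0.9), ("B", 0.9)]:
    f.update(label, conf)
print(f.text)  # AB
```

In this run the fourth "A" does not repeat the letter, and the low-confidence "B" frame resets the count, so "B" is registered only after three further confident frames. At 20 FPS, three consecutive frames span roughly 150 ms.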

CNN model architecture  

The classification model is a standard CNN trained on the Kaggle ASL Alphabet dataset:  

Diagram of the CNN model architecture showing convolution, pooling, batch normalization, dropout, flatten, and dense layers for ASL character classification.

The model accepts 32 × 32 BGR images and outputs a 29-class softmax probability distribution. INT8 post-training quantization was performed from the original Keras HDF5 model, which preserves the correct BatchNorm running statistics. The model was converted with the TFLite converter using representative dataset calibration. The quantized output is already a valid probability distribution, so no additional softmax is applied during inference.
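
Reading the quantized output then only requires dequantizing it and taking the highest-scoring class. The sketch below is illustrative. It assumes the class order follows the alphabetical folder names of the Kaggle ASL Alphabet dataset, and it assumes an output scale of 1/256 with a zero point of -128, which is the usual parameterization for an INT8 TFLite softmax output. Both should be checked against the actual model.

```python
import string

import numpy as np

# Assumed class order: alphabetical folder names of the Kaggle dataset.
LABELS = list(string.ascii_uppercase) + ["del", "nothing", "space"]


def decode_output(q_output, scale=1 / 256, zero_point=-128):
    """Dequantize an INT8 softmax output and return (label, confidence)."""
    probs = (np.asarray(q_output, dtype=np.float32) - zero_point) * scale
    idx = int(np.argmax(probs))  # no extra softmax: values are probabilities
    return LABELS[idx], float(probs[idx])


# Hand-made output vector standing in for a real inference result.
q = np.full(29, -128, dtype=np.int8)
q[2] = 100
label, confidence = decode_output(q)
print(label, confidence)  # C 0.890625
```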

Quick start  

  • Clone the repository to your Ethos-U enabled board: ASL_to_Text_with_Ethos_U65_NPU.  
  • Install dependencies: pip3 install numpy opencv-python flask.  
  • Verify NPU is available: ls /dev/ethosu0.  
  • Copy the model and compile with Vela using --memory-mode Dedicated_Sram.  
  • Run the system: python3 -u app.py.  
  • Open dashboard: http://<board-ip>:5004.  
  • Position your hand inside the ROI box and sign ASL letters.  

Resources

About the author

Fidel Makatia, PhD, Texas A&M University Distinguished Arm Ambassador, specializes in integrated circuit design and optimized machine learning model deployment on Arm-based hardware. He is a former Autodesk software engineer with deep expertise in hardware IP, including SerDes and NPUs, and an extensive embedded systems background.



Re-use is only permitted for informational and non-commercial or personal use only.
