Internet of Things (IoT) blog

May 28, 2026

ASL sign language to text using Arm Ethos-U NPU

Edge AI ASL recognition on Arm Ethos-U65 delivers low-latency, offline sign language translation with on-device privacy.

By Fidel Makatia

The problem

The World Health Organization estimates that over 70 million deaf people worldwide use sign language as their primary communication method. However, most hearing people do not understand sign language, which creates a persistent communication barrier in healthcare, education, public services, and daily interactions. Existing solutions do not meet these needs. Human interpreters are expensive and scarce: the Registry of Interpreters for the Deaf reports severe shortages across the United States. Cloud-based sign language recognition apps require a constant internet connection, can add 200 ms to 2 s of latency that breaks conversational flow, and raise privacy concerns because they stream video of private conversations to remote servers. This challenge calls for a sign language recognition system that operates in real time and works fully offline. The system must preserve the privacy of both the signer and the viewer, all within the power and cost constraints of an embedded device.

Why it matters

Accessibility should not depend on the cloud. Sign language recognition is not a novelty feature. It is a critical accessibility tool. A deaf person at a hospital reception, a student in a classroom, or a customer at a service counter should be able to communicate without relying on interpreter availability, internet connectivity, or cloud infrastructure. The Arm Ethos-U NPU enables a full CNN classification model to execute ASL alphabet recognition in under 4 ms. This speed is fast enough to classify each video frame in real time while consuming minimal power. 

Comparison: Cloud-based systems vs. Edge AI solution 

Challenge	Cloud-Based Systems	Edge AI Solution
Privacy	Video of sign language streamed to remote servers	All processing on-device
Latency	200ms–2s network round-trip	~3.5ms NPU inference
Reliability	Fails without connectivity	Works 100% offline
Cost	Ongoing API/subscription fees	One-time hardware cost
Conversational Flow	Latency breaks natural pace	Real-time response
Data Security	Sensitive video stored on third-party servers	Never leaves the device

The solution

This open-source project implements real-time American Sign Language (ASL) alphabet recognition on the NXP i.MX93 FRDM board, powered by the Arm Ethos-U65 NPU. A USB camera captures hand signs, which are classified into 29 distinct classes: the 26 letters plus space, delete, and nothing. Detected letters accumulate into readable text on a live web dashboard.

Live ASL recognition dashboard with blurred background, hand detection overlay, detected letter output, and real-time inference metrics.

Key implementation highlights 

Ethos-U NPU acceleration: A CNN classification model runs on the Ethos-U65 NPU via the Ethos-U delegate with INT8 quantization and the Arm Vela compiler. Inference completes in about 3.5 ms per frame, which enables real-time classification at more than 20 FPS.
29-character ASL alphabet with temporal filtering: The system recognizes all 26 ASL alphabet letters and 3 control characters through CNN-based image classification. A–Z appends letters to the text output, space inserts a space, delete removes the last character, and nothing indicates that no hand is present, so no action is taken.
Privacy-first architecture: All video processing occurs on-device. No frames are transmitted externally. The system applies a Gaussian blur to the full camera frame except the hand region-of-interest (ROI). Faces and surroundings are never visible, including on the live dashboard.
Live video dashboard: The responsive web interface includes an MJPEG video feed with a blurred background and a sharp ROI box with a detection overlay, real-time text accumulation, an alphabet grid highlighting the active letter, and emoji-annotated terminal output. You can access the dashboard from any browser on the local network.
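
The way the 29 classes map onto the text output can be sketched in a few lines of Python. This is an illustrative sketch rather than the project's code: the function name apply_label is hypothetical, and the label strings "space", "del", and "nothing" are assumed from the class names used in this article.

```python
def apply_label(text: str, label: str) -> str:
    """Return the text buffer after applying one recognized class label.

    Assumed label names: single letters "A".."Z", plus "space", "del", "nothing".
    """
    if label == "nothing":
        return text          # no hand present: leave the buffer unchanged
    if label == "space":
        return text + " "
    if label == "del":
        return text[:-1]     # slicing an empty string is safe
    if len(label) == 1 and "A" <= label.upper() <= "Z":
        return text + label.upper()
    raise ValueError(f"unknown label: {label!r}")


buffer = ""
for label in ["H", "I", "space", "X", "del", "nothing"]:
    buffer = apply_label(buffer, label)
print(repr(buffer))
```

Keeping this step as a pure function of the buffer and the label makes it easy to test separately from the camera and the NPU.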

Terminal output showing real-time ASL inference logs with confidence scores, response times, and API status messages from the embedded system.

Hardware setup

The Ethos-U65 NPU delivers up to 1 TOPS (Tera Operations Per Second) of neural network inference performance with minimal power consumption. This makes it suitable for always-on accessibility devices that must respond instantly to sign language input. 

Hardware specifications

Component	Specification
NPU	Arm Ethos-U65 (256 MAC units)
Processor	Arm Cortex-A55 + Cortex-M33
Board	NXP i.MX93 FRDM
Camera	USB Webcam (640x480)
Display	HDMI Monitor (optional)
Memory	2GB RAM
Storage	MicroSD or eMMC

Arm Ethos-U NPU: the key enabler 

The Ethos-U65 NPU is designed for efficient edge neural network inference. For sign language recognition, its advantages are particularly compelling: 

Optimized operator support: The Arm Vela compiler maps TensorFlow Lite operations directly to NPU hardware. Standard convolutions, batch normalization, max pooling, and activation functions all execute on the NPU without CPU fallback. The Conv2D and pooling layers that perform most of the computation run entirely on dedicated NPU silicon.
Memory efficiency: The Ethos-U architecture minimizes data movement between memory and compute units. During inference, the ASL model uses just 158 KiB of SRAM and 359 KiB of DRAM. This fits within the NPU memory constraints. 
Power efficiency: Dedicated NPU hardware consumes under 1 W. This is critical for portable accessibility devices and embedded systems that operate within strict thermal and power budgets.

Model optimization with Vela

The ASL classification model is prepared for Ethos-U execution using Arm's Vela compiler: 

vela asl_model_int8.tflite \
    --accelerator-config ethos-u65-256 \
    --system-config Ethos_U65_High_End \
    --memory-mode Dedicated_Sram \
    --config vela.ini \
    --output-dir /tmp/

Vela analyzes the model graph and fuses operations where possible, generating optimized command streams for the NPU. The compilation report confirms high NPU utilization: 

Network summary for asl_model_int8 
  Total SRAM used           158.55 KiB 
  Total DRAM used           359.20 KiB 
  NPU operators              43 (89.6%) 
  CPU operators                5 (10.4%) 
  Batch Inference time       0.29 ms 
 

The 5 CPU operators are the Flatten reshape operations and the 2 Fully Connected layers. The reshapes only rearrange data and involve no arithmetic, while the Fully Connected layers run as matrix multiplications on the Cortex-A55 CPU. All convolution, batch normalization, and pooling layers run on the NPU.

The --memory-mode Dedicated_Sram flag is required for correct NPU results on the i.MX93 platform. Using Shared_Sram produces models that load and execute without errors but generate incorrect inference output. This failure mode can be difficult to diagnose. 
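
Because this failure is silent, one way to catch it is to run the same frames through a CPU reference and through the NPU, then compare the predicted classes. The sketch below shows only the comparison step. The function name top1_agreement and the 0.95 threshold are assumptions, and collecting the two sets of output vectors is left to the deployment.

```python
from typing import Sequence


def top1_agreement(reference: Sequence[Sequence[float]],
                   candidate: Sequence[Sequence[float]]) -> float:
    """Fraction of samples where both runs pick the same top-1 class."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("need two non-empty batches of equal length")

    def argmax(scores: Sequence[float]) -> int:
        return max(range(len(scores)), key=lambda i: scores[i])

    matches = sum(argmax(r) == argmax(c) for r, c in zip(reference, candidate))
    return matches / len(reference)


# Toy 3-class outputs standing in for real 29-class vectors.
cpu_out = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.20, 0.20, 0.60]]
npu_ok = [[0.88, 0.07, 0.05], [0.12, 0.79, 0.09], [0.25, 0.15, 0.60]]
npu_bad = [[0.30, 0.40, 0.30], [0.50, 0.20, 0.30], [0.40, 0.35, 0.25]]

MIN_AGREEMENT = 0.95  # assumed threshold; tune per deployment
print("good build passes:", top1_agreement(cpu_out, npu_ok) >= MIN_AGREEMENT)
print("bad build passes:", top1_agreement(cpu_out, npu_bad) >= MIN_AGREEMENT)
```

A check like this is cheap to run once after each Vela compilation, before the model is deployed.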

Performance metrics

Metric	Value	Notes
NPU Inference Time	~3.5ms	INT8 on Ethos-U65-256
NPU Operator Coverage	89.6%	43/48 ops on NPU
CPU Operators	5	Flatten reshape + Dense layers
End-to-End Latency	<50ms	Camera to text update
Frame Rate	20+ FPS	Real-time processing
Model Size	4.5 MB	INT8 quantized
NPU Power Consumption	<1W	Typical operation
Supported Characters	29	A–Z + space, delete, nothing
Confidence Threshold	70%	Configurable per deployment
Temporal Filter	3 frames	Consecutive same-letter requirement
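
Several of these figures follow from one another by simple arithmetic. The short calculation below uses only values from the table. It treats 20 FPS as a lower bound, and the variable names are illustrative.

```python
NPU_INFERENCE_MS = 3.5       # approximate, from the table
FRAME_RATE_FPS = 20          # lower bound ("20+ FPS")
TEMPORAL_FILTER_FRAMES = 3   # consecutive same-letter frames
NPU_OPS, TOTAL_OPS = 43, 48

frame_period_ms = 1000 / FRAME_RATE_FPS                        # time budget per frame
npu_share = NPU_INFERENCE_MS / frame_period_ms                 # fraction of the budget spent on the NPU
filter_window_ms = TEMPORAL_FILTER_FRAMES * frame_period_ms    # three frame periods
coverage_pct = 100 * NPU_OPS / TOTAL_OPS                       # operator coverage

print(f"frame period:   {frame_period_ms:.0f} ms")
print(f"NPU share:      {npu_share:.0%} of each frame period")
print(f"filter window:  {filter_window_ms:.0f} ms at {FRAME_RATE_FPS} FPS")
print(f"NPU coverage:   {coverage_pct:.1f}%")
```

At 20 FPS, NPU inference takes up only a small part of each 50 ms frame period, and a sign must be held steady for roughly three frame periods (about 150 ms) before it registers.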

Sign language recognition pipeline 

The system follows a multi-stage pipeline optimized for real-time operation on resource-constrained hardware: 

Image preprocessing 

Diagram of the ASL image preprocessing pipeline showing USB camera input, ROI extraction with Gaussian blur, image resizing to 32 × 32 BGR, and INT8 quantization before inference.

The model was trained with OpenCV cv2.imread(), which reads images in BGR format. The inference pipeline preserves this format. Frames from cv2.VideoCapture are used directly without converting the image to RGB. Consistent image formatting between training and inference is required for correct results. Converting to RGB before inference produces incorrect classifications. 
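
The preprocessing steps can be sketched with NumPy alone. This is a minimal sketch under stated assumptions, not the project's implementation. The ROI coordinates are made up. The scale and zero_point values are placeholders that a real deployment would read from the interpreter's input details. Scaling pixels to [0, 1] before quantization is an assumption about how the model was trained. A nearest-neighbour resize stands in for cv2.resize so that the example runs without OpenCV or a camera.

```python
import numpy as np


def preprocess(frame_bgr: np.ndarray, roi: tuple, scale: float,
               zero_point: int, size: int = 32) -> np.ndarray:
    """Crop the ROI, resize to size x size, and quantize to INT8.

    The channel order is never changed, so a BGR frame stays BGR.
    """
    x, y, w, h = roi
    crop = frame_bgr[y:y + h, x:x + w]

    # Nearest-neighbour resize (stand-in for cv2.resize).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    small = crop[rows[:, None], cols[None, :]]   # shape (size, size, 3)

    # ASSUMPTION: the model expects pixels scaled to [0, 1].
    real = small.astype(np.float32) / 255.0
    q = np.round(real / scale) + zero_point
    q = np.clip(q, -128, 127).astype(np.int8)
    return q[np.newaxis, ...]                    # add batch dimension


# Synthetic 640x480 frame in place of a cv2.VideoCapture read.
frame = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
tensor = preprocess(frame, roi=(220, 140, 200, 200), scale=1 / 255, zero_point=-128)
print(tensor.shape, tensor.dtype)
```

The function contains no channel swap on purpose: adding a BGR-to-RGB conversion here would reintroduce the mismatch described above.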

Temporal filtering 

Raw frame-by-frame classification can be noisy. Hand transitions, motion blur, and ambiguous poses can produce flickering predictions. The system applies temporal filtering to improve detection stability and reduce unintended inputs: 

A letter is only registered when it appears with at least 70% confidence for 3 consecutive frames. 
The nothing class is never registered because it indicates that no hand is present. 
The space class inserts a space character. The del class removes the last character. 
This filtering reduces false triggers while maintaining responsive detection.
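
The filtering rule above can be sketched as a small state machine. The class name TemporalFilter and its method names are hypothetical. The article does not say whether a held sign repeats, so the sketch assumes it registers once and then waits for a different prediction.

```python
from typing import Optional


class TemporalFilter:
    """Register a label only after `required` consecutive confident frames."""

    def __init__(self, threshold: float = 0.70, required: int = 3):
        self.threshold = threshold
        self.required = required
        self._last: Optional[str] = None
        self._streak = 0

    def update(self, label: str, confidence: float) -> Optional[str]:
        """Feed one frame's prediction; return the label if it registers, else None."""
        if label == "nothing" or confidence < self.threshold:
            self._last, self._streak = None, 0   # break the streak
            return None
        if label == self._last:
            self._streak += 1
        else:
            self._last, self._streak = label, 1
        # ASSUMPTION: fire exactly once, when the streak first reaches `required`.
        return label if self._streak == self.required else None


frames = [("A", 0.90), ("A", 0.85), ("B", 0.95), ("B", 0.90), ("B", 0.60),
          ("B", 0.90), ("B", 0.80), ("B", 0.75), ("B", 0.90), ("nothing", 0.99)]
flt = TemporalFilter()
registered = [out for out in (flt.update(l, c) for l, c in frames) if out]
print(registered)
```

In this trace, "A" never registers because it appears in only two consecutive frames, and the low-confidence "B" frame resets the count, so "B" registers only after three confident frames in a row.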

CNN model architecture 

The classification model is a standard CNN trained on the Kaggle ASL Alphabet dataset: 

Diagram of the CNN model architecture showing convolution, pooling, batch normalization, dropout, flatten, and dense layers for ASL character classification.

The model accepts 32 × 32 BGR images and outputs a 29-class softmax probability distribution. INT8 post-training quantization was performed from the original Keras HDF5 model. This preserves the correct BatchNorm running statistics. The model was converted with the TFLite converter using representative dataset calibration. Once dequantized, the output is already a valid probability distribution, so no additional softmax is applied during inference.
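
Reading a prediction from the INT8 output then comes down to dequantizing with p = scale × (q − zero_point) and taking the largest value. The sketch below assumes a scale of 1/256 and a zero point of −128, the values TensorFlow Lite uses for INT8 softmax outputs. A deployment should read both from the interpreter's output details. The label order and the function name decode_output are hypothetical.

```python
import string

# Hypothetical label order; the real order depends on how the model was trained.
LABELS = list(string.ascii_uppercase) + ["del", "nothing", "space"]


def decode_output(q_out, scale=1 / 256, zero_point=-128):
    """Dequantize INT8 scores and return (label, probability) of the top class."""
    probs = [scale * (q - zero_point) for q in q_out]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs[best]


# Toy output vector: class index 2 ("C") dominates.
q_out = [-128] * 29
q_out[0] = -100
q_out[2] = 100
label, prob = decode_output(q_out)
print(label, round(prob, 3))
```

The returned probability can be compared directly against the 70% confidence threshold without any further normalization.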

Quick start 

Clone the repository to your Ethos-U enabled board: ASL_to_Text_with_Ethos_U65_NPU. 
Install dependencies: pip3 install numpy opencv-python flask. 
Verify NPU is available: ls /dev/ethosu0. 
Copy the model and compile with Vela using --memory-mode Dedicated_Sram. 
Run the system: python3 -u app.py. 
Open dashboard: http://<board-ip>:5004. 
Position your hand inside the ROI box and sign ASL letters.

Resources

About the author

Fidel Makatia, PhD, Texas A&M University Distinguished Arm Ambassador, specializes in integrated circuits design and optimized machine learning model deployment on Arm-based hardware. Former Autodesk Software Engineer with deep expertise in hardware IP including SerDes and NPUs, plus extensive embedded systems background. 

Re-use is only permitted for informational and non-commercial or personal use only.