Overview

Not every machine learning task runs on an edge device. Some tasks, such as offline video captioning or podcast transcription, are not time-critical and are therefore well-suited to running in the data center, where the extra compute performance available speeds them up significantly.

This guide shows you how to set up client-server speech transcription, deployed as a service running on cloud-hosted Arm servers. You record an audio file on your client machine and upload it to the server. The Arm-based server runs a speech recognition service that uses machine learning to convert your speech to text, and then sends the text back to your client machine.

Important

Deploying a server in the cloud is not free, and you will need to pay a small amount to packet.net to complete this guide. 

Before you begin

This is a technical deployment walkthrough using Ubuntu 16.04, so some familiarity with the command line, Linux package managers, and SSH is assumed. No knowledge of machine learning is necessary.

The installations and builds that are described in this guide can take several hours to complete, but once everything is installed the service starts up very quickly.

Ensure that your PC has a working microphone as you will need to record your voice for the transcription service to work.

Before you start this guide, you need to create an account at packet.net. Your account may take some time to be verified and costs $1 to create. Once your account is verified, you need to do the following before you can deploy an Arm server:

  1. Create a new packet.net project.
  2. Generate an SSH key pair and add the public key to your project. This allows you to log in securely to the server. Follow the packet.net instructions on how to do this.
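
Step 2 can be sketched with the OpenSSH ssh-keygen tool. This is a minimal sketch rather than the packet.net procedure itself: the key type, the "packet-demo" comment, the file name, and the scratch output directory are all assumptions, chosen so that the example cannot overwrite a key you already have.

```shell
# Generate an RSA key pair in a scratch directory (an assumed location, so that
# existing keys in ~/.ssh are never overwritten by this example).
key_dir="$(mktemp -d)"
key_file="$key_dir/packet_demo_key"

# -N "" sets an empty passphrase to keep the sketch non-interactive;
# use a real passphrase for a key you intend to keep.
ssh-keygen -q -t rsa -b 4096 -N "" -C "packet-demo" -f "$key_file"

# The .pub file is the half that you paste into the project settings.
cat "$key_file.pub"
```

If you keep a key outside the default location like this, pass it explicitly when you connect, for example with ssh -i followed by the path to the private key.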

Packet is a paid-for cloud-based computing service which provides bare-metal servers. You will need to provide payment details on sign-up. This guide uses the Type 2A server (Cavium Thunder X), which costs $0.50 per hour as of April 2018. The computing cost for this guide is approximately $3.

Deploy an Arm server

  1. Log into packet.net and either create a new project or open an existing one.
  2. Click on the Servers tab and then select the Deploy servers button.
  3. Enter a suitable hostname. Note that the hostname is for your reference only and does not need to be tied to a registered domain.
  4. From the Location dropdown, choose one of these options:
    • NRT1
    • SJC1
    • EWR1
    Arm-based servers are only available in these locations.
  5. From the Type dropdown, select a c1.large.arm server.
  6. From the OS dropdown, select Ubuntu 16.04 LTS.
  7. Select Deploy Servers. This process takes approximately 5-10 minutes.

    New servers are created with the SSH public key that you provided in your project settings. You will need the matching private key to log in after the server boots. Check the key now, or add an extra one by clicking the "SSH & user data" options button.

    Once your server has booted, its IP address is shown on the Servers page.
  8. Open a command line and log in to the server using this command, replacing <ip address> with the IP address shown on the Servers page:
    ssh root@<ip address>
    If you can log in successfully, the server is ready. You will use this command again later in this guide to deploy and run a machine learning demo on your server.

Build an ML framework for Arm

The framework you choose may depend on the application you wish to run. This example uses Baidu's DeepSpeech 2, a state-of-the-art speech recognition system that provides very high-quality models for both English and Chinese.

DeepSpeech 2 is built on Baidu's PaddlePaddle framework. Although less well-known than TensorFlow, it is just as easy to build and configure on an Armv8 system.

The instructions provided here work on Ubuntu 16.04 LTS running on a packet.net Type 2A server. For reference, the official guide for building PaddlePaddle from source is here: https://github.com/PaddlePaddle/DeepSpeech.

Install dependencies

Most dependencies are already pre-built for Armv8. While logged in to the server, enter the following at a command line to obtain and install the dependencies from the standard repositories:

apt-get update
apt-get -y install python-dev python-pip python-numpy python-scipy python-wheel git cmake swig golang libfreetype6-dev libpng12-dev libopenblas-dev
pip install protobuf
git clone https://github.com/PaddlePaddle/recordio.git
cd recordio/python
./build.sh
pip install -e .
cd ../..
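
Before starting a build that takes hours, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch, not part of the official instructions. The function name missing_tools is hypothetical, and the list of commands is an assumption about which executables the packages above provide (for example, go from the golang package).

```shell
# Print the name of every listed command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands assumed to be provided by the packages installed above.
missing="$(missing_tools git cmake swig go python pip)"
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All build tools found."
fi
```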

Build PaddlePaddle

Building from source can take several hours. Once complete, you can copy the *.whl file and deploy it directly onto subsequent servers. To build from source:

git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
mkdir build
cd build
cmake -DWITH_GPU=OFF -DWITH_TESTING=OFF ..
make -j96
cd ../.. 
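
The -j96 option matches the 96 cores of the server used in this guide. If you build on a machine with a different core count, the job count can be derived instead of hard-coded. This is a small sketch under that assumption; build_jobs is a hypothetical helper name, and the example only prints the make command so that it is safe to try outside the build directory.

```shell
# Echo the number of online CPU cores, falling back to 1 if it cannot be found.
build_jobs() {
    jobs_count="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    echo "$jobs_count"
}

# Print the make invocation rather than running it.
echo "make -j$(build_jobs)"
```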

Install PaddlePaddle

Installing the built package is straightforward.

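  1. Display the contents of the Paddle/build/python/dist directory by entering the ls command, and check that it contains a paddlepaddle .whl file. Note the version number in the name of the file.
  2. Enter this command. The * wildcard matches the version number that you noted in step 1, so you do not need to type it: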

    pip install Paddle/build/python/dist/paddlepaddle*.whl
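
If the dist directory could contain more than one wheel, for example after a rebuild, the wildcard would pass several files to pip. The sketch below shows one way to guard against that. The function name find_wheel is hypothetical and is not part of PaddlePaddle.

```shell
# Echo the path of the single paddlepaddle wheel in the given directory.
# Fails if there is no match or more than one match.
find_wheel() {
    # Replace the function's arguments with the glob matches.
    set -- "$1"/paddlepaddle*.whl
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "expected exactly one paddlepaddle wheel" >&2
        return 1
    fi
    echo "$1"
}

# Example usage, from the directory that contains Paddle/ (not executed here):
#   pip install "$(find_wheel Paddle/build/python/dist)"
```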

Install DeepSpeech 2 for Arm

Baidu's DeepSpeech network provides state-of-the-art speech-to-text capabilities. The PaddlePaddle-based implementation comes with models that Baidu trained on its internal English speech dataset of more than 8,000 hours. Mandarin versions are also available.

Mozilla host a TensorFlow-based version of DeepSpeech, but the model files available for it are trained on small public datasets and offer significantly lower accuracy than Baidu's internally-trained ones.

The remainder of this section provides a condensed guide to installing DeepSpeech 2, tested on Ubuntu 16.04 LTS running on a packet.net Type 2A server. To install it on another platform, follow Baidu's general installation guide.

Install dependencies

Once you have a working PaddlePaddle installation, install the additional DeepSpeech dependencies. These are mostly audio codecs:

apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev libffi-dev

Build DeepSpeech

DeepSpeech's requirements.txt file specifies particular scipy and Cython versions, which will automatically be built from source. These builds can take longer than an hour, so while this is happening, download the models (see next section), which also takes a long time.

git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
bash setup.sh
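
Because this step can run for more than an hour, a dropped SSH connection would normally stop it. One option, which is an assumption and not part of the DeepSpeech instructions, is to start long-running commands with nohup and send their output to a log file. The helper name run_detached and the log file name setup.log are hypothetical.

```shell
# Start a command in the background so that it keeps running if the SSH
# session drops, sending all of its output to a log file.
run_detached() {
    log_file="$1"
    shift
    nohup "$@" >"$log_file" 2>&1 &
    echo "$!"   # PID of the background job
}

# Example usage from the DeepSpeech directory (not executed here):
#   run_detached setup.log bash setup.sh
#   tail -f setup.log
```

A terminal multiplexer such as tmux or screen would achieve the same result and also lets you reattach to the running session.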

Download models while building

The two downloads, the speech model and the language model, are large (400MB and 8GB), so it is useful to start downloading them while the previous build step is in progress. To do this, open a new command line, log in to the server using SSH as before, navigate to the directory that contains your DeepSpeech clone, and enter:

cd DeepSpeech/models/baidu_en8k
bash download_model.sh
cd ../lm
bash download_lm_en.sh
cd ../../
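
After both downloads finish, you can check that the files are in place before moving on. The sketch below tests for the four paths that the demo server command later in this guide passes to --mean_std_path, --vocab_path, --model_path, and --lang_model_path. It assumes that the download scripts place the files at exactly those paths, and models_ready is a hypothetical helper name.

```shell
# Return success only if every file the demo server loads is present and non-empty.
# The paths are relative to the DeepSpeech directory passed as the first argument.
models_ready() {
    for rel in \
        models/baidu_en8k/mean_std.npz \
        models/baidu_en8k/vocab.txt \
        models/baidu_en8k/params.tar.gz \
        models/lm/common_crawl_00.prune01111.trie.klm
    do
        if [ ! -s "$1/$rel" ]; then
            echo "missing or empty: $rel" >&2
            return 1
        fi
    done
}

# Example usage from the DeepSpeech directory (not executed here):
#   models_ready . && echo "models ready"
```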

Build speech manifest

The LibriSpeech manifest is used by the demo server to provide warm-up examples. Scripts to download it are provided:

cd data/librispeech
ln -s ../../data_utils data_utils
python librispeech.py --full_download=False
cd ../..

Install the speech-to-text demo

Install demo on client

You need to install the demo on your local machine. These instructions are for installing the client on a MacBook Pro. Although DeepSpeech must be cloned, it does not need to be built or installed on the client.

To install the demo, enter:

brew install portaudio
pip install pyaudio
pip install pynput
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech

The client listens for keypresses from the space and escape keys. This does not work on every device, and the MacBook Pro is one of the devices where it fails. To work around this, you can modify deploy/demo_client.py to use ctrl for record and shift for exit:

sed -i '' s/space/ctrl/g deploy/demo_client.py
sed -i '' 's/Key\.esc/Key\.shift/g' deploy/demo_client.py
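
The sed -i '' form above is specific to the BSD sed that ships with macOS. If your client runs Linux, GNU sed does not accept the separate empty argument. A form that both variants accept attaches a backup suffix directly to -i, as in this sketch. The function name remap_keys is hypothetical.

```shell
# Replace the key names in place. The -i.bak form is accepted by both
# BSD sed (macOS) and GNU sed (Linux) and leaves a .bak copy of the original.
remap_keys() {
    sed -i.bak -e 's/space/ctrl/g' -e 's/Key\.esc/Key.shift/g' "$1"
}

# Example usage from the DeepSpeech directory (not executed here):
#   remap_keys deploy/demo_client.py
```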

Prepare demo server

On the server, open port 8000 in the firewall so that the demo server can accept connections:

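apt-get -y install ufw
ufw allow ssh
ufw allow 8000
ufw enable

It is good practice to block ports that are not in use. Once it is enabled, UFW does this automatically by denying all other incoming connections. Because the ssh rule is added before the firewall is enabled, your SSH access is preserved.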

Start demo server

On the packet.net server, change to the DeepSpeech/ directory and enter the following command to start the demo server, replacing SERVER_IP with the IP address of the server:

CUDA_VISIBLE_DEVICES=0 \
 python -u deploy/demo_server.py \
 --host_ip='SERVER_IP' \
 --host_port=8000 \
 --num_conv_layers=2 \
 --num_rnn_layers=3 \
 --rnn_layer_size=1024 \
 --alpha=1.15 \
 --beta=0.15 \
 --cutoff_prob=1.0 \
 --cutoff_top_n=40 \
 --use_gru=True \
 --use_gpu=False \
 --share_rnn_weights=False \
 --speech_save_dir='demo_cache' \
 --mean_std_path='models/baidu_en8k/mean_std.npz' \
 --vocab_path='models/baidu_en8k/vocab.txt' \
 --model_path='models/baidu_en8k/params.tar.gz' \
 --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
 --decoding_method='ctc_beam_search' \
 --specgram_type='linear'
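
Once the server is running, you can check from the client that port 8000 is reachable before starting the demo client. This is a minimal sketch that assumes the client shell is bash, because it relies on the /dev/tcp pseudo-device that bash provides. The function name port_open is hypothetical.

```shell
# Return success if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp pseudo-device, so run it with bash rather than sh.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Example usage from the client (SERVER_IP is a placeholder, not executed here):
#   port_open SERVER_IP 8000 && echo "port 8000 reachable"
```

If a firewall silently drops the connection instead of refusing it, this check can wait for the TCP timeout before it returns.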

Run the speech-to-text demo

On your client machine, enter the following command from the DeepSpeech directory, replacing <SERVER_IP> with the IP address of the server:

python -u deploy/demo_client.py --host_ip '<SERVER_IP>' --host_port 8000

After the client has connected, press and hold space (or ctrl if you modified the client demo) to talk. When you release the key, a recording of your speech is sent to the server and processed, and the transcription is returned and printed. Processing takes around four times the length of the speech itself. Press escape (or shift) to exit.

Next steps

The Arm ecosystem provides robust support for many state-of-the-art machine learning frameworks and applications. This demo would not be suitable for interactive assistant speech recognition, because transcription takes around four times the length of the speech. However, with 96 cores available on a Cavium Thunder X server such as the one used here, about 24 hours of English or Mandarin speech can be transcribed with state-of-the-art accuracy in one hour of server time, at a cost of just $0.50!
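
The arithmetic behind the 24-hour figure can be sketched as follows. The guide does not spell out the calculation, so this rests on three assumptions: each transcription stream occupies one core, every stream runs at the 4x slowdown seen in the demo, and the $0.50 price applies to one hour of server time.

```shell
# Assumptions (not stated in the guide): one transcription stream per core,
# and every stream runs at the same 4x slowdown observed in the demo.
cores=96
slowdown=4            # seconds of compute per second of speech
price_per_hour=0.50   # USD per server hour, April 2018

# Hours of speech transcribed per hour of server time.
speech_hours=$((cores / slowdown))
echo "speech hours per server hour: $speech_hours"

# Cost of transcribing one hour of speech, in USD.
cost_per_speech_hour="$(awk -v p="$price_per_hour" -v h="$speech_hours" 'BEGIN { printf "%.4f", p / h }')"
echo "cost per speech hour: \$$cost_per_speech_hour"
```

Under these assumptions, 96 cores divided by a slowdown of 4 gives 24 hours of speech per server hour, or roughly two cents per hour of speech.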

More exciting use cases will continue to develop as an increasingly wide range of next-generation Arm servers become available on the cloud.

Watch this space!