Build an advanced Android application that converts speech audio into detailed images using Qualcomm’s AI Engine Direct SDK. This tutorial demonstrates real-time speech-to-text processing with Whisper and image generation with Stable Diffusion, all accelerated on Snapdragon hardware.
This tutorial showcases cutting-edge AI capabilities on mobile devices. You’ll build a chat-like interface where users can speak prompts to generate images using hardware-accelerated AI models. Reference code: https://github.com/edgeimpulse/example-android-inferencing/tree/main/qnn-genai-speech_to_image

What you’ll build

An Android application that:
  • Records speech input using Android’s audio APIs
  • Processes speech to text using Whisper model accelerated by QNN
  • Generates detailed images from text prompts using Stable Diffusion
  • Displays results in a chat-like interface
  • Runs entirely on-device with hardware acceleration
Time: 2-3 hours
Difficulty: Advanced

Architecture overview

The app uses a hybrid architecture:
  • Kotlin UI: Handles user interaction, audio recording, and display
  • C++ Backend: QNN APIs for model inference on Snapdragon HTP
  • JNI Bridge: Seamless communication between Kotlin and C++
  • Rust Libraries: Tokenization/detokenization for model inputs/outputs

Prerequisites

  • Edge Impulse account: Sign up
  • Trained model: Complete a tutorial first (for custom workflows)
  • Android Studio: 2024.3.1 or later
  • Android NDK: Version 27.2.12479018
  • Rust toolchain: Latest stable with Android targets
  • Qualcomm AI Engine Direct SDK: v2.40.0 (Download)
  • Python 3.10: For asset generation scripts
  • Snapdragon device with Hexagon NPU (see supported devices below)
This app requires significant setup and model assets. Ensure you have a Snapdragon device with adequate storage for the AI models (~500MB+).

Supported devices

Tested on Snapdragon 8 Elite. Compatible with devices featuring a Hexagon NPU Gen 2 or later:
Mobile:
  • Snapdragon 8 Gen 3/2/1
  • Snapdragon 7+ Gen 2/3
  • Snapdragon 6 Gen 1
Embedded:
  • QRB6490 (Rubik Pi 3)
  • QRB5165 (RB5)
  • QCS2210 (Arduino UNO Q)

1. Clone the repository

git clone https://github.com/edgeimpulse/example-android-inferencing.git
cd example-android-inferencing/qnn-genai-speech_to_image

2. Set up Qualcomm AI Engine Direct SDK

Download and extract the QNN SDK, then set the environment variable:
export QNN_SDK_ROOT=/path/to/your/qnn/sdk
The SDK provides the necessary libraries and tools for QNN acceleration. Version 2.40.0 is recommended.

3. Install Rust and Android targets

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Add Android target
rustup target add aarch64-linux-android

# Install cargo-ndk
cargo install cargo-ndk

4. Compile Rust tokenization libraries

The app uses Rust libraries for text tokenization. Build them for Android:
# Build tokenizer
cd tokenizer
cargo ndk -t arm64-v8a build --release

# Build detokenizer
cd ../detokenizer
cargo ndk -t arm64-v8a build --release

5. Set up Android project dependencies


Copy QNN libraries

Create the JNI libs directory and copy required libraries:
mkdir -p speech_to_image/app/src/main/jniLibs/arm64-v8a
Copy these libraries (adjust VXX for your Hexagon version, e.g., V79 for Snapdragon 8 Elite):
# From Android NDK
cp $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android/libc++_shared.so speech_to_image/app/src/main/jniLibs/arm64-v8a/

# From Rust builds
cp tokenizer/target/aarch64-linux-android/release/libtokenizer.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp detokenizer/target/aarch64-linux-android/release/libdetokenizer.so speech_to_image/app/src/main/jniLibs/arm64-v8a/

# From QNN SDK (adjust VXX)
cp $QNN_SDK_ROOT/lib/hexagon-vXX/unsigned/libQnnHtpVXX.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/hexagon-vXX/unsigned/libQnnHtpVXXSkel.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnHtp.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnHtpPrepare.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnHtpVXXStub.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnSystem.so speech_to_image/app/src/main/jniLibs/arm64-v8a/

Copy QNN headers and utilities

Run the dependency resolution script:
source scripts/resolveDependencies.sh
This copies QNN headers and additional C++ utilities to the project.

Copy additional dependencies

Download and copy required source files:
# Log mel spectrogram utilities
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/src/log_mel_spectrogram.cpp -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/src/log_mel_spectrogram.cpp
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/src/wavreader.cpp -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/src/wavreader.cpp
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/inc/log_mel_spectrogram.hpp -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/include/log_mel_spectrogram.hpp
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/inc/wavreader.h -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/include/wavreader.h

# Image writing utility
curl -L https://raw.githubusercontent.com/nothings/stb/master/stb_image_write.h -o speech_to_image/app/src/main/cpp/speech_to_image/src/Utils/stb_image_write.h
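The wavreader utility downloaded above loads 16-bit PCM audio for the log-mel stage. For intuition, here is a Python sketch of the same idea (illustrative only; the C++ sources above are what the app actually uses):

```python
import struct
import wave

def read_wav_mono(path):
    """Read a 16-bit PCM WAV file, downmix to mono, normalize to [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        ch = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
        samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
        # Average interleaved channels down to mono, then scale int16 -> float
        mono = [sum(samples[i:i + ch]) / ch for i in range(0, len(samples), ch)]
        return [s / 32768.0 for s in mono], wf.getframerate()
```

Whisper expects 16 kHz mono input, so recordings at other sample rates must be resampled before the log-mel step.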

6. Download AI model assets

Create the assets directory and download model files:
mkdir -p speech_to_image/app/src/main/assets

Tokenizers

# CLIP tokenizer
curl -L https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/tokenizer.json -o speech_to_image/app/src/main/assets/tokenizer.json

# Whisper tokenizer (rename to detokenizer.json)
curl -L https://huggingface.co/openai/whisper-small/resolve/main/tokenizer.json -o speech_to_image/app/src/main/assets/detokenizer.json

Stable Diffusion models

Download from Qualcomm Package Manager (requires account):
  • text_encoder.bin
  • unet.bin
  • vae_decoder.bin
Place them in speech_to_image/app/src/main/assets/

Whisper models

# Tiny Whisper models for Snapdragon 8 Elite
curl -L https://huggingface.co/qualcomm/Whisper-Tiny/resolve/main/precompiled/qualcomm-snapdragon-8-elite-for-galaxy/whisper_encoder_tiny.bin -o speech_to_image/app/src/main/assets/whisper_encoder_tiny.bin
curl -L https://huggingface.co/qualcomm/Whisper-Tiny/resolve/main/precompiled/qualcomm-snapdragon-8-elite-for-galaxy/whisper_decoder_tiny.bin -o speech_to_image/app/src/main/assets/whisper_decoder_tiny.bin

Audio processing assets

# Mel spectrogram filterbank
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/assets/mel80.bin -o speech_to_image/app/src/main/assets/mel80.bin
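mel80.bin holds the 80-band mel filterbank as raw float32 weights. The sketch below shows how such a filterbank is applied to a power spectrum to produce Whisper's log-mel features; note the assumed layout (row-major, 80 × n_fft_bins) is an assumption here, and the LogMel C++ sources are authoritative:

```python
import math
import struct

def load_filterbank(path, n_mels=80):
    """Load a float32 mel filterbank; row-major [n_mels x n_fft_bins] assumed."""
    with open(path, "rb") as f:
        data = f.read()
    floats = struct.unpack("<%df" % (len(data) // 4), data)
    n_bins = len(floats) // n_mels
    return [list(floats[m * n_bins:(m + 1) * n_bins]) for m in range(n_mels)]

def log_mel(power_spectrum, filterbank):
    """Project a power spectrum onto mel bands, then take log10 (floored)."""
    out = []
    for row in filterbank:
        energy = sum(w * p for w, p in zip(row, power_spectrum))
        out.append(math.log10(max(energy, 1e-10)))
    return out
```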

Generate embedding tokens

Run the Python script to generate required token files:
python python_scripts/generate_t_emb_and_empty_tokens.py
This creates t_emb_0.raw through t_emb_19.raw and tokens.raw in the assets directory.
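The provided script is what you should use to generate the real assets. For intuition, the t_emb files are time-step embeddings for the UNet's denoising steps; a sketch of the standard sinusoidal embedding scheme Stable Diffusion's UNet uses (the dimension and file naming here are illustrative assumptions):

```python
import math
import struct

def timestep_embedding(t, dim=320):
    """Sinusoidal timestep embedding: first half sin, second half cos,
    with geometrically spaced frequencies (as in Stable Diffusion's UNet)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return ([math.sin(t * f) for f in freqs] +
            [math.cos(t * f) for f in freqs])

def write_raw(path, values):
    """Serialize as little-endian float32, matching the .raw asset format."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%df" % len(values), *values))
```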

7. Configure device for QNN

For proper QNN functionality, push the HTP skeleton library to the device:
adb root
adb shell setenforce 0
adb disable-verity
adb shell mount -o rw,remount /vendor
adb push $QNN_SDK_ROOT/lib/hexagon-vXX/unsigned/libQnnHtpVXXSkel.so /vendor/lib/rfsa/adsp/
Device modifications require root access. This step is necessary for HTP acceleration but may void warranties.

8. Build and run the application

First, configure microphone permissions. Add the following to your AndroidManifest.xml if not already present:
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-feature android:name="android.hardware.microphone" />
Then open the project in Android Studio:
# Open Android Studio and select "Open" > navigate to speech_to_image directory
Build and run on your Snapdragon device. The app will:
  1. Request microphone permissions
  2. Display a chat interface
  3. Record speech when you tap the microphone button
  4. Process speech to text using Whisper
  5. Generate images using Stable Diffusion
  6. Display results in the chat

Common workflow

All Android tutorials in this series follow this pattern:
  1. Export model from Studio → Deployment → Android (C++ library)
  2. Download TFLite libraries (if using TFLite):
cd app/src/main/cpp/tflite
sh download_tflite_libs.sh  # or .bat for Windows
  3. Copy model files to app/src/main/cpp/ (skip CMakeLists.txt)
  4. Update test features or assets as needed
  5. Build and run in Android Studio

Troubleshooting

Build issues

  • Missing libraries: Verify all QNN and Rust libraries are copied to jniLibs/arm64-v8a/
  • Header errors: Ensure resolveDependencies.sh ran successfully
  • CMake errors: Check NDK and CMake versions match requirements

Runtime issues

  • QNN initialization fails: Confirm device has Hexagon NPU and libraries are pushed correctly
  • Model loading fails: Verify all asset files are present and not corrupted
  • Poor performance: Ensure device is not overheating; QNN acceleration may throttle

Audio processing

  • No speech detected: Check microphone permissions and device audio settings
  • Poor transcription: Whisper tiny model is lightweight; consider larger models for better accuracy

Performance expectations

On Snapdragon 8 Elite:
  • Speech-to-text: ~100-200ms for 5-10 second clips
  • Image generation: ~2-5 seconds for 512×512 images
  • Total latency: ~3-7 seconds end-to-end
Performance varies by device, model size, and prompt complexity. The app demonstrates on-device AI capabilities without cloud dependency.

Summary

You’ve successfully used Qualcomm QNN hardware acceleration for GenAI models on Android. Key takeaways:
  • Integrated Whisper and Stable Diffusion models for speech-to-image generation
  • Leveraged QNN SDK for efficient on-device inference
  • Built a user-friendly Android application with Kotlin and C++
With QNN acceleration, you can deploy production-grade ML applications on edge devices with real-time performance and low power consumption. This tutorial demonstrates the power of running sophisticated AI models entirely on mobile devices using Qualcomm’s hardware acceleration capabilities. Questions? Join the discussion on the Edge Impulse Forum!