Build an advanced Android application that converts speech audio into detailed images using Qualcomm’s AI Engine Direct SDK. This tutorial demonstrates real-time speech-to-text processing with Whisper and image generation with Stable Diffusion, all accelerated on Snapdragon hardware.
This tutorial showcases cutting-edge AI capabilities on mobile devices. You’ll build a chat-like interface where users can speak prompts to generate images using hardware-accelerated AI models. Reference code: https://github.com/edgeimpulse/example-android-inferencing/tree/main/qnn-genai-speech_to_image

What you’ll build

An Android application that:
  • Records speech input using Android’s audio APIs
  • Processes speech to text using Whisper model accelerated by QNN
  • Generates detailed images from text prompts using Stable Diffusion
  • Displays results in a chat-like interface
  • Runs entirely on-device with hardware acceleration
Time: 2-3 hours
Difficulty: Advanced

Architecture overview

The app uses a hybrid architecture:
  • Kotlin UI: Handles user interaction, audio recording, and display
  • C++ Backend: QNN APIs for model inference on Snapdragon HTP
  • JNI Bridge: Seamless communication between Kotlin and C++
  • Rust Libraries: Tokenization/detokenization for model inputs/outputs

Prerequisites

  • Edge Impulse account: Sign up
  • Trained model: Complete a tutorial first (for custom workflows)
  • Android Studio: 2024.3.1 or later
  • Android NDK: Version 27.2.12479018
  • Rust toolchain: Latest stable with Android targets
  • Qualcomm AI Engine Direct SDK: v2.40.0 (Download)
  • Python 3.10: For asset generation scripts
  • Snapdragon device with Hexagon NPU (see supported devices below)
This app requires significant setup and model assets. Ensure you have a Snapdragon device with adequate storage for the AI models (~500MB+).

Supported devices

Tested on Snapdragon 8 Elite. Compatible with devices featuring a Hexagon NPU Gen 2 or later:
Mobile:
  • Snapdragon 8 Gen 3/2/1
  • Snapdragon 7+ Gen 2/3
  • Snapdragon 6 Gen 1
Embedded:
  • QRB6490 (Rubik Pi 3)
  • QRB5165 (RB5)
  • QCS2210 (Arduino UNO Q)

1. Clone the repository

git clone https://github.com/edgeimpulse/example-android-inferencing.git
cd example-android-inferencing/qnn-genai-speech_to_image

2. Set up Qualcomm AI Engine Direct SDK

Download and extract the QNN SDK, then set the environment variable:
export QNN_SDK_ROOT=/path/to/your/qnn/sdk
The SDK provides the necessary libraries and tools for QNN acceleration. Version 2.40.0 is recommended.

3. Install Rust and Android targets

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Add Android target
rustup target add aarch64-linux-android

# Install cargo-ndk
cargo install cargo-ndk

4. Compile Rust tokenization libraries

The app uses Rust libraries for text tokenization. Build them for Android:
# Build tokenizer
cd tokenizer
cargo ndk -t arm64-v8a build --release

# Build detokenizer
cd ../detokenizer
cargo ndk -t arm64-v8a build --release

5. Set up Android project dependencies


Copy QNN libraries

Create the JNI libs directory and copy required libraries:
mkdir -p speech_to_image/app/src/main/jniLibs/arm64-v8a
Copy these libraries (adjust VXX for your Hexagon version, e.g., V79 for Snapdragon 8 Elite):
# From Android NDK
cp $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android/libc++_shared.so speech_to_image/app/src/main/jniLibs/arm64-v8a/

# From Rust builds
cp tokenizer/target/aarch64-linux-android/release/libtokenizer.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp detokenizer/target/aarch64-linux-android/release/libdetokenizer.so speech_to_image/app/src/main/jniLibs/arm64-v8a/

# From QNN SDK (adjust VXX)
cp $QNN_SDK_ROOT/lib/hexagon-vXX/unsigned/libQnnHtpVXX.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/hexagon-vXX/unsigned/libQnnHtpVXXSkel.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnHtp.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnHtpPrepare.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnHtpVXXStub.so speech_to_image/app/src/main/jniLibs/arm64-v8a/
cp $QNN_SDK_ROOT/lib/aarch64-android/libQnnSystem.so speech_to_image/app/src/main/jniLibs/arm64-v8a/

Copy QNN headers and utilities

Run the dependency resolution script:
source scripts/resolveDependencies.sh
This copies QNN headers and additional C++ utilities to the project.

Copy additional dependencies

Download and copy required source files:
# Log mel spectrogram utilities
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/src/log_mel_spectrogram.cpp -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/src/log_mel_spectrogram.cpp
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/src/wavreader.cpp -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/src/wavreader.cpp
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/inc/log_mel_spectrogram.hpp -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/include/log_mel_spectrogram.hpp
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/inc/wavreader.h -o speech_to_image/app/src/main/cpp/speech_to_image/src/LogMel/include/wavreader.h

# Image writing utility
curl -L https://raw.githubusercontent.com/nothings/stb/master/stb_image_write.h -o speech_to_image/app/src/main/cpp/speech_to_image/src/Utils/stb_image_write.h
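The wavreader utility downloaded above loads 16-bit PCM audio for the log-mel stage. For intuition, here is a Python sketch of the same idea (illustrative only; the C++ sources above are what the app actually uses):

```python
import struct
import wave

def read_wav_mono(path):
    """Read a 16-bit PCM WAV file, downmix to mono, normalize to [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        ch = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
        samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
        # Average interleaved channels down to mono, then scale int16 -> float
        mono = [sum(samples[i:i + ch]) / ch for i in range(0, len(samples), ch)]
        return [s / 32768.0 for s in mono], wf.getframerate()
```

Whisper expects 16 kHz mono input, so recordings at other sample rates must be resampled before the log-mel step.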

6. Download AI model assets

Create the assets directory and download model files:
mkdir -p speech_to_image/app/src/main/assets

Tokenizers

# CLIP tokenizer
curl -L https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/tokenizer.json -o speech_to_image/app/src/main/assets/tokenizer.json

# Whisper tokenizer (rename to detokenizer.json)
curl -L https://huggingface.co/openai/whisper-small/resolve/main/tokenizer.json -o speech_to_image/app/src/main/assets/detokenizer.json

Stable Diffusion models

Download from Qualcomm Package Manager (requires account):
  • text_encoder.bin
  • unet.bin
  • vae_decoder.bin
Place them in speech_to_image/app/src/main/assets/

Whisper models

# Tiny Whisper models for Snapdragon 8 Elite
curl -L https://huggingface.co/qualcomm/Whisper-Tiny/resolve/main/precompiled/qualcomm-snapdragon-8-elite-for-galaxy/whisper_encoder_tiny.bin -o speech_to_image/app/src/main/assets/whisper_encoder_tiny.bin
curl -L https://huggingface.co/qualcomm/Whisper-Tiny/resolve/main/precompiled/qualcomm-snapdragon-8-elite-for-galaxy/whisper_decoder_tiny.bin -o speech_to_image/app/src/main/assets/whisper_decoder_tiny.bin

Audio processing assets

# Mel spectrogram filterbank
curl -L https://raw.githubusercontent.com/psmdv/log_mel_spectrogram/main/assets/mel80.bin -o speech_to_image/app/src/main/assets/mel80.bin
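mel80.bin holds the 80-band mel filterbank as raw float32 weights. The sketch below shows how such a filterbank is applied to a power spectrum to produce Whisper's log-mel features; note the assumed layout (row-major, 80 × n_fft_bins) is an assumption here, and the LogMel C++ sources are authoritative:

```python
import math
import struct

def load_filterbank(path, n_mels=80):
    """Load a float32 mel filterbank; row-major [n_mels x n_fft_bins] assumed."""
    with open(path, "rb") as f:
        data = f.read()
    floats = struct.unpack("<%df" % (len(data) // 4), data)
    n_bins = len(floats) // n_mels
    return [list(floats[m * n_bins:(m + 1) * n_bins]) for m in range(n_mels)]

def log_mel(power_spectrum, filterbank):
    """Project a power spectrum onto mel bands, then take log10 (floored)."""
    out = []
    for row in filterbank:
        energy = sum(w * p for w, p in zip(row, power_spectrum))
        out.append(math.log10(max(energy, 1e-10)))
    return out
```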

Generate embedding tokens

Run the Python script to generate required token files:
python python_scripts/generate_t_emb_and_empty_tokens.py
This creates t_emb_0.raw through t_emb_19.raw and tokens.raw in the assets directory.
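The provided script is what you should use to generate the real assets. For intuition, the t_emb files are time-step embeddings for the UNet's denoising steps; a sketch of the standard sinusoidal embedding scheme Stable Diffusion's UNet uses (the dimension and file naming here are illustrative assumptions):

```python
import math
import struct

def timestep_embedding(t, dim=320):
    """Sinusoidal timestep embedding: first half sin, second half cos,
    with geometrically spaced frequencies (as in Stable Diffusion's UNet)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return ([math.sin(t * f) for f in freqs] +
            [math.cos(t * f) for f in freqs])

def write_raw(path, values):
    """Serialize as little-endian float32, matching the .raw asset format."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%df" % len(values), *values))
```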

7. Configure device for QNN

For proper QNN functionality, push the HTP skeleton library to the device:
adb root
adb shell setenforce 0
adb disable-verity
adb shell mount -o rw,remount /vendor
adb push $QNN_SDK_ROOT/lib/hexagon-vXX/unsigned/libQnnHtpVXXSkel.so /vendor/lib/rfsa/adsp/
Device modifications require root access. This step is necessary for HTP acceleration but may void warranties.

8. Build and run the application

First, configure microphone permissions. Add the following to your AndroidManifest.xml if not already present:
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-feature android:name="android.hardware.microphone" />
Then open the project in Android Studio:
# Open Android Studio and select "Open" > navigate to speech_to_image directory
Build and run on your Snapdragon device. The app will:
  1. Request microphone permissions
  2. Display a chat interface
  3. Record speech when you tap the microphone button
  4. Process speech to text using Whisper
  5. Generate images using Stable Diffusion
  6. Display results in the chat

Common workflow

All Android tutorials in this series follow this pattern:
  1. Export model from Studio → Deployment → Android (C++ library)
  2. Download TFLite libraries (if using TFLite):
cd app/src/main/cpp/tflite
sh download_tflite_libs.sh  # or .bat for Windows
  3. Copy model files to app/src/main/cpp/ (skip CMakeLists.txt)
  4. Update test features or assets as needed
  5. Build and run in Android Studio

Troubleshooting

Build issues

  • Missing libraries: Verify all QNN and Rust libraries are copied to jniLibs/arm64-v8a/
  • Header errors: Ensure resolveDependencies.sh ran successfully
  • CMake errors: Check NDK and CMake versions match requirements

Runtime issues

  • QNN initialization fails: Confirm device has Hexagon NPU and libraries are pushed correctly
  • Model loading fails: Verify all asset files are present and not corrupted
  • Poor performance: Ensure device is not overheating; QNN acceleration may throttle

Audio processing

  • No speech detected: Check microphone permissions and device audio settings
  • Poor transcription: Whisper tiny model is lightweight; consider larger models for better accuracy

Performance expectations

On Snapdragon 8 Elite:
  • Speech-to-text: ~100-200ms for 5-10 second clips
  • Image generation: ~2-5 seconds for 512×512 images
  • Total latency: ~3-7 seconds end-to-end
Performance varies by device, model size, and prompt complexity. The app demonstrates on-device AI capabilities without cloud dependency.

Summary

You’ve successfully used Qualcomm QNN hardware acceleration for GenAI models on Android. Key takeaways:
  • Integrated Whisper and Stable Diffusion models for speech-to-image generation
  • Leveraged QNN SDK for efficient on-device inference
  • Built a user-friendly Android application with Kotlin and C++
With QNN acceleration, you can deploy production-grade ML applications on edge devices with real-time performance and low power consumption. This tutorial demonstrates the power of running sophisticated AI models entirely on mobile devices using Qualcomm’s hardware acceleration capabilities. Questions? Join the discussion on the Edge Impulse Forum!