From TinyML to VLMs: Bringing Visual Intelligence to App Lab

Created By: Solomon Githu
Public Project Link: Object detection with FOMO - Bottles vs. cans
GitHub Repo: https://github.com/SolomonGithu/object-detection-and-vlm-description

Project description

Model cascading is a technique in which multiple models are linked sequentially to create a more capable AI system. Each model performs a specialized task, such as object detection, and passes its output to another model or execution layer. For example, in an AI-powered parking lot management system, a lightweight model can detect cars and pass bounding boxes to a second model that extracts more detail about each car, such as its color or registration number, or even describes whether it is parked properly.

Generative AI models such as LLMs and VLMs are driving a shift in AI by advancing systems from basic classification to rich contextual descriptions. Software developments in the embedded AI field, including platforms such as Edge Impulse, have simplified development and deployment, enabling developers to bring efficient AI solutions to the world quickly. The next challenge for developers is running Gen AI models on resource-constrained devices. In this case, the target deployment devices are not powerful CPU+GPU-backed systems, but small, low-power and cost-effective devices such as the Arduino UNO Q. This new board has the same form factor as the classic Arduino UNO, but it packs more performance thanks to its Linux system and a fast, precise STM32 microcontroller.

As an Edge AI enthusiast, I have been working on bringing seamless deployment of VLMs to the UNO Q. This has been explored before, but my goal was to integrate the models with the new Arduino App Lab software, a unified environment for creating Arduino Apps for the new generation of dual-brain Arduino boards. The software lets developers focus on the application logic and compose a project by stacking together ready-to-use Arduino Bricks (modules).

To demonstrate this VLM integration, the project builds on the classic bottles vs. cans example from Edge Impulse, which uses the lightweight FOMO object detection model. Now suppose we want to accurately detect when any kind of beer is on the table. Instead of retraining the FOMO model with images of every beer in the world, we can further analyze the detections with a VLM. In this application, when FOMO identifies a beer, the camera frame is passed to a SmolVLM-500M model running locally on the UNO Q. The model then generates contextual descriptions such as “The beverages on the table are beer and coke” or even “The beverage on the table is a can of Red Bull”. This demonstrates how VLMs can extend Computer Vision applications from simple detection to a richer visual understanding without continuously retraining models. Running this custom application is as simple as cloning the repository to your App Lab and clicking the ‘Run’ button.

Components and hardware configuration

Hardware components:

Arduino UNO Q (either the 2GB or 4GB)
USB-C Hub with power delivery
USB webcam
Cans of soda/energy drink and a bottle of beer

Software components:

Arduino App Lab
Edge Impulse Studio account

Step 1: Set up your UNO Q

Before working with the UNO Q for the first time, we need to set up the Linux system through App Lab. Arduino has documented the necessary steps in the user manual. Once you have completed this, we can set up the board for video object detection. First, connect a USB-C hub to the board. Next, connect a USB webcam to the hub and power the system through the hub's Power Delivery port.

Step 2: Train a custom TinyML model with Edge Impulse

Note that as of version 0.5.0 of App Lab, Edge Impulse model integration has been added. This feature allows you to train optimized models in the Studio and deploy them to your App Lab project with the click of a button from the Deployment page. However, in this guide I will show how to manually configure the UNO Q to load a model from Edge Impulse. As mentioned, we will start with a pretrained model that detects beer bottles and cans. You can access the project with this URL: Object detection with FOMO - Bottles vs. cans

Sign in to the Studio with your account and click the ‘Clone this project’ button at the top right of the page. Afterwards, click ‘Clone project’ in the dialog that pops up.

Once the cloning process is complete, we first need to change the project’s target hardware to the UNO Q. Click ‘Target:XXXXX’ and set the Target device to Arduino UNO Q using the dropdown list.

Next, click ‘Deployment’ and in the ‘Deployment target’ field, search for UNO Q and select the EIM binary option.

Finally, at the bottom of the page, click the ‘Build’ button and the Studio will download an Edge Impulse Model (.eim) binary to your computer. EIM files are native Linux and macOS binary applications that contain the full impulse created in Edge Impulse Studio. The impulse consists of the signal processing block(s) along with any learning and anomaly block(s) you added and trained. EIM files are compiled for your particular system architecture and are used to run inference natively on your system.

Once the download process is completed, rename the downloaded .eim file to model.eim.

Our first model in the cascading architecture is ready. In Edge Impulse Studio, we can see that the on-device inferencing time for this model is just 6ms on the UNO Q, with peak RAM usage of 244KB and flash usage of 77.6KB. Thanks to the resources on the UNO Q and the model optimization, it runs effortlessly on the device with plenty of room left for the VLM reasoning.

Step 3: Copy TinyML model to UNO Q

On your personal computer, use SCP, VS Code’s remote SSH extension or software such as WinSCP to copy the model.eim file to the following folder on the UNO Q:

/home/arduino/.arduino-bricks/ei-models/

On the UNO Q, create a folder in the /home/arduino/.arduino-bricks/models/custom-ei/ directory. For example, I named my folder ‘ei-model-1000170-1’. Next, create a model.yaml file in this folder and paste the following into it:

id: "ei-model-1000170-1"
name: "Object detection with FOMO - Bottles vs. cans"
runner: "brick"
description: "manually deployed edge impulse model"
bricks:
  - id: "arduino:video_object_detection"
    model_configuration:
      EI_V_OBJ_DETECTION_MODEL: "/home/arduino/.arduino-bricks/ei-models/model.eim"
  - id: "arduino:object_detection"
    model_configuration:
      EI_V_OBJ_DETECTION_MODEL: "/home/arduino/.arduino-bricks/ei-models/model.eim"

This YAML file describes the model configuration: its name, the path to the executable, and the bricks that can use it. This is what allows the video_object_detection brick to identify the model and load it properly in the Docker container. If this configuration is not done correctly, the brick will either fail to load the model or fall back to the default pre-trained model.
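If you deploy several models this way, writing model.yaml by hand gets repetitive. The following Python sketch renders the same file from a few parameters. The helper names render_model_yaml and write_model_yaml are hypothetical and not part of App Lab; the keys and brick IDs are copied from the file above, and only the id, name and model path are treated as variable.

```python
from pathlib import Path

# Template mirroring the model.yaml shown above; only the id, name and
# model path are assumed to vary between manually deployed models.
TEMPLATE = """\
id: "{model_id}"
name: "{name}"
runner: "brick"
description: "manually deployed edge impulse model"
bricks:
  - id: "arduino:video_object_detection"
    model_configuration:
      EI_V_OBJ_DETECTION_MODEL: "{eim_path}"
  - id: "arduino:object_detection"
    model_configuration:
      EI_V_OBJ_DETECTION_MODEL: "{eim_path}"
"""


def render_model_yaml(model_id: str, name: str, eim_path: str) -> str:
    """Return the model.yaml text for one model (hypothetical helper)."""
    return TEMPLATE.format(model_id=model_id, name=name, eim_path=eim_path)


def write_model_yaml(base_dir: Path, model_id: str, name: str, eim_path: str) -> Path:
    """Create <base_dir>/<model_id>/model.yaml and return its path."""
    folder = base_dir / model_id
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / "model.yaml"
    target.write_text(render_model_yaml(model_id, name, eim_path), encoding="utf-8")
    return target
```

On the UNO Q, base_dir would be Path("/home/arduino/.arduino-bricks/models/custom-ei") and eim_path the location where model.eim was copied.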

Step 4: Copy the VLM application to App Lab

On your personal computer, clone the GitHub repository:

git clone https://github.com/SolomonGithu/object-detection-and-vlm-description.git

This repo includes backend and frontend code that captures frames from a USB camera and passes them to a TinyML model. The camera feed and inference results are shown on the UI, similar to how the base project (Detect Objects on Camera) implements it. Next, use this link to download the runtime libraries and model files used by llama.cpp to run the SmolVLM-500M model locally from main.py. Once the download is complete, copy all the files in the Google Drive folder to the models folder of this repo. Afterwards, use SCP, VS Code’s remote SSH extension or software such as WinSCP to copy the repo to the /home/arduino/ArduinoApps/ folder on your UNO Q. Once this is done, open App Lab and you should see the application listed in the ‘My Apps’ section.

In main.py, vlm_prompting_label defines the class that, when detected, triggers the VLM to be loaded and prompted with the text defined by vlm_prompt. To reduce the computational load, each frame is resized before it is passed to the SmolVLM-500M model. In a Vision-Language Model (VLM), a prompt consists of text inputs and visual inputs (such as images or video frames). These are converted into tokens, which are the numerical chunks of data that the model processes.
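The trigger check and the resize step can be sketched as follows. This is a minimal illustration and not the code in main.py: the variable names vlm_prompting_label and vlm_prompt come from the repo, but their values here, the detection dictionary layout (a "label" key), the helper names, and the 384-pixel limit are all assumptions.

```python
from typing import Iterable, Mapping, Tuple

# Names as used in main.py; the values below are illustrative assumptions.
vlm_prompting_label = "beer"
vlm_prompt = "What beverages are on the table?"


def should_prompt_vlm(
    detections: Iterable[Mapping[str, object]],
    label: str = vlm_prompting_label,
) -> bool:
    """Return True if any detection carries the trigger label.

    Assumes each detection is a mapping with a "label" key.
    """
    return any(d.get("label") == label for d in detections)


def resized_dimensions(width: int, height: int, max_side: int = 384) -> Tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side,
    keeping the aspect ratio. Frames that are already small are unchanged."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A smaller frame produces fewer image tokens for the vision encoder, which is why resizing before prompting reduces the work the VLM has to do.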

Step 5: Run the application

In App Lab, click the application and launch it with the ‘Run’ button. Starting the application for the first time takes a little while because the system needs to pull the necessary Docker images. Once this is finished, the application container starts and the app automatically opens in the web browser. You can also open the Web UI manually by browsing to the local IP address of the UNO Q on port 7000.

In the default application, when a beer is detected the SmolVLM-500M model is loaded and prompted. This process uses almost all of the remaining RAM and CPU. While the VLM is running, further triggers are ignored until the process has finished. On the Web UI, the video feed continues to show the live camera stream and bounding boxes (or centroids in the case of FOMO). Finally, the response from the VLM is shown on the UI.
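One way to get this "ignore triggers while busy" behaviour is a non-blocking lock around the VLM call. The sketch below is an assumption about how such gating could be written, not the implementation in main.py; the class name VlmGate and the describe and on_result callables are hypothetical stand-ins for the VLM call and the UI update.

```python
import threading
from typing import Callable, Optional


class VlmGate:
    """Runs at most one VLM job at a time; triggers that arrive while a
    job is in progress are dropped rather than queued."""

    def __init__(
        self,
        describe: Callable[[object, str], str],
        on_result: Callable[[str], None],
    ) -> None:
        self._describe = describe      # hypothetical: runs the VLM on a frame
        self._on_result = on_result    # hypothetical: publishes text to the UI
        self._busy = threading.Lock()

    def try_trigger(self, frame: object, prompt: str) -> Optional[threading.Thread]:
        """Start a background VLM job, or return None if one is running."""
        # Non-blocking acquire returns False when a job already holds the lock.
        if not self._busy.acquire(blocking=False):
            return None
        worker = threading.Thread(
            target=self._run, args=(frame, prompt), daemon=True
        )
        worker.start()
        return worker

    def _run(self, frame: object, prompt: str) -> None:
        try:
            self._on_result(self._describe(frame, prompt))
        finally:
            # Always release so a failed job does not block later triggers.
            self._busy.release()
```

Because the VLM runs on a worker thread, the detection loop can keep streaming frames and FOMO results to the UI while the description is being generated.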

Using VLM-based actioning can also reduce false positives. For instance, the FOMO model would sometimes classify cans or background objects as beer. These errors can come from differences in lighting, background and object appearance compared to the data used in training. However, by re-evaluating the frame, the VLM is able to clarify that only energy drinks are present.

SmolVLM-500M performance evaluation (2GB vs 4GB UNO Q)

The performance evaluation of SmolVLM-500M on the 2GB and 4GB UNO Q variants showed near-identical inference speeds. The prompt evaluation rates were 4.08 and 3.96 tokens/second respectively, with total inference times of approximately 22.3 and 23.3 seconds. Interestingly, the vision encoder accounted for more than 90% of the compute time on both boards, showing that the bottleneck is the CPU during image encoding rather than the available memory (RAM).

When running the SmolVLM-500M model on the 2GB board, the CPU utilization was 100% while peak RAM consumption was 780MB out of the available 1.70GB.

With the same model running on the 4GB board, the CPU utilization was 99% while the peak RAM consumption was 1.21GB out of the available 3.58GB.
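To put these figures in proportion, the short calculation below derives the share of available RAM used on each board and the lower bound on vision-encoder time implied by the "more than 90%" figure. It uses only the numbers reported above and assumes 1GB = 1000MB when converting the 780MB reading.

```python
def share(used: float, total: float) -> float:
    """Fraction of `total` taken up by `used` (same units for both)."""
    return used / total


# Peak RAM as a fraction of available RAM, in GB.
# Assumption: 780MB is treated as 0.78GB (1GB = 1000MB).
ram_share_2gb = share(0.78, 1.70)
ram_share_4gb = share(1.21, 3.58)

# Lower bound on time spent in the vision encoder, given that it takes
# more than 90% of the total inference time reported for each board.
encoder_min_2gb_s = 0.90 * 22.3
encoder_min_4gb_s = 0.90 * 23.3

print(f"2GB board: {ram_share_2gb:.0%} of available RAM, encoder > {encoder_min_2gb_s:.1f}s")
print(f"4GB board: {ram_share_4gb:.0%} of available RAM, encoder > {encoder_min_4gb_s:.1f}s")
```

Under that assumption, peak usage is roughly 46% of available RAM on the 2GB board and 34% on the 4GB board, while the vision encoder takes more than about 20 seconds on both, which is consistent with the CPU rather than memory being the limiting factor.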

Note that you can run either the SmolVLM-256M or SmolVLM-500M model. However, in my experiments, the SmolVLM-256M model showed significant limitations in captioning. It occasionally produced hallucinated text, mistook bottles for cans, and followed instructions less reliably than expected. Looking at the model’s training details, we can see that just 18% of the training data was dedicated to image captioning tasks. This and the smaller parameter count likely constrain its capacity, making it better suited to relatively simple image description use cases than to detailed visual reasoning. To load the SmolVLM-256M model, first download these open-source files and put them in the models folder: mmproj-SmolVLM-256M-Instruct-Q8_0.gguf and SmolVLM-256M-Instruct-Q8_0.gguf. Next, in main.py, update model_path and mmproj_path to point to the downloaded SmolVLM-256M files.
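As a rough sketch, the updated paths could look like the snippet below. The names model_path and mmproj_path come from main.py, but how they are defined there is not shown here, so the MODELS_DIR constant, the relative folder layout and the missing_files helper are assumptions added for illustration.

```python
from pathlib import Path
from typing import List

# Assumption: the models folder is relative to the app's working directory.
MODELS_DIR = Path("models")

# Point both paths at the SmolVLM-256M files named above.
model_path = MODELS_DIR / "SmolVLM-256M-Instruct-Q8_0.gguf"
mmproj_path = MODELS_DIR / "mmproj-SmolVLM-256M-Instruct-Q8_0.gguf"


def missing_files(*paths: Path) -> List[Path]:
    """Return the paths that do not exist as regular files (hypothetical helper)."""
    return [p for p in paths if not p.is_file()]


missing = missing_files(model_path, mmproj_path)
if missing:
    print("Missing model files:", ", ".join(str(p) for p in missing))
```

Checking for both files before launching the app gives a clearer error than a failed model load inside the container.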

Key Takeaways

Cascading AI models significantly improves AI systems by going beyond simple detections to richer understanding. With Edge GenAI models, we can deploy such systems locally, reducing computational cost and latency. Platforms such as Edge Impulse simplify this workflow and make it possible to deploy on edge devices such as the UNO Q. Beyond this, VLMs are trained on large datasets such as The Cauldron and Docmatix, which gives them a stronger understanding of real-world environments. For example, the ability of the SmolVLM models to describe “a can of Red Bull” or “a bottle of Heineken beer” highlights potential applications such as real-time medical systems that can estimate nutritional properties such as sugar and calorie content, straight from the webcam!
