Model Cascading from Visual Anomaly Detection to VLM with Arduino App Lab

Public Project Link: Anomaly detection for packaging quality assurance
GitHub Repo: https://github.com/SolomonGithu/anomaly-detection-and-vlm-description

Project description

Manufacturers have traditionally relied on two methods for product inspection: human inspectors and image classification algorithms based on convolutional neural networks (CNNs). This is changing as Vision Language Models (VLMs) offer a more advanced alternative. Companies are also integrating AI agents into their procedures, and combining these agents with VLMs improves automation.

What went wrong with traditional Machine Learning models, you may ask? First, they require a lot of training data, and collecting it is a challenging and resource-demanding task. Second, these models do not react well to changes in the inference environment, such as a different background, lighting or contrast. Because they are trained on larger and more diverse datasets, VLMs have emerged as a way to bring more advanced intelligence to Computer Vision applications. They can express their results in natural language, which makes them suitable for knowledge-driven tasks.

Software developments in the embedded AI field have enabled rapid AI development and deployment, allowing developers to bring AI solutions to the world quickly. Platforms such as Edge Impulse simplify these tasks and let us rapidly create efficient AI solutions. The Arduino UNO Q has the same form factor as the classic Arduino UNO, but it packs more performance thanks to its Linux system and a fast, precise STM32 microcontroller. In one of my previous projects, I showcased how we can run VLMs on the UNO Q.

To demonstrate the use of vision language models in production inspection, this project presents a logistics use case in which packaged goods are analyzed for defects such as dents, tears or liquid spills prior to shipping. Since the VLM is resource-demanding, real-time responses would require expensive hardware. However, we can first quickly examine packages for visual abnormalities using a lightweight model. When this visual anomaly detection model identifies an abnormality, a SmolVLM-500M VLM is used to further analyze the image. The response from the VLM can assist human inspectors in decision-making, or it can be passed to AI agents that automatically adjust operational parameters, such as robot handling forces if the packages are seen to have dents. Both the visual anomaly detection model and the VLM run locally on the UNO Q. The result has been packaged as an Arduino App Lab project.
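
The cascade described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code from the repository's main.py: score_image and describe_image are hypothetical stand-ins for the anomaly detection model and the SmolVLM-500M call, and the default threshold and prompt are illustrative values. Only the names vlm_anomaly_threshold and vlm_prompt come from the project (see Step 4).

```python
from typing import Callable, Optional, Tuple


def run_cascade(
    image: bytes,
    score_image: Callable[[bytes], float],        # hypothetical: fast anomaly model
    describe_image: Callable[[bytes, str], str],  # hypothetical: slow VLM call
    vlm_anomaly_threshold: float = 0.5,           # illustrative value only
    vlm_prompt: str = "Describe the condition of the package.",  # illustrative
) -> Tuple[float, Optional[str]]:
    """Run the lightweight model first; call the VLM only above the threshold."""
    score = score_image(image)
    if score > vlm_anomaly_threshold:
        # Only suspicious images pay the cost of the expensive model.
        return score, describe_image(image, vlm_prompt)
    # No anomaly above the threshold: skip the VLM entirely.
    return score, None
```

The design choice is that the expensive model is never invoked for images the lightweight model considers normal, which is what keeps the overall system responsive on constrained hardware.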

Components and hardware configuration

Hardware components:

Arduino® UNO Q: either the 2GB or 4GB variant
USB-C® cable for powering the UNO Q

Software components:

Arduino App Lab
Edge Impulse Studio

Step 1: Set up your UNO Q

Before working with the UNO Q for the first time, we need to set up the Linux system through App Lab. Arduino has documented the necessary steps in the user manual.

Step 2: Train a custom model with Edge Impulse

Note that as of version 0.5.0, App Lab includes Edge Impulse model integration. This feature lets you train models in the Studio and deploy them to your App Lab project with the click of a button from the Deployment page. However, today I will showcase how to manually configure the UNO Q to load a model from Edge Impulse. We will start with a pretrained model that detects anomalies in packaging boxes. You can access the project with this URL: Anomaly detection for packaging quality assurance.

Sign in to the Studio with your account and click the ‘Clone this project’ button at the top right of the page. Afterwards, click ‘Clone project’ in the dialog that pops up. Once the cloning process finishes, click ‘Deployment’ and, in the ‘Deployment target’ field, search for UNO Q and select the EIM binary option.

Finally, at the bottom of the page, click the ‘Build’ button and the Studio will download an Edge Impulse Model (.eim) binary to your computer. EIM files are native Linux and macOS binary applications that contain the full impulse created in Edge Impulse Studio. The impulse consists of the signal processing block(s) along with any learning and anomaly block(s) you added and trained. EIM files are compiled for your particular system architecture and are used to run inference natively on your system.

Once the download is complete, we can rename the model file to a shorter, more descriptive name such as anomaly-detection-packages-impulse1.eim. Note that Edge Impulse Studio allows you to create multiple impulses for experiments such as tweaking processing blocks or model architectures. In that case, it is good practice to tag the exported model files.

Our first model in the cascading architecture is ready! In Edge Impulse Studio, we can see that the on-device inferencing time for this model is just 31 ms on the UNO Q, with peak RAM usage of 3.3 MB and flash usage of 372 KB. Thanks to the resources on the UNO Q and the model optimization, the model runs comfortably on the device, leaving plenty of room for SmolVLM.

Step 3: Copy anomaly detection model to UNO Q

On your personal computer, use SCP, VS Code’s remote SSH extension or software such as WinSCP to copy the anomaly-detection-packages-impulse1.eim (or as per your filename) file to the following directory on the UNO Q:

/home/arduino/.arduino-bricks/ei-models/

Using a terminal, SSH into the UNO Q and create a new directory under /home/arduino/.arduino-bricks/models/custom-ei/. For example, I named this folder ‘ei-model-1030618-1’ (1030618 is the Studio project ID). Once it is created, navigate to it and create a model.yaml file (that is, /home/arduino/.arduino-bricks/models/custom-ei/<your folder name>/model.yaml). Paste the following content into the YAML file:

id: "ei-model-1030618-1"
name: "Anomaly detection for packaging quality assurance"
runner: "brick"
description: "manually deployed edge impulse model"

bricks:
  - id: "arduino:visual_anomaly_detection"
    model_configuration:
      # Must point to the .eim file copied in Step 3 (adjust to your filename)
      EI_V_ANOMALY_DETECTION_MODEL: "/home/arduino/.arduino-bricks/ei-models/anomaly-detection-packages-impulse1.eim"

IMPORTANT!! Ensure that the id value in model.yaml matches the name of the directory you created (e.g. id: "ei-model-1030618-1") and that the model path matches the name of the .eim file you copied. The YAML file describes the model configuration: its name, the path to the .eim executable and the bricks that can use it. This allows the visual_anomaly_detection brick to identify the model and load it properly in the Docker container. If these settings are not correct, the brick will fail to load the model.
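
Because a mismatched id is an easy mistake to make, a small check can catch it before launching the app. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it assumes model.yaml has a simple unindented id: line as in the example above, so it uses a regular expression instead of a full YAML parser.

```python
import re
from pathlib import Path
from typing import Optional


def read_model_id(yaml_text: str) -> Optional[str]:
    """Return the top-level `id` value, or None when no such line exists."""
    # Only unindented `id:` lines match, so the brick id under `bricks:` is skipped.
    match = re.search(r'^id:\s*["\']?([^"\'\s#]+)["\']?', yaml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def id_matches_directory(model_yaml: Path) -> bool:
    """True when the id in model.yaml equals the name of its parent directory."""
    model_id = read_model_id(model_yaml.read_text(encoding="utf-8"))
    return model_id == model_yaml.parent.name
```

On the UNO Q, this could be called with the path to your own file, for example Path("/home/arduino/.arduino-bricks/models/custom-ei/ei-model-1030618-1/model.yaml").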

Step 4: Copy the VLM application to App Lab

On your personal computer, clone the GitHub repository:

git clone https://github.com/SolomonGithu/anomaly-detection-and-vlm-description.git

This repo includes backend and frontend code to select or upload an image, process it with the two AI models and show the results on the Web UI. Next, use this link to download the runtime libraries and model files that llama.cpp uses to run the SmolVLM-500M model locally from main.py. Once the download is complete, copy all the files in the Google Drive folder to the models folder of this repo.

In main.py, vlm_anomaly_threshold defines the anomaly value above which the SmolVLM-500M model is loaded and prompted with the text defined by vlm_prompt. In a Vision Language Model (VLM), a prompt consists of text inputs and visual inputs (such as images or video frames). These are converted into tokens, which are the numerical chunks of data that the model processes.

Open the app.yaml file and replace ei-model-1030618-1 with the model id defined in your model.yaml file. Afterwards, use SCP, VS Code’s remote SSH extension or software such as WinSCP to copy the updated repo to the /home/arduino/ArduinoApps/ directory on your UNO Q. Once this is done, open App Lab and you should see the application listed in the ‘My Apps’ section.

Step 5: Run the application

In App Lab, click the application and launch it with the ‘Run’ button. Starting the application for the first time takes some time, since the system needs to pull the necessary Docker images. Once this is finished, the application container starts and the app automatically opens in the web browser. You can also open the Web UI manually by entering the local IP address of the UNO Q followed by port 7000 in the browser. There are two ways of selecting an image for processing:

Image from sample: Select from pre-loaded test images located in the img folder.
Upload Image: Upload your own JPG or PNG image file (maximum of 500 KB) using drag-and-drop or file selection.
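
The upload constraints listed above (JPG or PNG, at most 500 KB) can be expressed as a simple check. This is a minimal sketch, not the validation code used in the repository: the function name is hypothetical, "500 KB" is assumed to mean 500 × 1024 bytes, and the file type is checked from the leading signature bytes instead of the file extension.

```python
MAX_UPLOAD_BYTES = 500 * 1024  # assumption: "500 KB" read as 500 * 1024 bytes

JPEG_MAGIC = b"\xff\xd8\xff"       # leading bytes of a JPEG file
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # 8-byte PNG signature


def is_acceptable_upload(data: bytes) -> bool:
    """Accept only non-empty JPG or PNG payloads within the size limit."""
    if len(data) == 0 or len(data) > MAX_UPLOAD_BYTES:
        return False
    return data.startswith(JPEG_MAGIC) or data.startswith(PNG_MAGIC)
```

Checking the signature bytes is slightly stricter than trusting the extension, since a renamed file of another type would be rejected.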

Click ‘Run Detection’ to analyze the selected or uploaded image. The image is first analyzed by the visual anomaly detection model. If there is an anomaly with a value greater than vlm_anomaly_threshold, the SmolVLM-500M model is loaded and prompted to describe the image. Note that the VLM processing takes some time (around 29 seconds) and the UNO Q’s CPU utilization peaks at 99%. Once the VLM processing is complete, the Web UI shows the response from the model as well as the image processed by the VLM (without the red square markers).

As noted earlier, Machine Learning models such as our anomaly detection model are sensitive to variations in the inference images (background, lighting, etc.). In the test image shown below, the package is similar to the ones used in training, but minor changes in the scene shadows cause the model to flag parts of the background as anomalies. However, when the same image is further analyzed by the SmolVLM-500M model, it determines that the package is in good condition. This natural language response can be valuable for AI workflows, such as downstream AI agents that make decisions and need to avoid false alarms.
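
To get a feel for what the cascade buys us, we can estimate the average processing time per image from the two figures reported in this article: about 31 ms for the anomaly detection model (Step 2) and around 29 seconds for the VLM. This is a rough sketch that ignores image loading and other overheads; the function name is hypothetical, and the trigger rate (the fraction of images that exceed vlm_anomaly_threshold) is an assumed input, not a measured value.

```python
FAST_MODEL_SECONDS = 0.031  # on-device inference time reported by Edge Impulse Studio
VLM_SECONDS = 29.0          # approximate VLM processing time observed on the UNO Q


def expected_seconds_per_image(vlm_trigger_rate: float) -> float:
    """Average time per image when only a fraction of images reach the VLM."""
    if not 0.0 <= vlm_trigger_rate <= 1.0:
        raise ValueError("vlm_trigger_rate must be between 0 and 1")
    # Every image pays for the fast model; only flagged images pay for the VLM.
    return FAST_MODEL_SECONDS + vlm_trigger_rate * VLM_SECONDS
```

For example, with a hypothetical trigger rate of 5%, the average is roughly 1.48 seconds per image, compared with about 29 seconds if every image were sent to the VLM.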

In another test image, the package was structurally intact but had visible dirt on its surface. An image classification model would likely classify it as being in good condition. However, as the red markers show, the anomaly detection model flagged the dirty sections and the SmolVLM-500M model described the package as being ‘dirty with some marks on it’. In an automated production environment, an AI agent could use this natural language response to investigate potential causes such as contamination in the assembly line, dirty robot grippers, or other issues.

Finally, when evaluating another test image of a damaged package, both the visual anomaly detection model and the SmolVLM-500M model correctly identify the package condition. The VLM, however, describes the anomalies in more detail, reporting that ‘The package is old and damaged, with holes and peeling edges’.

Note that you can run either the SmolVLM-256M or the SmolVLM-500M model. However, in my experiments, the SmolVLM-256M model showed significant limitations in captioning. It occasionally produced hallucinated text, misidentified objects, and followed instructions less reliably than expected. Looking at the model’s training details, we can see that just 18% of the training data was dedicated to image captioning tasks. This and the smaller parameter count likely constrain its capacity, making it suitable for relatively simple image description use cases rather than detailed visual reasoning.

To load the SmolVLM-256M model, first download these open-source files and put them in the models folder: mmproj-SmolVLM-256M-Instruct-Q8_0.gguf and SmolVLM-256M-Instruct-Q8_0.gguf. Next, in main.py, update model_path and mmproj_path to point to the downloaded SmolVLM-256M files.
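
As an illustration of the path switch, the sketch below builds the two file paths for a given model size and reports which files are missing. It is not taken from main.py: the helper names are hypothetical, the models folder is assumed to be relative to the app directory, and the 500M file names are assumed to follow the same naming pattern as the 256M files named above, so check them against the files you actually downloaded.

```python
from pathlib import Path
from typing import List, Tuple


def smolvlm_paths(size: str, models_dir: Path = Path("models")) -> Tuple[Path, Path]:
    """Return (model_path, mmproj_path) for a SmolVLM variant such as "256M"."""
    if size not in {"256M", "500M"}:
        raise ValueError(f"unsupported SmolVLM size: {size}")
    # The naming pattern is confirmed for 256M only; 500M is an assumption.
    model_path = models_dir / f"SmolVLM-{size}-Instruct-Q8_0.gguf"
    mmproj_path = models_dir / f"mmproj-SmolVLM-{size}-Instruct-Q8_0.gguf"
    return model_path, mmproj_path


def missing_files(*paths: Path) -> List[str]:
    """List the given paths that do not exist as regular files."""
    return [str(p) for p in paths if not p.is_file()]
```

Calling missing_files(*smolvlm_paths("256M")) before starting the app gives an early, readable error instead of a failure when the model is first loaded.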

Conclusion

This project has demonstrated how cascading traditional Machine Learning with vision language models can create a more intelligent product inspection system. Lightweight models provide fast screening, while VLMs add contextual understanding. Together, this cascading architecture balances speed, accuracy and intelligence. Deploying VLMs introduces challenges due to their computational requirements. To address this, there is ongoing research in Edge GenAI, in which models are optimized to run on constrained hardware such as the Arduino UNO Q. In the same way that platforms such as Edge Impulse simplify training and deploying Machine Learning models, new hardware and software advancements are extending these capabilities to GenAI, with models such as SmolVLM making it feasible to run GenAI on constrained hardware.
