Transformation blocks

Transformation blocks are very flexible and can be used for most advanced use cases.

They can take raw data from your organizational datasets and convert it into files that can be loaded into an Edge Impulse project or into another organizational dataset. You can also use transformation blocks as cloud jobs that perform specific actions in standalone mode.

Transformation blocks are available in your organization pipelines and in your project pipelines so you can automate your processes.

You can use transformation blocks to fetch external datasets, augment/create variants of your data samples, generate synthetic datasets, extract metadata from config files, create helper graphs, align and interpolate measurements across sensors, or remove duplicate entries. The possibilities are endless.

Transformation blocks can be written in any language, and run on Edge Impulse infrastructure.

Only available in Edge Impulse Organizations

Try it out with our enterprise free trial or view our pricing for more information.

Transformation blocks can be complex to set up and are one of the most advanced features Edge Impulse provides. Feel free to ask your customer solution engineer for help and examples; we have set up complex pipelines for many customers, and our engineers have acquired a lot of expertise with transformation blocks.

Run transformation blocks

You can run your transformation blocks as transformation jobs. They can be triggered:

from your organization:

from your projects:

Public blocks

By default, we provide several pre-built transformation blocks that you can use directly in your organization or your organization's projects.

We will add more over time when we see a recurring need or interest. The current ones are the following:

Understanding the transformation blocks

A transformation block consists of a Docker image that contains one or several scripts. The Docker image is encapsulated in the transformation block with additional parameters.

Here is a minimal configuration for the transformation blocks:

In this documentation page, we explain how to set up a transformation block and walk through the different options.

Import existing transformation blocks

You can directly create your transformation block within Edge Impulse Studio from a public Docker image or import existing transformation blocks:

Example repository

You can find several transformation block examples in this Github repository. These are a great way to get started, either by importing them directly in your organization or by using them as a getting-started template.

To run the data transformation jobs, see the Data transformation documentation page.

Setting up transformation blocks

To set up your block, an easy method is to use the Edge Impulse CLI command edge-impulse-blocks init:

$> edge-impulse-blocks init 

Edge Impulse Blocks v1.21.1
? In which organization do you want to create this block? 
❯ Developer Relations 
  Medical Laboratories inc.
  Demo Team 
Attaching block to organization 'Developer Relations'
? Choose a type of block
❯ Transformation block 
  Deployment block 
  DSP block 
  Machine learning block
? Choose an option: 
❯ Create a new block 
  Update an existing block 
? Enter the name of your block: Generate helper graphs from sensor CSV
? Enter the description of your block: Transformation block to help you visualize how your sensor time series data looks by creating a graph from the CSV files
? What type of data does this block operate on? 
  File (--in-file passed into the block) 
  Data item (--in-directory passed into the block) 
❯ Standalone (runs the container, but no files / data items passed in)
? Which buckets do you want to mount into this block (will be mounted under /mnt/s3fs/BUCKET_NAME, you can change these mount points in the Studio)?
(Press <space> to select, <a> to toggle all, <i> to invert selection)
❯ ◉ edge-impulse-devrel-team
  ◯ ei-datasets
❯ yes 
  no 

Tip: If you want to access your bucket, make sure to press <space> to select the bucket attached to your organization.

The step above will create the following .ei-block-config in your project directory:

{
    "version": 1,
    "config": {
        "edgeimpulse.com": {
            "name": "Generate graphs from sensor CSV - Standalone",
            "type": "transform",
            "description": "Generate graphs from sensor CSV - Standalone",
            "organizationId": XXXX,
            "operatesOn": "standalone",
            "transformMountpoints": [
                {
                    "bucketId": 3096,
                    "mountPoint": "/mnt/s3fs/edge-impulse-devrel-team"
                }
            ],
            "id": 5086
        }
    }
}

To push your transformation block, simply run edge-impulse-blocks push.

Dockerfile

At Edge Impulse, we mostly use Python, JavaScript/TypeScript and Bash scripts, but you can write your transformation blocks in any language.

Dockerfile example to trigger a Bash script:

FROM ubuntu:latest

WORKDIR /app

# Copy the bash script into the container
COPY hello.sh /hello.sh

# Make the bash script executable
RUN chmod +x /hello.sh

# Set the entrypoint command to run the script with the provided --name argument
ENTRYPOINT ["/hello.sh"]

Dockerfile example to trigger a Python script and install the required dependencies:

FROM python:3.7.5-stretch

WORKDIR /app

# Python dependencies
COPY requirements.txt ./
RUN pip3 --no-cache-dir install -r requirements.txt

COPY . ./

ENTRYPOINT [ "python3",  "transform.py" ]

The Dockerfile above describes a base image (Python 3.7.5), the Python dependencies (in requirements.txt) and which script to run (transform.py).

Note: Do not use a WORKDIR under /home! The /home path will be mounted in by Edge Impulse, making your files inaccessible.

ENTRYPOINT vs RUN / CMD

If you create a custom Dockerfile, make sure to use ENTRYPOINT to specify the application to execute, rather than RUN or CMD.

If you want to host your Docker image on an external registry, you can use Docker Hub and enter the username/image:tag in the Docker container field.

Operation modes

We provide three modes to access your data:

  • In the Standalone mode, no data is passed to the container, but you can still access data by mounting your bucket onto the container.

  • At the Data item level, we pass the --in-directory and --out-directory arguments. The transformation jobs will run on each directory present in your selected path. These jobs can run in parallel.

  • At the file level, we pass the --in-file and --out-directory arguments. The transformation jobs will run on each file present in your selected path. These jobs can run in parallel.

Note that for the last two operation modes, you can use query filters to only include certain data items and certain files.
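
A single script can support all of these modes by treating the arguments as optional. Here is a minimal sketch in Python (the argument names are the ones listed above; everything else is illustrative):

import argparse

# Edge Impulse passes different arguments depending on the operation mode
parser = argparse.ArgumentParser(description='Organization transformation block')
parser.add_argument('--in-file', type=str, required=False, help="Set in File mode")
parser.add_argument('--in-directory', type=str, required=False, help="Set in Data item mode")
parser.add_argument('--out-directory', type=str, required=False, help="Where to write output files")
args, unknown = parser.parse_known_args()

if args.in_file:
    print('Running in File mode on', args.in_file)
elif args.in_directory:
    print('Running in Data item mode on', args.in_directory)
else:
    print('Running in Standalone mode')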

Standalone

The standalone mode is the most flexible option (it works on both generic and clinical datasets). You can consider this transformation block as a cloud job that you can use for anything in your machine learning pipelines.

Please note that this mode does not support running jobs in parallel, as it is unknown in advance how many files or how many directories are present in your dataset.

To access your data, you must mount your bucket/upload portal into the container. You can do this either when setting up your transformation block using the Edge Impulse CLI, or directly in the Studio when creating or editing a transformation block.

You can use custom block parameters to retrieve the bucket name and the required directory, so you can access your files programmatically.

Examples

Python script to create graphs from your CSV sensor data

For example, in Python, a script to create graphs from CSV sensor data:

import os, sys, argparse
import pandas as pd
import matplotlib.pyplot as plt

# Set the arguments
parser = argparse.ArgumentParser(description='Organization transformation block')
parser.add_argument('--sensor_name', type=str, required=True, help="Sensor data to extract to create the graph")
parser.add_argument('--bucket_name', type=str, required=False, help="Bucket where your dataset is hosted")
parser.add_argument('--bucket_directory', type=str, required=False, help="Directory in your bucket where your dataset is hosted")

args, unknown = parser.parse_known_args()

sensor_name = args.sensor_name
sensor_file = sensor_name + '.csv'

bucket_name = args.bucket_name
bucket_prefix = args.bucket_directory
mount_prefix = os.getenv('MOUNT_PREFIX', '/mnt/s3fs/')

folder = os.path.join(mount_prefix, bucket_name, bucket_prefix) if bucket_prefix else os.path.join(mount_prefix, bucket_name)

# Check if folder exists
if os.path.exists(folder):
    print('path exist', folder)
    for dirpath, dirnames, filenames in os.walk(folder):
        print("dirpath:",dirpath)
        if os.path.exists(os.path.join(dirpath, sensor_file)):
            print("File exist: ", os.path.join(dirpath, sensor_file))
            
            df = pd.read_csv(os.path.join(dirpath, sensor_file))
            df.index = pd.to_datetime(df['time'], unit='ns')

            # Get a list of all columns except 'time' and 'seconds_elapsed'
            columns_to_plot = [col for col in df.columns if col not in ['time', 'seconds_elapsed']]

            # Create subplots for each selected column
            fig, axes = plt.subplots(nrows=len(columns_to_plot), ncols=1, figsize=(12, 6 * len(columns_to_plot)))

            # plt.subplots() returns a single Axes object (not a list) when there is only one row
            if len(columns_to_plot) == 1:
                axes = [axes]

            for i, col in enumerate(columns_to_plot):
                axes[i].plot(df.index, df[col])
                axes[i].set_title(f'{col} over time')
                axes[i].set_xlabel('time')
                axes[i].set_ylabel(col)
                axes[i].grid(True)

            # Save the figure with all subplots in the same directory
            plt.tight_layout()
            print("Graph created")
            plt.savefig(os.path.join(dirpath, sensor_name))

            # Display the plots (optional)
            # plt.show()
        
        else:
            print("file is missing in directory ", dirpath)
     
else:
    print('Path does not exist')
    sys.exit(1)


print("Finished")
exit(0)
Bash script to print "Hello +name" in the log console

e.g. a Bash script that prints "Hello +name", where the name is passed as an argument to the transformation block using custom block parameters:

#!/bin/bash

while [[ $# -gt 0 ]]; do
  key="$1"

  case $key in
    --name)
      NAME="$2"
      shift # past argument
      shift # past value
      ;;
    *)
      # Unknown option
      echo "Unknown option: $1"
      exit 1
      ;;
  esac
done

echo "Hello $NAME"

Data item (--in-directory)

When selecting the Data item operation mode, two parameters will be passed to the container:

  • --in-directory

  • --out-directory

The transformation jobs will run on each "Data item" (directory) present in your selected path or dataset.

Example

For example, let's consider a clinical dataset like the following, where each data item has several files:

Now let's create a transformation block that simply outputs the arguments and copies the Accelerometer.csv file to the output dataset. This block is available in the transformation blocks GitHub repository.
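
Here is a minimal sketch of what such a transform script could look like (illustrative only, not the exact code from the repository); it supports both the Data item mode used here and the File mode used in the next example:

import argparse, os, shutil

parser = argparse.ArgumentParser(description='Copy a file to the output dataset')
parser.add_argument('--in-directory', type=str, required=False)
parser.add_argument('--in-file', type=str, required=False)
parser.add_argument('--out-directory', type=str, required=True)
args, unknown = parser.parse_known_args()

print('--in-directory: ', args.in_directory)
print('--out-directory: ', args.out_directory)
print('--in-file: ', args.in_file)

os.makedirs(args.out_directory, exist_ok=True)

if args.in_directory:
    # Data item mode: copy Accelerometer.csv from the data item directory
    src = os.path.join(args.in_directory, 'Accelerometer.csv')
elif args.in_file:
    # File mode: copy the file that was passed in
    src = args.in_file
else:
    raise SystemExit('No --in-directory or --in-file argument passed in')

print('Copying file from ', src, ' to ', args.out_directory)
shutil.copy(src, args.out_directory)
print('out-directory has now', os.listdir(args.out_directory))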

Set up the transformation job in the Studio:

You will be able to see logs for each data item looking like the following:

--in-directory:  /data/edge-impulse-devrel-team/datasets/activity-detection/Cycling-2023-09-14_06-47-00/
--out-directory:  /home/transform/26773541
--in-file:  None
in-directory has ['Accelerometer.csv', 'Accelerometer.png', 'Annotation.csv', 'Gravity.csv', 'Gyroscope.csv', 'Location.csv', 'LocationGps.csv', 'LocationNetwork.csv', 'Magnetometer.csv', 'Metadata.csv', 'Orientation.csv', 'Pedometer.csv', 'TotalAcceleration.csv']
out-directory has []
Copying file from  /data/edge-impulse-devrel-team/datasets/activity-detection/Cycling-2023-09-14_06-47-00/ to  /home/transform/26773541
out-directory has now ['Accelerometer.csv']

The copied file will be placed in a temporary directory and then be copied to the desired output dataset respecting the folder structure.

File (--in-file)

When selecting the File operation mode, two parameters will be passed to the container:

  • --in-file

  • --out-directory

The transformation jobs will run on each file present in the selected path.

Example

Let's use the same transformation block as above. First, make sure to set the operation mode to File by editing your transformation block:

Then set up the transformation job.

We have used the following filter: dataset = 'Activity Detection (Clinical view)' and file_name like '%Gyro%'

Run the jobs, and the logs for one file should look like the following:

--in-directory:  None
--out-directory:  /home/transform/26808538
--in-file:  /data/edge-impulse-devrel-team/datasets/activity-detection/Sitting-2023-09-14_09-11-15/Gyroscope.csv
out-directory has []
--in-file path /data/edge-impulse-devrel-team/datasets/activity-detection/Sitting-2023-09-14_09-11-15/Gyroscope.csv exist
coping Gyroscope.csv to /home/transform/26773541
out-directory has now ['Gyroscope.csv']

The copied file will be placed in a temporary directory and then be copied to the desired output dataset respecting the folder structure (along with the file we copied with the previous step using the Data item mode).

Compute requests & limits

When editing your block in Edge Impulse Studio, you can set the number of CPUs and the amount of memory requested for your container to run properly. Likewise, you can set limits for the same parameters.

Metadata (Data item and file operation modes)

You can update the metadata of data items directly from a transformation block by creating an ei-metadata.json file in the output directory. The metadata is then applied to the new data item automatically when the transform job finishes. The ei-metadata.json file has the following structure:

{
    "version": 1,
    "action": "add",
    "metadata": {
        "some-key": "some-value"
    }
}

Some notes:

  • If action is set to add, the metadata keys are added to the data item. If action is set to replace, all existing metadata keys are removed.
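
For example, a transform script can write this file into its --out-directory before it exits. A minimal sketch in Python (the metadata key and value are illustrative):

import json, os

out_directory = '/path/to/out-directory'  # the value passed in via --out-directory

# Illustrative metadata: add one key to the data item produced by this job
metadata = {
    "version": 1,
    "action": "add",
    "metadata": {
        "some-key": "some-value"
    }
}

with open(os.path.join(out_directory, 'ei-metadata.json'), 'w') as f:
    json.dump(metadata, f, indent=4)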

Mounting points

When using the CLI to set up your block, by default we mount your bucket with the following mounting point:

/mnt/s3fs/your-bucket

You can change this value if you want your transformation block to behave differently.
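
From inside the container, you can quickly check that the bucket is mounted where you expect it to be (a minimal sketch, using the default mount point shown above):

import os

mount_point = '/mnt/s3fs/your-bucket'  # default mount point created by the CLI

# List the top-level contents of the mounted bucket
print(os.listdir(mount_point))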

Custom parameters

See the dedicated Adding parameters to custom blocks documentation page.

Environmental variables

Transformation blocks get access to the following environmental variables, which let you authenticate with the Edge Impulse API. This way you don't have to inject these credentials into the block. The variables are:

  • EI_API_KEY - an API key with 'member' privileges for the organization.

  • EI_ORGANIZATION_ID - the organization ID that the block runs in.

  • EI_API_ENDPOINT - the API endpoint (default: https://studio.edgeimpulse.com/v1).
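
For instance, a Python block can use these variables to call the Edge Impulse API on behalf of the organization. A minimal sketch (the endpoint path below is illustrative; check the Edge Impulse API reference for the exact routes you need, and note that the requests package must be in your requirements.txt):

import os
import requests

# Injected by Edge Impulse when the transformation job runs
api_key = os.environ['EI_API_KEY']
organization_id = os.environ['EI_ORGANIZATION_ID']
api_endpoint = os.getenv('EI_API_ENDPOINT', 'https://studio.edgeimpulse.com/v1')

# Illustrative call: fetch information about the organization the block runs in
url = f'{api_endpoint}/api/organizations/{organization_id}'
response = requests.get(url, headers={'x-api-key': api_key})
print(response.json())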

Examples & resources

Standalone

  • Dall-E Image Generation (Python): Tutorial / GitHub

  • Text to speech transform block (Javascript): GitHub

  • Fetch a dataset hosted on Kaggle (Python): Github

  • Generate graph from sensor csv data (Python): Github

  • Hello Edge (Bash): Github

File (--in-file)

  • Mix background noise into audio files (Bash script): GitHub

  • Access your data - Helper transformation block (Python): Github

  • Resample CSV (Python): Github

Data Item (--in-directory)

  • Access your data - Helper transformation block (Python): Github

  • Check file existence - Add ei_check metadata on file existence (Python): Github

  • Merge CSV files - Merge CSV files on a given key (Python): Github

  • Merge audio and CSV - Merge audio file and time-series CSV (Python): Github

Recap

Now that you have a better idea of what transformation blocks are, here is a graphical recap of how they work:

Troubleshooting

The job runs indefinitely

If you notice that your jobs run indefinitely, it is probably because of an error, or because the script did not terminate properly. Make sure to exit your script with code 0 (return 0, exit(0) or sys.exit(0)) for success, or with any other error code for failure.

Cannot access files in bucket

If you cannot access your files in your bucket, make sure that the mount point is properly configured.

When using the CLI, it is a common mistake to forget to press the <space> key to select the bucket attached to your organization.

Job failed without logs (only Job failed)

It probably means that we had an issue when triggering the container. In many cases it is related to the issue above: the mount point not being properly configured.

I cannot access the logs

We are still investigating why all the logs are not displayed properly. If you are using Python, you can also flush stdout after you print it using something like print("hello", flush=True).

Can I host my Docker image on Docker Hub?

Yes, you can. You can test this Standalone transformation block if you'd like: luisomoreau/hello_edge:latest

Also, make sure to configure the additional block parameters with this config:

[
    {
        "name": "Name",
        "type": "string",
        "param": "name",
        "value": "",
        "help": "Person to greet"
    }
]

It will print "hello +name" in the transformation job logs.
