The "data campaigns" feature allows you to quickly track your experiments and the progress of your models' development. It is an overview of your pipelines where you can easily extract useful information from your datasets and correlate those metrics with your model performance.
It has been primarily designed to follow clinical research data processes. In August 2023, we released this feature for every enterprise user as we see value in being able to track metrics between your datasets and your projects.
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
To get started, navigate to the Data campaigns tab in your organization:
Click on + Create new dashboard.
Give your dashboard a name, and select one or more collaborators to receive the daily updates by email. If you don't want to be spammed, you can select when you want to receive these updates: Always, On new data, changes or on error, or Never. Finally, set how many of the most recent days are shown in the graphs:
You can create as many dashboards as needed, simply click on + Create a new dashboard from the dropdown available under your current dashboard:
If you want to delete a dashboard, click on Actions... -> Delete dashboard
Once your dashboard is created, you can add your custom campaigns. It's where you will specify which metrics are important to you and your use case. Click on Actions... -> Add campaign
Fill the form to create your campaign:
Name: Name of your data campaign.
Description: Description of your data campaign.
Campaign coordinators: Add the collaborators that are engaged with this campaign
Datasets: Select the datasets you want to visualize in your campaign. You can add several datasets.
Projects: Select the projects you want to visualize in your campaign. You can add several projects.
Pipelines: Select the pipeline that is associated with your campaign. Note that this is for reference only, it is currently not displayed in your campaign
Links: Select between the link type you need. Options are Github, Spreadsheet, Text Document, Code repository, List or Folder. Add a name and the link. This place is useful for other collaborators to have all the needed information about your project, gathered in one place under your campaign.
Additional queries to track: These queries are data filters that need to be written in the SQL WHERE format. See Querying data for more information. For example, `metadata->age >= 18` will return the data samples from adult patients.
You can then save your data campaign and it will be added to your dashboard:
This dashboard shows the metrics' progress from the Health reference design data
If you want to edit or delete your campaign, click on the "⋮" button on the right side of your campaign:
Within an organization you can work on one or more projects with multiple people. These can be colleagues, outside researchers, or even members of the community. They will only get access to the specific data in the project, and not to any of the raw data in your organizational datasets.
To invite a user to an organization, click on the "Add user" button, enter the email address and select the role:
It is important to note that there are two types of users in Edge Impulse: Project Users and Organization Users.
Organization Users, typically holding roles like Admin, are responsible for the overarching management and customization of organizational elements, including datasets, storage buckets, and white label attributes. These users also encompass the capabilities of Project Users.
Conversely, Project Users, often in roles such as Member or Guest, are limited to specific project involvement, focusing on collaboration and contributions at the project level, without access to broader organizational management functions. They are granted access only to certain project data to maintain the security of raw data in organizational datasets.
For a more granular look at the capabilities of each role, see the table below:
Admins have full rights on the organization, overseeing organizational and white label functionalities, including dataset management and storage bucket updates. They also have all the rights of a Project Member.
Full Rights on the Organization
Project User rights
Manage organization datasets
Update and add storage buckets
Verify bucket connectivity
Customize white label (where applicable) attributes like themes and information
API access for organization and white label management
Members have full access to the datasets and custom blocks, but cannot join a project without being invited.
Broad Access, with Restrictions on Project Joining
Project User rights
Full access to datasets and custom blocks
Can collaborate on projects, but only by invitation
Can access metrics via API
Guests have restricted access, limited to selected datasets within the projects they are associated with.
Limited Access to Selected Datasets
Project User rights
Access to selected datasets within the project they are invited to
Cannot access raw data in organizational datasets
Cannot access metrics via API
To give someone access to a project only, go to your project's dashboard, and find the "Collaborators" widget. Click the '+' icon, and type the username or e-mail address of the other user.
Since the creation of Edge Impulse, we have been helping customers to deal with complex data pipelines, complex data transformation methods and complex clinical validation studies.
In most cases, before even thinking about machine learning algorithms, researchers need to build quality datasets from real-world data. These data come from various devices (prototype devices being developed vs clinical/industrial-grade reference devices), have different formats (Excel sheets, images, CSV, JSON, etc.), and are stored in various places (researchers' computers, Dropbox folders, Google Drive, S3 buckets, etc.).
Dealing with such complex data infrastructure is time-consuming and expensive to develop and maintain. With the organizational data, we want to give you tools to centralize, validate and transform datasets so they can be easily imported into your projects to train your machine learning models.
Health reference design
We have built a health reference design that describes an end-to-end ML workflow for building a wearable health product using Edge Impulse.
In this reference design, we want to help you understand how to create a full clinical data pipeline by:
Before we get started, you must link your organization with one or several storage buckets. First, select where your data lives:
AWS S3 buckets
Google Cloud Storage
Any S3-compatible bucket
And fill the form with your bucket name, region, endpoint access and secret keys:
A green dot indicates that your bucket is connected:
Two types of dataset structures can be used - Generic datasets (default) and Clinical datasets.
There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.
However, to import data items to an Edge Impulse project, you will need to use the right format as our studio ingestion API only supports these formats:
JPG, PNG images
MP4, AVI video files
WAV audio files
CBOR/JSON files in the Edge Impulse data acquisition format
CSV files
Tip: You can use transformation blocks to convert your data
The default dataset structure is a file-based one, no matter the directory structure:
For example:
or:
Note that you will be able to associate the labels of your data items from the file name or the directory name when importing your data in a project.
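To make the two labeling options concrete, here is a minimal Python sketch of how a label could be derived from either the file name or the directory name. The rule used for file names (everything before the first dot) and the example paths are assumptions made for illustration; the import wizard in the Studio is the reference for the actual behavior.

```python
from pathlib import PurePosixPath


def label_from_filename(path: str) -> str:
    """Assumed rule: the label is the part of the file name before the first dot."""
    return PurePosixPath(path).name.split(".", 1)[0]


def label_from_directory(path: str) -> str:
    """The label is the name of the directory that directly contains the file."""
    return PurePosixPath(path).parent.name


# Hypothetical paths, only used to illustrate both options.
for example in ["idle/sample.1.wav", "recordings/wave.3.csv"]:
    print(example, "->", label_from_filename(example), "|", label_from_directory(example))
```

Which option fits best depends on how your bucket is organized: per-class folders map naturally to directory labels, while flat folders usually carry the class in the file name.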
The clinical datasets structure in Edge Impulse has three layers:
The dataset, a larger set of data items, grouped together.
Data item, an item with metadata and files attached.
Data file, the actual files.
See the health reference design tutorial for a deeper explanation.
Once you successfully linked your storage bucket to your organization, head to the Datasets tab and click on + Add new dataset:
Fill out the following form:
Click on Create dataset
With your datasets imported, you can now navigate into your dataset, create folders, query your dataset, add data items and import your data to an Edge Impulse project.
The clinical view is slightly different, see synchronizing clinical data with a bucket for more information. This view lets you easily query your clinical dataset but to import data, you will need to set up an upload portal or upload them directly to your bucket.
Tip: You can add two distinct datasets in Edge Impulse that point to the same bucket path, one generic and one clinical. This way you can leverage both the easy upload and the ability to query your datasets.
Go to the Actions...->Import data into a project, select the project you wish to import to and click Next, Configure how to label this data:
This will import the data into the project and optionally create a new label for each file in the dataset. This labeling step helps you keep track of different classes or categories within your data.
After importing the data into the project, in the Next, post-sync actions step, you can configure a data pipeline to automatically retrieve and trigger actions in your project:
We also have added a data preview feature, allowing you to visualize certain types of data directly within the organization data tab.
Supported data types include tables (CSV/Parquet), images, PDFs, audio files (WAV/MP3), and text files (TXT/JSON). This feature gives you a quick overview of your data and helps ensure its integrity and correctness.
Any questions, or interested in the enterprise version of Edge Impulse? Contact us for more information.
If you see the following message, make sure to add the CORS header to your bucket settings:
You can also add the CORS configuration using the AWS S3 CLI, with this cors.json file:
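The content of cors.json is not reproduced here, so the following Python sketch only shows how a file in the shape expected by the S3 CORS configuration (a `CORSRules` list with `AllowedOrigins`, `AllowedMethods`, `AllowedHeaders` and `MaxAgeSeconds`) could be generated. The origin, the list of methods and the max age are assumptions; use the values shown in the instructions that the Studio displays for your bucket.

```python
import json


def build_cors_config(allowed_origins, allowed_methods=("GET", "PUT", "POST", "HEAD")):
    """Build a CORS configuration in the shape used by S3 bucket CORS rules."""
    return {
        "CORSRules": [
            {
                "AllowedOrigins": list(allowed_origins),
                "AllowedMethods": list(allowed_methods),
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,  # assumed value, adjust as needed
            }
        ]
    }


# The origin below is an assumption, not a value taken from this page.
config = build_cors_config(["https://studio.edgeimpulse.com"])
print(json.dumps(config, indent=4))
```

Once saved as cors.json, a file like this would typically be applied with `aws s3api put-bucket-cors --bucket <your-bucket> --cors-configuration file://cors.json`, where the bucket name is a placeholder.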
If you need to get data into your organization, you can now do this in a few simple steps. To go further and use advanced features, query your datasets or transform your dataset, please have a look at the health reference design tutorial
Transformation blocks are very flexible and can be used for most advanced use cases.
They can either take raw data from your organizational datasets and convert the data into files that can be loaded in an Edge Impulse project or another organizational dataset, or run as cloud jobs that perform specific actions using standalone mode.
Transformation blocks are available in your organization pipelines and in your project pipelines so you can automate your processes.
You can use transformation blocks to fetch external datasets, augment/create variants of your data samples, generate synthetic datasets, extract metadata from config files, create helper graphs, align and interpolate measurements across sensors, or remove duplicate entries. The possibilities are endless.
Transformation blocks can be written in any language, and run on Edge Impulse infrastructure.
Transformation blocks can be complex to set up and are one of the most advanced features Edge Impulse provides. Feel free to ask your customer solution engineer for help and examples: we have been setting up complex pipelines for our customers, and our engineers have acquired a lot of expertise with transformation blocks.
You can run your transformation blocks as transformation jobs. They can be triggered:
from your organization:
From this view, Custom blocks->Transformation
From the Data transformation
From the Data pipelines
from your projects:
From the Data sources (Standalone transformation blocks only)
From Synthetic data (Synthetic data blocks only)
By default, we provide several pre-built transformation blocks that you can use directly in your organization or your organization's projects.
We will add more over time when we see a recurring need or interest. The current ones are the following:
A transformation block consists of a Docker image that contains one or several scripts. The Docker image is encapsulated in the transformation block with additional parameters.
Here is a minimal configuration for the transformation blocks:
On this documentation page, we explain how to set up a transformation block and describe the different options.
You can directly create your transformation block within Edge Impulse Studio from a public Docker image or import existing transformation blocks:
Example repository
You can find several transformation block examples in this Github repository. These are a great way to get started, either by importing them directly in your organization or by using them as a getting-started template.
To run the data transformation jobs, see the Data transformation documentation page.
To set up your block, an easy method is to use the Edge Impulse CLI command edge-impulse-blocks init:
Tip: If you want to access your bucket, make sure to press <space> to select the bucket attached to your organization.
The step above will create the following .parameters.json file in your project directory:
To push your transformation block, simply run edge-impulse-blocks push.
At Edge Impulse, we mostly use Python, Javascript/Typescript and Bash scripts, but you can write your transformation blocks in any language.
Dockerfile example to trigger a Bash script:
Dockerfile example to trigger a Python script and install the required dependencies:
The Dockerfile above describes a base image (Python 3.7.5), the Python dependencies (in requirements.txt) and which script to run (transform.py).
Note: Do not use a WORKDIR under /home! The /home path will be mounted in by Edge Impulse, making your files inaccessible.
ENTRYPOINT vs RUN / CMD
If you create a custom Dockerfile, make sure to use ENTRYPOINT to specify the application to execute, rather than RUN or CMD.
If you want to host your docker image on an external registry, you can use Docker Hub and use the username/image:tag in the Docker container field.
We provide three modes to access your data:
In the Standalone mode, no data is passed to the container, but you can still access data by mounting your bucket onto the container.
At the Data item level, we pass the --in-directory and --out-directory arguments. The transformation jobs will run on each directory present in your selected path. These jobs can run in parallel.
At the file level, we pass the --in-file and --out-directory arguments. The transformation jobs will run on each file present in your selected path. These jobs can run in parallel.
Note that for the last two operation modes, you can use query filters to only include certain data items and certain files.
The stand-alone method is the most flexible option (it can work on both generic and clinical datasets). You can consider this transformation block as a cloud job that you can use for anything in your machine learning pipelines.
Please note that this mode does not support running jobs in parallel, as it is unknown in advance how many files or how many directories are present in your dataset.
To access your data, you must mount your bucket/upload portal into the container. You can do this either when setting up your transformation block using the Edge Impulse CLI, or directly in the Studio when creating/editing a transformation block.
You can use custom blocks parameters to retrieve the bucket name and the required directory to access your files programmatically.
Examples
Data item (--in-directory)
When selecting the Data item operation mode, two parameters will be passed to the container:
--in-directory
--out-directory
The transformation jobs will run on each "Data item" (directory) present in your selected path or dataset.
File (--in-file)
When selecting the File operation mode, two parameters will be passed to the container:
--in-file
--out-directory
The transformation jobs will run on each file present in the selected path.
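As an illustration of how a script can receive these arguments, here is a minimal Python sketch of an entry point that accepts --in-file, --in-directory and --out-directory. The function names and the copy step are hypothetical placeholders for your own transformation logic, and unknown arguments are ignored on the assumption that a block may also receive custom parameters.

```python
import argparse
import os
import shutil
import tempfile


def parse_args(argv=None):
    """Parse the arguments passed to the container in File or Data item mode."""
    parser = argparse.ArgumentParser(description="Transformation block sketch")
    parser.add_argument("--in-file", type=str, help="set in File mode")
    parser.add_argument("--in-directory", type=str, help="set in Data item mode")
    parser.add_argument("--out-directory", type=str, required=True)
    # Ignore anything else (for example custom block parameters).
    args, _unknown = parser.parse_known_args(argv)
    return args


def list_inputs(args):
    """Return the input files for the current job, whichever mode is used."""
    if args.in_file:
        return [args.in_file]
    if args.in_directory:
        names = sorted(os.listdir(args.in_directory))
        return [os.path.join(args.in_directory, name) for name in names]
    return []


def transform(args):
    """Placeholder transformation: copy every input file to the output directory."""
    os.makedirs(args.out_directory, exist_ok=True)
    written = []
    for src in list_inputs(args):
        if os.path.isfile(src):
            dst = os.path.join(args.out_directory, os.path.basename(src))
            shutil.copyfile(src, dst)
            written.append(dst)
    return written


# Local demonstration with temporary folders instead of a mounted bucket.
with tempfile.TemporaryDirectory() as tmp:
    data_item = os.path.join(tmp, "item-001")
    os.makedirs(data_item)
    with open(os.path.join(data_item, "accel.csv"), "w") as f:
        f.write("timestamp,accX\n0,0.1\n")
    demo_args = parse_args(["--in-directory", data_item,
                            "--out-directory", os.path.join(tmp, "out")])
    print([os.path.basename(p) for p in transform(demo_args)])
```

Inside a real block, the script would call `transform(parse_args())` so that the arguments come from the command line set by the job.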
When editing your block in Edge Impulse Studio, you can set the number of desired CPUs and the memory needed for your container to run properly. Likewise, you can set limits for the same parameters.
You can update the metadata of data items directly from a transformation block by creating an ei-metadata.json file in the output directory. The metadata is then applied to the new data item automatically when the transform job finishes. The ei-metadata.json file has the following structure:
Some notes:
If action is set to add, the metadata keys are added to the data item. If action is set to replace, all existing metadata keys are removed.
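The sketch below shows one way a Python transformation script could write an ei-metadata.json file into its output directory. Only the file name and the `action` values (add, replace) come from this page; the `version` and `metadata` field names, and the helper name `write_ei_metadata`, are assumptions that should be checked against the structure expected by your organization.

```python
import json
import os
import tempfile


def write_ei_metadata(out_directory, metadata, action="add"):
    """Write ei-metadata.json into the output directory of a transformation job."""
    if action not in ("add", "replace"):
        raise ValueError("action must be 'add' or 'replace'")
    payload = {
        "version": 1,  # assumed field
        "action": action,
        "metadata": {str(k): str(v) for k, v in metadata.items()},  # assumed field
    }
    os.makedirs(out_directory, exist_ok=True)
    path = os.path.join(out_directory, "ei-metadata.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=4)
    return path


# Local demonstration in a temporary folder; the key and value are hypothetical.
with tempfile.TemporaryDirectory() as out_dir:
    written = write_ei_metadata(out_dir, {"ei_check": 1}, action="add")
    with open(written, encoding="utf-8") as f:
        print(f.read())
```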
When using the CLI to set up your block, by default we mount your bucket with the following mount point:
You can change this value if you want your transformation block to behave differently.
See adding parameters to custom blocks dedicated documentation page.
Transformation blocks get access to the following environment variables, which let you authenticate with the Edge Impulse API. This way you don't have to inject these credentials into the block. The variables are:
EI_API_KEY - an API key with 'member' privileges for the organization.
EI_ORGANIZATION_ID - the organization ID that the block runs in.
EI_API_ENDPOINT - the API endpoint (default: https://studio.edgeimpulse.com/v1).
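A minimal Python sketch of reading these variables inside a block could look like this. The helper names are hypothetical, and the `x-api-key` header is stated here as an assumption about how the API key is sent; check the API reference before relying on it.

```python
import os

DEFAULT_ENDPOINT = "https://studio.edgeimpulse.com/v1"


def load_ei_credentials(env=None):
    """Read the credentials injected into the transformation block."""
    env = os.environ if env is None else env
    missing = [key for key in ("EI_API_KEY", "EI_ORGANIZATION_ID") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["EI_API_KEY"],
        "organization_id": env["EI_ORGANIZATION_ID"],
        "endpoint": env.get("EI_API_ENDPOINT", DEFAULT_ENDPOINT).rstrip("/"),
    }


def auth_headers(credentials):
    """Build request headers (header name is an assumption, see lead-in)."""
    return {"x-api-key": credentials["api_key"]}


# Demonstration with fake values instead of the real environment.
fake_env = {"EI_API_KEY": "ei_0123", "EI_ORGANIZATION_ID": "42"}
creds = load_ei_credentials(fake_env)
print(creds["endpoint"], creds["organization_id"], list(auth_headers(creds)))
```

Passing the environment as a parameter keeps the helper easy to test locally, where the variables are not set.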
Label image data using GPT-4o: Label image data using GPT-4o block
Text to speech transform block (Javascript): GitHub
Fetch a dataset hosted on Kaggle (Python): Github
Generate graph from sensor csv data (Python): Github
Hello Edge (Bash): Github
File (--in-file)
Mix background noise into audio files (Bash script): GitHub
Access your data - Helper transformation block (Python): Github
Resample CSV (Python): Github
Data item (--in-directory)
Access your data - Helper transformation block (Python): Github
Check file existence - Add ei_check metadata on file existence (Python): Github
Merge CSV files - Merge CSV files on a given key (Python): Github
Merge audio and CSV - Merge audio file and time-series CSV (Python): Github
Now that you have a better idea of what transformation blocks are, here is a graphical recap of how they work:
This is the specification for the parameters.json file:
The job runs indefinitely
If you notice that your jobs run indefinitely, it is probably because of an error or because the script has not been properly terminated. Make sure to exit your script with code 0 (return 0, exit(0) or sys.exit(0)) for success or with any other error code for failure.
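One way to make sure a Python script always terminates with an explicit exit code is to wrap the work in a small runner, as in this sketch. The `run` and `main` names are hypothetical; the point is simply that every path ends in `sys.exit` with 0 on success and a non-zero code on failure.

```python
import sys


def run(work):
    """Run the transformation and translate the outcome into an exit code."""
    try:
        work()
    except Exception as err:
        # flush=True so the message reaches the job logs before the process exits
        print(f"Transformation failed: {err}", file=sys.stderr, flush=True)
        return 1
    print("Transformation finished", flush=True)
    return 0


def main(work):
    """Entry point for a real block: always ends the process explicitly."""
    sys.exit(run(work))


# Demonstration without exiting the interpreter.
print(run(lambda: None))
print(run(lambda: 1 / 0))
```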
Cannot access files in bucket
If you cannot access your files in your bucket, make sure that the mount point is properly configured.
When using the CLI, it is a common mistake to forget to press the <space> key to select the bucket attached to your organization.
Job failed without logs (only Job failed)
It probably means that we had an issue when triggering the container. In many cases it is related to the issue above: the mount point is not properly configured.
I cannot access the logs
We are still investigating why all the logs are not displayed properly. If you are using Python, you can also flush stdout after you print it using something like print("hello", flush=True).
Can I host my Docker image on Docker Hub?
Yes, you can. You can test this Standalone transformation block if you'd like: luisomoreau/hello_edge:latest
Also, make sure to configure the additional block parameters with this config:
It will print "hello +name" on the transformation job logs.
Custom blocks are cloud jobs that can be hosted and used on Edge Impulse. They serve a dedicated task, are extremely flexible, let you customize your experience and speed up your time-to-market.
Creating a transformation block - to fetch, sort, validate, combine and transform existing data into robust datasets that can be imported into your projects.
Building and hosting custom DSP blocks - to create and host your custom signal processing techniques and use them directly in your projects.
Create a custom learning block - to use your custom models and load pre-trained weights with PyTorch, Keras or scikit-learn.
Building deployment blocks - to create custom deployment targets for your products.
In this section, you will find a health reference design that describes an end-to-end ML workflow for building a wearable health product using Edge Impulse. It covers an activity study in a clinical lab, where data is recorded from the wearable end device (PPG + accelerometer), a reference device (Polar H10 HR monitor), plus labels (e.g. sitting, running, biking). The data is collected and validated, then written to a clinical dataset in an Edge Impulse organization, and finally imported into an Edge Impulse project where we train a classifier.
It handles data coming from multiple sources, data alignment, and a multi-stage pipeline before the data is imported into an Edge Impulse project. We won't cover all the code snippets in detail; our solution engineers can help you set up this end-to-end ML workflow.
With this health reference design section, we want to help you understand how to create a full clinical data pipeline by:
This is the specification for the deployment-metadata.json
file from .
Your Edge Impulse Organization enables your team to collaborate on multiple datasets, automation, and models in a shared workspace. It provides tools to automate data preparation tasks with reusable pipelines, enabling data transformation, preparation, and analysis of sensor data at scale. It allows anyone in your team to quickly access relevant data through familiar tools, adds versioning and traceability to your machine learning models, and lets you quickly create and monitor your Edge Impulse projects for optimal on-device performance.
To get started, follow these guides:
Health reference design
Existing enterprise users or enterprise trial users can view their entitlement limits via the dashboard of their enterprise organization:
This view allows you to see your organization's current usage of total users, projects, compute time and storage limits. To increase your organization's limits, select the Request limit increase button to contact sales.
Upload portals are a secure way to let external parties upload data to your datasets. Through an upload portal they get an easy user interface to add data, but they have no access to the content of the dataset, nor can they delete any files. Data that is uploaded through the portal can be stored on-premise or in your own cloud infrastructure.
In this tutorial we'll set up an upload portal, show you how to add new data, and how to show this data in Edge Impulse for further processing.
Data is stored in storage buckets. You can use any of:
AWS S3 buckets
Google Cloud Storage
Any S3-compatible bucket
See .
With your storage bucket configured you're ready to set up your first upload portal. In your organization go to Data > Upload portals and choose Create new upload portal. Here, select a name, a description, the storage bucket, and a path in the storage bucket.
Note: You'll need to enable CORS headers on the bucket. If these are not configured you'll get prompted with instructions. Talk to your user success engineer (when your data is hosted by Edge Impulse), or your system administrator to configure this.
After your portal is created a link is shown. This link contains an authentication token, and can be shared directly with the third party.
Click the link to open the portal. If you ever forget the link: no worries. Click the ⋮ next to your portal, and choose View portal.
To upload data you can now drag & drop files or folders to the drop zone on the right, or use Create new folder to first create a folder structure. There's no limit to the number of files you can upload here, and all files are hashed, so if you upload a file that's already present the file will be skipped.
Note: Files with the same name but with a different hash are overwritten.
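The skip and overwrite rules above can be summarized with a short Python sketch. The hash algorithm used by the portal is not stated on this page, so SHA-256 is an assumption, and `plan_upload` is a hypothetical helper that only models the decision, not the upload itself.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Hash the file content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def plan_upload(existing: dict, name: str, data: bytes) -> str:
    """Decide what happens to a file, given a {file name: hash} index of the portal."""
    if name not in existing:
        return "upload"
    if existing[name] == file_hash(data):
        return "skip"  # same name and same content: already present
    return "overwrite"  # same name, different content


portal_index = {"subject-01.csv": file_hash(b"a,b\n1,2\n")}
print(plan_upload(portal_index, "subject-01.csv", b"a,b\n1,2\n"))
print(plan_upload(portal_index, "subject-01.csv", b"a,b\n3,4\n"))
print(plan_upload(portal_index, "subject-02.csv", b"a,b\n1,2\n"))
```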
Mount the portal directly into a transformation block via Custom blocks > Transformation blocks > Edit block, and select the portal under mount points.
Here's a Python script which uploads, lists and downloads data to a portal. To upload data you'll need to authenticate with a JWT token, see below this script for more info.
And here's a script to generate JWT tokens:
One of the most powerful features in Edge Impulse is the set of built-in deployment targets (under Deployment in the Studio), which let you create ready-to-go binaries for development boards, or custom libraries for a wide variety of targets that incorporate your trained impulse. You can also create custom deployment blocks for your organization. This lets developers quickly iterate on products without getting your embedded engineers involved, lets your customers build personalized firmware using their own data, or lets you create custom libraries.
In this tutorial you'll learn how to use custom deployment blocks to create a new deployment target, and how to make this target available in the Studio for all users in the organization.
You'll need:
The Edge Impulse CLI.
If you receive any warnings that's fine. Run edge-impulse-blocks afterwards to verify that the CLI was installed correctly.
Deployment blocks use Docker containers, a virtualization technique which lets developers package up an application with all dependencies in a single package. If you want to test your blocks locally you'll also need (this is not a requirement):
Docker desktop installed on your machine.
Go to and clone (or download) the repository. Then, open a command prompt or terminal window and run:
To initialize the block.
When a user deploys with a custom deployment block two things happen:
A package is created that contains information about the deployment (like the sensors used, frequency of the data, etc.), any trained neural network in .tflite and SavedModel formats, the Edge Impulse SDK, and all DSP and ML blocks as C++ code.
This package is then consumed by the custom deployment block, which can incorporate it with a base firmware, or repackage it into a new library.
To test this locally, you can download this package from the Studio. In your Edge Impulse project go to Deployment, and search for Custom block.
Once you click Build you'll receive a ZIP file containing the following items:
trained.tflite - if you have a neural network in the project this contains the neural network in .tflite format. This network is already fully quantized if you choose the int8 optimization, otherwise this is the float32 model.
trained.savedmodel.zip - if you have a neural network in the project this contains the full TensorFlow SavedModel. Note that we might update the TensorFlow version used to train these networks at any time, so rely on the compiled model or the TFLite file where possible.
model-parameters - impulse and block configuration in C++ format. Can be used by the SDK to quickly run your impulse.
tflite-model - neural network as source code in a way that can be used by the SDK to quickly run your impulse.
Store all these files under example-custom-deployment-block/input.
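Before invoking the container, it can help to verify that the input folder holds what the block expects. The Python sketch below checks for the items named on this page (deployment-metadata.json, edge-impulse-sdk, model-parameters, and the neural network files). Treating the neural network items as optional is an interpretation of "if you have a neural network in the project", and `check_input_directory` is a hypothetical helper.

```python
import os
import tempfile

# Items described on this page; the split between the two lists is an assumption.
ALWAYS_EXPECTED = ["deployment-metadata.json", "edge-impulse-sdk", "model-parameters"]
NEURAL_NETWORK_ONLY = ["trained.tflite", "trained.savedmodel.zip", "tflite-model"]


def check_input_directory(input_dir):
    """Report which expected items are missing from the input directory."""
    present = set(os.listdir(input_dir))
    return {
        "missing": [name for name in ALWAYS_EXPECTED if name not in present],
        "optional_missing": [name for name in NEURAL_NETWORK_ONLY if name not in present],
    }


# Demonstration on a temporary folder that only contains part of the package.
with tempfile.TemporaryDirectory() as input_dir:
    os.makedirs(os.path.join(input_dir, "edge-impulse-sdk"))
    os.makedirs(os.path.join(input_dir, "model-parameters"))
    print(check_input_directory(input_dir))
```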
To test your deployment block you first build the container, then invoke it with the files from the input directory. Open a command prompt or terminal, navigate to the example-custom-deployment-block folder and:
Build the container:
Invoke the build script - this mounts the current directory in the container under /home, and then passes the downloaded metadata script to the container:
Or if you run Windows or macOS, you can use Docker to run this application:
With the deployment block ready you can make it available in Edge Impulse. Open a command prompt or terminal window, navigate to the folder you created earlier, and run:
This packages up your folder, sends it to Edge Impulse where it'll be built, and finally adds it to your organization. The deployment block is now available in Edge Impulse under Custom blocks > Deployment blocks. You can go here to set the logo, update the description, and set extra command line parameters.
The deployment block is automatically available for all organizational projects. Go to the Deployment page on a project, and search for your block:
Just click Build and now you'll have a freshly built binary from your own deployment block!
Custom deployment blocks are a powerful tool for your organization. They let you build binaries for unreleased products, let you package up impulses as custom libraries, or can let your customers deploy to private targets (if you add an external collaborator to a project they'll have access to the blocks as well). Because the deployment blocks are integrated with your project and hosted by Edge Impulse, everyone from FAE to R&D developer can now iterate on on-device models without getting your embedded engineers involved.
This is the specification for the parameters.json type:
- to add collaborators with different access rights.
- to track and visualize all your metrics over time.
- to connect a storage bucket, to learn how to deal with such complex data infrastructure and to import your data samples into your projects.
- to chain several transformation blocks and to import data into your projects.
- to run your transformation blocks and get an overview of the running jobs.
- to allow external parties to securely contribute data to your datasets.
- to match any specific use cases using dedicated cloud jobs.
We have built a that describes an end-to-end ML workflow for building a wearable health product using Edge Impulse. It is a good tutorial to understand how we handle complex data infrastructure and discover the organization's advanced features.
If you want to process data in a portal as part of a you can either:
Mount the bucket that the portal is in, as a transformation block. This will also give you access to all other data in the bucket, very useful if you need to sync other data (see ).
If the data in your portal is already in the right format you can also directly import the uploaded data to your project. In your project view, go to , select 'Upload portal' and follow the steps of the wizard:
If you need a secure way for external parties to contribute data to your datasets then upload portals are the way to go. They offer a friendly user interface, upload data directly into your storage buckets, and give you an easy way to use the data directly in Edge Impulse.
Any questions, or interested in the enterprise version of Edge Impulse? Contact us for more information.
deployment-metadata.json - this contains all information about the deployment, like the names of all classes, the frequency of the data, full impulse configuration, and quantization parameters. A specification can be found here: .
edge-impulse-sdk - a copy of the latest .
Voila. You now have an output folder that contains a ZIP file. Unzip output/deploy.zip and now you have a standalone application which runs your impulse. If you run Linux you can invoke this application directly (grab some data from 'Live classification' for the features, see ):
Deployment blocks do not have access to the internet by default. If you need this, or if you need to pull additional information from the project (e.g. access to DSP blocks) you can set the 'privileged' flag on a deployment block. This will enable outside internet access, and will pass in the project.apiKey parameter in the (if a development API key is set) that you can use to authenticate with the .
You can also use custom deployment blocks with the other organizational features, and can use this to set up powerful pipelines automating , , training new impulses and then deploying back to your device - either through the UI, or via the API. If you're interested in deployment blocks or any of the other enterprise features,
Transformation blocks take raw data from your organizational datasets and convert the data into a different dataset or files that can be loaded in an Edge Impulse project. You can use transformation blocks to only include certain parts of individual data files, calculate long-running features like a running mean or derivatives, or efficiently generate features with different window lengths. Transformation blocks can be written in any language, and run on the Edge Impulse infrastructure.
In this tutorial we build a Python-based transformation block that loads Parquet files, calculates features from the Parquet file, and then writes a new file back to your dataset. If you haven't done so, go through synchronizing clinical data with a bucket first.
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
You'll need:
The Edge Impulse CLI.
If you receive any warnings that's fine. Run edge-impulse-blocks
afterwards to verify that the CLI was installed correctly.
The gestures.parquet file which you can use to test the transformation block. This contains some data from the Continuous gestures dataset in Parquet format.
Transformation blocks use Docker containers, a virtualization technique that lets developers package up an application with all dependencies in a single package. If you want to test your blocks locally you'll also need (this is not a requirement):
Docker desktop installed on your machine.
1.1 - Parquet schema
This is the Parquet schema for the gestures.parquet
file which we'll transform:
To build a transformation block, open a command prompt or terminal window, create a new folder, and run edge-impulse-blocks init:
This will prompt you to log in, and enter the details for your block. E.g.:
Then, create the following files in this directory:
2.1 - Dockerfile
We're building a Python based transformation block. The Dockerfile describes our base image (Python 3.7.5), our dependencies (in requirements.txt
) and which script to run (transform.py
).
Note: Do not use a WORKDIR
under /home
! The /home
path will be mounted in by Edge Impulse, making your files inaccessible.
ENTRYPOINT vs RUN / CMD
If you use a different programming language, make sure to use ENTRYPOINT
to specify the application to execute, rather than RUN
or CMD
.
2.2 - requirements.txt
This file describes the dependencies for the block. We'll be using pandas
and pyarrow
to parse the Parquet file, and numpy
to do some calculations.
2.3 - transform.py
This file includes the actual application. Transformation blocks are invoked with the following parameters (as command line arguments):
--in-file
or --in-directory
- A file (if the block operates on a file), or a directory (if the block operates on a data item) from the organizational dataset. In this case the gestures.parquet
file.
--out-directory
- Directory to write files to.
--hmac-key
- You can use this HMAC key to sign the output files. This is not used in this tutorial.
--metadata
- Key/value pairs containing the metadata for the data item, plus additional metadata about the data item in the dataItemInfo
key. E.g.:
{
    "subject": "AAA001",
    "ei_check": "1",
    "dataItemInfo": {
        "id": 101,
        "dataset": "Human Activity 2022",
        "bucketName": "edge-impulse-tutorial",
        "bucketPath": "janjongboom/human_activity/AAA001/",
        "created": "2022-03-07T09:20:59.772Z",
        "totalFileCount": 14,
        "totalFileSize": 6347421
    }
}
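The arguments listed above could be parsed along these lines. This is a minimal sketch, not the tutorial's own code: it assumes that --metadata arrives as a single JSON-encoded string, and the function name parse_block_args is purely illustrative.

```python
import argparse
import json


def parse_block_args(argv=None):
    """Parse the command line arguments a transformation block receives."""
    parser = argparse.ArgumentParser(description="Transformation block arguments")
    # A block operating on files gets --in-file; one operating on data items
    # gets --in-directory. Exactly one of the two is expected.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--in-file", type=str)
    source.add_argument("--in-directory", type=str)
    parser.add_argument("--out-directory", type=str, required=True)
    parser.add_argument("--hmac-key", type=str, default=None)
    # Assumption: the metadata is passed as one JSON-encoded string.
    parser.add_argument("--metadata", type=json.loads, default={})
    # parse_known_args tolerates extra arguments such as custom parameters.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Calling parse_block_args() without arguments reads from the real command line; passing a list makes the function easy to exercise locally.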
Add the following content. This takes in the Parquet file, groups data by their label, and then calculates the RMS over the X, Y and Z axes of the accelerometer.
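As a rough sketch of the calculation described here (not the original transform.py), the grouping and RMS step might look as follows. The column names label, accX, accY and accZ and the output file name out.parquet are assumptions, because the schema is not reproduced in this section. The RMS helper uses the standard library instead of numpy so that it has no dependencies; pandas (with pyarrow) is only needed for reading and writing Parquet.

```python
import math
import os
from collections import defaultdict

# Assumed accelerometer column names; adjust to the real Parquet schema.
AXES = ("accX", "accY", "accZ")


def rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_by_label(records, axes=AXES, label_key="label"):
    """Group rows (dicts) by label and compute the RMS of each axis."""
    grouped = defaultdict(lambda: {axis: [] for axis in axes})
    for row in records:
        for axis in axes:
            grouped[row[label_key]][axis].append(row[axis])
    return [
        {label_key: label,
         **{f"rms_{axis}": rms(grouped[label][axis]) for axis in axes}}
        for label in sorted(grouped)
    ]


def transform(in_file, out_directory):
    """Read a Parquet file, compute per-label RMS, write a new Parquet file."""
    import pandas as pd  # imported here so the helpers above work without pandas

    records = pd.read_parquet(in_file).to_dict("records")
    os.makedirs(out_directory, exist_ok=True)
    out_file = os.path.join(out_directory, "out.parquet")  # hypothetical name
    pd.DataFrame(rms_by_label(records)).to_parquet(out_file, index=False)
    return out_file
```

In a real block, transform() would be called with the --in-file and --out-directory values from the command line.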
2.4 - Building and testing the container
On your local machine
To test the transformation block locally, if you have Python and all dependencies installed, just run:
Docker
You can also build the container locally via Docker, and test the block. The added benefit is that you don't need any dependencies installed on your local computer, and can thus test that you've included everything that's needed for the block. This requires Docker desktop to be installed.
To build the container and test the block, open a command prompt or terminal window and navigate to the source directory. First, build the container:
Then, run the container (make sure gestures.parquet
is in the same directory):
Seeing the output
This process has generated a new Parquet file in the out/
directory containing the RMS of the X, Y and Z axes. If you inspect the content of the file (e.g. using parquet-tools) you'll see the output:
Success!
With the block ready we can push it to your organization. Open a command prompt or terminal window, navigate to the folder you created earlier, and run edge-impulse-blocks push.
This packages up your folder and sends it to Edge Impulse, where it is built and then added to your organization.
The transformation block is now available in Edge Impulse under Data transformation > Transformation blocks.
If you make any changes to the block, just re-run edge-impulse-blocks push
and the block will be updated.
Next, upload the gestures.parquet
file, by going to Data > Add data... > Add data item, setting name as 'Gestures', dataset to 'Transform tutorial', and selecting the Parquet file.
This makes the gestures.parquet
file available from the Data page.
With the Parquet file in Edge Impulse and the transformation block configured you can now create a new job. Go to Data, and select the Parquet file by setting the filter to dataset = 'Transform tutorial'
.
Click the checkbox next to the data item, and select Transform selected (1 file). On the 'Create transformation job' page select 'Import data into Dataset'. Under 'output dataset', select 'Same dataset as source', and under 'Transformation block' select the new transformation block.
Click Start transformation job to start the job. This pulls the data in, starts a transformation job and finally uploads the data back to your dataset. If you have multiple files selected the transformations will also run in parallel.
You can now find the transformed file back in your dataset:
Transformation blocks are a powerful feature which let you set up a data pipeline to turn raw data into actionable machine learning features. It also gives you a reproducible way of transforming many files at once, and is programmable through the Edge Impulse API so you can automatically convert new incoming data. If you're interested in transformation blocks or any of the other enterprise features, let us know!
Updating metadata from a transformation block
You can update the metadata of data items directly from a transformation block by creating an ei-metadata.json
file in the output directory. The metadata is then applied to the new data item automatically when the transform job finishes. The ei-metadata.json
file has the following structure:
Some notes:
If action
is set to add
the metadata keys are added to the data item. If action
is set to replace
all existing metadata keys are removed.
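A transformation block could write this file with a small helper like the one below. This is a sketch under stated assumptions: only the action field and its add / replace values are confirmed by the notes above, while the version and metadata field names, and storing values as strings, are assumptions. The helper name write_ei_metadata is hypothetical.

```python
import json
import os


def write_ei_metadata(out_directory, metadata, action="add"):
    """Write an ei-metadata.json file into the block's output directory."""
    if action not in ("add", "replace"):
        raise ValueError("action must be 'add' or 'replace'")
    payload = {
        "version": 1,  # assumed field
        "action": action,
        # Assumed field name; values are stored as strings.
        "metadata": {str(k): str(v) for k, v in metadata.items()},
    }
    os.makedirs(out_directory, exist_ok=True)
    path = os.path.join(out_directory, "ei-metadata.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=4)
    return path
```

For example, write_ei_metadata(out_directory, {"ei_check": 1}) would add an ei_check key while leaving existing keys in place.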
Environment variables
Transformation blocks get access to the following environment variables, which let you authenticate with the Edge Impulse API. This way you don't have to inject these credentials into the block. The variables are:
EI_API_KEY
- an API key with 'member' privileges for the organization.
EI_ORGANIZATION_ID
- the organization ID that the block runs in.
EI_API_ENDPOINT
- the API endpoint (default: https://studio.edgeimpulse.com/v1).
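Inside a block, these variables could be read as shown below. This is a minimal sketch: the helper names are illustrative, and sending the key in an x-api-key header is an assumption that should be checked against the API documentation.

```python
import os

DEFAULT_ENDPOINT = "https://studio.edgeimpulse.com/v1"


def api_settings(environ=os.environ):
    """Collect the credentials that Edge Impulse injects into the block."""
    return {
        "api_key": environ["EI_API_KEY"],
        "organization_id": environ["EI_ORGANIZATION_ID"],
        # Fall back to the documented default when the variable is not set.
        "endpoint": environ.get("EI_API_ENDPOINT", DEFAULT_ENDPOINT).rstrip("/"),
    }


def auth_headers(settings):
    """Build request headers (assumption: the key goes in x-api-key)."""
    return {"x-api-key": settings["api_key"], "Accept": "application/json"}
```

Passing a plain dictionary instead of os.environ makes it possible to test the block locally without exporting real credentials.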
Custom parameters
You can specify custom arguments or parameters to your block by adding a parameters.json
file in the root of your block directory. This file describes all arguments for your transformation block, and is used to render custom UI elements for each parameter. For example, this parameters file:
Renders the following UI when you run the transformation block:
And the options are passed in as command line arguments to your block:
For more information, and all options see Adding parameters to custom blocks.
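On the receiving side, a block might turn such parameter descriptions into command line options like this. The sketch is illustrative only: the example parameters and the param, type and value keys are assumptions, not a copy of the parameters.json specification.

```python
import argparse

# Hypothetical parameter descriptions. The "param", "type" and "value" keys
# are assumptions made for this sketch.
EXAMPLE_PARAMETERS = [
    {"name": "Window size (ms)", "param": "window-size-ms", "type": "int", "value": 1000},
    {"name": "Output prefix", "param": "output-prefix", "type": "string", "value": "rms"},
]

_TYPES = {"int": int, "float": float, "string": str}


def build_parser(parameters):
    """Create an argparse parser with one option per custom parameter."""
    parser = argparse.ArgumentParser()
    for p in parameters:
        parser.add_argument(f"--{p['param']}", type=_TYPES[p["type"]], default=p["value"])
    return parser
```

Using parse_known_args() on the resulting parser lets the custom options coexist with the standard block arguments such as --in-file.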
In this section, we will show how to synchronize research data with a bucket in your organizational dataset. The goal of this step is to gather data from different sources and sort it into a single structured dataset (which we will then validate in the next section).
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
The study described in the health reference design consists of 10 subjects, each performing 1.5 to 2 hours of activities in a research lab. Participants have a study ID (e.g. AMS_001) that is used to refer to the participant. For each participant we have 4 CSV files:
accelerometer.csv
- data from the wearable end device.
ppg.csv
- data from the wearable end device.
polar_h10.csv
- reference data from a commercial reference device (Polar H10).
labels.csv
- labels of the activity, as recorded by the research lab.
We've mimicked a proper research study, and have split the data up into two locations.
accelerometer.csv
/ ppg.csv
- live in the company data lake in S3. The data lake uses an internal structure with non-human readable IDs for each participant (e.g. 2E93ZX
for anonymized data):
polar_h10.csv
/ labels.csv
are uploaded by the research partner to an upload portal. The files are prefixed with the study ID:
To create the mapping between the study ID and the internal data lake ID we use a study master sheet. It contains information about all participants, ID mapping, and metadata. E.g.:
Notes: This master sheet was made using a Google Sheet but can be anything. All data (data lake, portal, output) are hosted in an Edge Impulse S3 bucket but can be stored anywhere (see below).
Data is stored in storage buckets, which can either be hosted by Edge Impulse, or in your own infrastructure. If you choose to host the data yourself, your infrastructure should be available through the S3 API, and you are responsible for setting up proper backups. To configure a new storage bucket, head to your organization, choose Data > Buckets, click Add new bucket, and fill in your access credentials. Our solution engineers can also help you set up the buckets.
With the storage bucket in place you can create your first dataset. Datasets in Edge Impulse have three layers:
The dataset, a larger set of data items, grouped together.
Data item, an item with metadata and files attached.
Data file, the actual files.
No required format for data files
There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.
There are three ways of uploading data into your organization. You can either:
Upload data directly to the storage bucket (recommended method). In this case use Add data... > Add dataset from bucket and the data will be discovered automatically.
Upload data through the Edge Impulse API.
Upload the files through the Upload Portals.
The sorter is the first step of the research pipeline. Its job is to fetch the data from all locations (here: internal data lake, portal, metadata from study master sheet) and create a research dataset in Edge Impulse. It does this by:
Creating a new structure in S3 like this:
Syncing the S3 folder with a research dataset in your Edge Impulse organization (like AMS Activity Study 2022
).
Updating the metadata with the metadata from the master sheet (Age
, BMI
, etc...).
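The ID mapping at the heart of the sorter can be sketched as a function that turns master sheet rows into a list of copy operations. Everything concrete here is hypothetical: the study_id and lake_id keys, the data-lake/, portal/ and sorted/ prefixes, and the file naming are placeholders, not the layout used in the reference design.

```python
def plan_sorted_layout(master_rows, target_prefix="sorted"):
    """Return (source, destination) pairs for every participant file."""
    plan = []
    for row in master_rows:
        study_id, lake_id = row["study_id"], row["lake_id"]
        target = f"{target_prefix}/{study_id}"
        # Files from the internal data lake are stored under the internal ID.
        for name in ("accelerometer.csv", "ppg.csv"):
            plan.append((f"data-lake/{lake_id}/{name}", f"{target}/{name}"))
        # Files from the upload portal are prefixed with the study ID.
        for name in ("polar_h10.csv", "labels.csv"):
            plan.append((f"portal/{study_id}_{name}", f"{target}/{name}"))
    return plan
```

The resulting plan could then be executed with any S3 client, after which the target folder is synced with the research dataset.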
With the data sorted we then:
Need to verify that the data is correct (see validate your research data)
Combine the data into a single Parquet file. This is essentially the contract we have for our dataset. By settling on a standard format (strongly typed, same column names everywhere) this data is now ready to be used for ML, new algorithm development, etc. Because we also add metadata for each file here we're very quickly building up a valuable R&D datastore.
All these steps can be run through different transformation blocks and executed one after the other using data pipelines.
Building data pipelines is a very useful feature where you can stack several transformation blocks similar to the Data sources pipelines. They can be used in a standalone mode (just execute several transformation jobs in a pipeline), to feed a dataset or to feed a project.
Only available with Edge Impulse Professional and Enterprise Plans
Try our Professional Plan or FREE Enterprise Trial today.
The examples in the screenshots below show how to create and use a pipeline to create the 'AMS Activity 2022' dataset.
To create a new pipeline, click on + Add a new pipeline:
In your organization workspace, go to Custom blocks -> Transformation and select Run job on the job you want to add.
Select Copy as pipeline step and paste it into the configuration JSON file.
You can then paste the copied step directly into the respective field.
Below, you have the option to feed the data to either an organizational dataset or an Edge Impulse project.
By default, your pipeline will run every day. To schedule your pipeline jobs, click on the ⋮
button and select Edit pipeline.
Once the pipeline has finished successfully, it can send an email to the users selected under Users to notify.
Once your pipeline is set, you can run it directly from the UI, from external sources or by scheduling the task.
To run your pipeline from Edge Impulse studio, click on the ⋮
button and select Run pipeline now.
To run your pipeline from code, click on the ⋮
button and select Run pipeline from code. This will display an overlay with curl
, Node.js
and Python
code samples.
You will need to create an API key to run the pipeline from code.
Another useful feature is to create a webhook that calls a URL when the pipeline has run. It sends a POST request containing the following information:
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
You can optionally show a check mark in the list of data items, and show a check list for data items. This can be used to quickly view which data items are complete (if you need to capture data from multiple sources) or whether items are in the right format.
Checklists look trivial, but they are powerful because they surface dataset issues quickly. Discovering these issues only after the study is done can be very expensive.
Checklists are written to ei-metadata.json
and are automatically picked up by the UI.
Checklists are driven by the metadata for a data item. Set the ei_check
metadata item to either 0
or 1
to show a check mark in the list. Set an ei_check_KEYNAME
metadata item to 0
or 1
to show the item in the check list.
To query for items with or without a check mark, use a filter in the form of metadata->ei_check = 1 (or metadata->ei_check = 0 for items without a check mark).
To make it easy to create these lists on the fly you can set these metadata items directly from a transformation block.
For the reference design described and used in the previous pages, the combiner takes in a data item, and writes out:
A checklist, e.g.:
✔ - PPG file present
✔ - Accelerometer file present
✘ - Correlation between Polar/PPG HR is at least 0.5
If the checklist is OK, a combined.parquet
file.
A hr.png
file with the correlation between HR found from PPG, and HR from the reference device. This is useful for two reasons:
If the correlation is too low we're looking at the wrong file, or data is missing.
Verify if the PPG => HR algorithm actually works.
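The checklist above maps naturally onto metadata keys. The sketch below shows one way a combiner could build them; the key names after the ei_check_ prefix, the file names checked, and the decision to set ei_check only when every item passes are assumptions made for illustration.

```python
def build_checklist(files, hr_correlation, min_correlation=0.5):
    """Build ei_check metadata for a data item.

    files: names of the files present in the data item.
    hr_correlation: correlation between Polar and PPG heart rate, or None.
    """
    checks = {
        "ei_check_ppg_file_present": "ppg.csv" in files,
        "ei_check_accelerometer_file_present": "accelerometer.csv" in files,
        "ei_check_hr_correlation_ok": (
            hr_correlation is not None and hr_correlation >= min_correlation
        ),
    }
    # Metadata values are written as the strings "0" or "1".
    metadata = {key: "1" if ok else "0" for key, ok in checks.items()}
    # Assumption: the overall check mark requires every item to pass.
    metadata["ei_check"] = "1" if all(checks.values()) else "0"
    return metadata
```

The returned dictionary can be written to ei-metadata.json so that the UI picks up the check mark and the individual checklist items.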
Organizational datasets contain a powerful query system which lets you explore and slice data. You control the query system through the 'Filter' text box, and you use a language which is very similar to SQL (documentation).
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
For example, here are some queries that you can make:
dataset like '%AMS Activity Study%'
- returns all items and files from the study.
bucket_name = 'edge-impulse-health-reference-design' AND --labels sitting,walking
- returns data whose label is 'sitting' and 'walking', and that is stored in the 'edge-impulse-health-reference-design' bucket.
metadata->ei_check = 0
- returns data that has a metadata field 'ei_check' which is '0'.
created > DATE('2022-08-01')
- returns all data that was created after Aug 1, 2022.
After you've created a filter, you can select one or more data items, and select Actions...>Download selected to create a ZIP file with the data files. The file count reflects the number of files returned by the filter.
The previous queries all returned all files for a data item. But you can also query files through the same filter. In that case the data item will be returned, but only with the files selected. For example:
file_name LIKE '%.png'
- returns all files that end with .png
.
If you have an interesting query that you'd like to share with your colleagues, you can just share the URL. The query is already added to it automatically.
These are all the available fields in the query interface:
dataset
- Dataset.
bucket_id
- Bucket ID.
bucket_name
- Bucket name.
bucket_path
- Path of the data item within the bucket.
id
- Data item ID.
name
- Data item name.
total_file_count
- Number of files for the data item.
total_file_size
- Total size of all files for the data item.
created
- When the data item was created.
metadata->key
- Any item listed under 'metadata'.
file_name
- Name of a file.
file_names
- All filenames in the data item, that you can use in conjunction with CONTAINS
. E.g. find all items with file X, but not file Y: file_names CONTAINS 'x' AND not file_names CONTAINS 'y'
.
Data transformation or transformation jobs refer to processes that apply specific transformations to the data within an Edge Impulse organizational dataset. These jobs are executed using Transformation blocks, which are essentially scripts packaged in Docker containers. They perform a variety of tasks on the data, enabling more advanced and customized dataset transformation and manipulation.
The transformation jobs can be chained together in Data pipelines to automate your workflows.
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
You have several options to create a transformation job:
From the Data transformation page by selecting the Create job tab.
From the Custom blocks->Transformation page by selecting the "⋮" action button and selecting Run job.
From the Data page:
Depending on whether you are on a Default dataset or a Clinical dataset, the view will vary:
Again, depending on whether you are on a Default dataset or a Clinical dataset, the view will vary. The common options are the Name of the transformation job and the Transformation block used for the job.
If your Transformation block has additional custom parameters, the input fields will be displayed below in a Parameters section. For example:
Dataset type options:
Clinical Datasets: Operate on "data items" with a strict file structure. Transformation is specified using SQL-like syntax.
Default Datasets: Resemble a typical file system with flexible structure. You can specify data for transformation using wildcards.
For more information about the two dataset types, see the dedicated Data page.
Input
After selecting your Input dataset, you can filter which files or directories you want to transform.
In default dataset formats, we use wildcard filters (in a similar format to wildcards in git). This enables you to specify patterns that match multiple files or directories within your dataset:
Asterisk ( * ): Represents any number of characters (including zero characters) in a filename or directory name. It is commonly used to match files of a certain type or files whose names follow a pattern.
Example: /folder/*.png
matches all PNG files in the /folder
directory.
Example: /data/*/results.csv
matches any results.csv file in a subdirectory under /data
.
Double Asterisk ( ** ): Used to match any number of directories, including nested ones. This is particularly useful when the structure of directories is complex or not uniformly organized.
Example: /data/**/experiment-*
matches all files or directories starting with experiment-
in any subdirectory under /data
.
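To make the difference between the two wildcards concrete, here is a small matcher that follows the rules described above. It is an approximation written for illustration, not the matcher Edge Impulse uses: it assumes that a single asterisk never crosses a directory separator and that `**/` matches zero or more directories, as in git.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard filter into a regular expression string."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # anything, including separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return "".join(parts)


def matches(pattern, path):
    """Return True if the whole path matches the wildcard pattern."""
    return re.fullmatch(wildcard_to_regex(pattern), path) is not None
```

With this sketch, matches("/folder/*.png", "/folder/a.png") is true, while the same pattern does not match a PNG file in a nested subfolder.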
Output
When you work with default datasets in Edge Impulse, you have the flexibility to define how the output from your transformation jobs is structured. There are three main rules to choose from:
No Subfolders: This rule places all transformed files directly into your specified output directory, without creating any subfolders. For example, if you transform .txt
files in /data
and choose /output
as your output directory, all transformed files will be saved directly in /output
.
Subfolder per Input Item: Here, a new subfolder is created in the output directory for each input file or folder. This keeps the output from each item organized and separate. For instance, if your input includes folders like /data/2020
, /data/2021
, and /data/2022
, and you apply this rule with /transformed
as your output directory, you will get subfolders like /transformed/2020
, /transformed/2021
, and /transformed/2022
, each containing the transformed data from the corresponding input year.
Use Full Path: This rule mirrors the entire input path when creating new sub-folders in the output directory. It's especially useful for maintaining a clear trace of where each piece of output data originated, which is important in complex directory structures. For example, if you're transforming files in /project/data/experiments
, and you choose /results
as your output directory, the output will follow the full input path, resulting in transformed data being stored in /results/project/data/experiments
.
Note: For transformation blocks that operate on files, the Subfolder and Full Path options use the file name without its extension to create the base folder. For example, /activity-detection/Accelerometer.csv
will be uploaded to /activity-detection-output/Accelerometer/
.
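The three output rules can be summarized as a small path function. This is a sketch of the behavior described above, using hypothetical rule identifiers (no-subfolders, subfolder-per-item, full-path); it is not code from Edge Impulse.

```python
import posixpath


def output_folder(input_path, output_dir, rule, is_file=False):
    """Compute where the output for one input item would be stored."""
    path = input_path.rstrip("/")
    if is_file:
        # Blocks operating on files use the file name without its extension.
        path = posixpath.splitext(path)[0]
    if rule == "no-subfolders":
        return output_dir
    if rule == "subfolder-per-item":
        return posixpath.join(output_dir, posixpath.basename(path))
    if rule == "full-path":
        return posixpath.join(output_dir, path.lstrip("/"))
    raise ValueError(f"unknown rule: {rule}")
```

For instance, output_folder("/data/2020", "/transformed", "subfolder-per-item") yields /transformed/2020, which matches the example given for that rule.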
Input
When running transformation jobs using the Clinical dataset option, you can query your input files or folders in all your clinical datasets. We use a different filtering mechanism for the Clinical datasets.
Filters
You can use a language which is very similar to SQL (documentation). See more on how to query your data on the dedicated documentation page. For example you can use filters like the following:
dataset = 'Activity Detection (Clinical view)' AND file_name like 'Accelero%'
dataset = 'Activity Detection (Clinical view)' AND metadata->ei_check = 1
Import into project
Import into dataset
Number of parallel jobs
For transformation jobs operating on Data items (directory) or on Files, you can edit the number of parallel jobs to run simultaneously.
Users to notify
Finally, you can select users you want to notify over email when this job finishes.