Since the creation of Edge Impulse, we have been helping customers to deal with complex data pipelines, complex data transformation methods and complex clinical validation studies.
In most cases, before even thinking about machine learning algorithms, researchers need to build quality datasets from real-world data. These data come from various devices (prototype devices under development vs. clinical/industrial-grade reference devices), have different formats (Excel sheets, images, CSV, JSON, etc.), and are stored in various places (researchers' computers, Dropbox folders, Google Drive, S3 buckets, etc.).
Such a complex data infrastructure is time-consuming and expensive to develop and maintain. With organizational data, we want to give you tools to centralize, validate, and transform datasets so they can be easily imported into your projects to train your machine learning models.
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
Health reference design
We have built a health reference design that describes an end-to-end ML workflow for building a wearable health product using Edge Impulse.
In this reference design, we want to help you understand how to create a full clinical data pipeline by using a public dataset from the PPG-DaLiA repository. This tutorial will guide you through the following steps:
Before we get started, you must link your organization with one or more storage buckets. Further details about how to integrate with cloud storage providers can be found in the Cloud data storage document.
Two types of dataset structures can be used - Generic datasets (default) and Clinical datasets.
There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.
However, to import data items into an Edge Impulse project, you will need to use the right format, because the Studio ingestion API supports only the following formats:
JPG, PNG images
MP4, AVI video files
WAV audio files
CBOR/JSON files in the Edge Impulse data acquisition format
CSV files
Tip: You can use transformation blocks to convert your data
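As a rough illustration of the format check described above, the Python sketch below tests whether a file's extension appears in the list of ingestion formats. The function name `is_directly_importable` and the extension set are assumptions derived from that list, not an Edge Impulse API; actual ingestion may look at file contents rather than the extension alone.

```python
from pathlib import Path

# Extensions derived from the format list above. Treating ".jpeg" as JPG is
# an assumption of this sketch.
SUPPORTED_EXTENSIONS = {
    ".jpg", ".jpeg", ".png",   # images
    ".mp4", ".avi",            # video
    ".wav",                    # audio
    ".cbor", ".json",          # Edge Impulse data acquisition format
    ".csv",                    # tabular data
}


def is_directly_importable(filename: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

A file that fails this check, such as a Parquet file, would first need to be converted, for example with a transformation block.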
The default dataset structure is a file-based one, no matter the directory structure:
For example:
or:
Note that you will be able to associate the labels of your data items from the file name or the directory name when importing your data in a project.
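The two labeling options can be pictured with a small Python sketch. How Studio actually parses names is not described here; the rule used below for file names (everything before the first dot) is an assumption for illustration, and both function names are hypothetical.

```python
from pathlib import PurePosixPath


def label_from_filename(path: str) -> str:
    """Use the part of the file name before the first dot as the label."""
    return PurePosixPath(path).name.split(".", 1)[0]


def label_from_directory(path: str) -> str:
    """Use the name of the file's parent directory as the label."""
    return PurePosixPath(path).parent.name
```

With these helpers, a hypothetical file `dataset/idle.01.csv` would be labeled `idle` by file name, and `dataset/wave/sample01.csv` would be labeled `wave` by directory name.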
The clinical datasets structure in Edge Impulse has three layers:
The dataset, a larger set of data items, grouped together.
Data item, an item with metadata and files attached.
Data file, the actual files.
See the health reference design tutorial for a deeper explanation.
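To make the three layers concrete, here is a minimal Python sketch that models them as nested data classes. The class and field names are hypothetical and chosen for illustration only; they do not mirror the Edge Impulse API or its storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """The actual file attached to a data item."""
    name: str
    size_bytes: int


@dataclass
class DataItem:
    """An item with metadata and files attached."""
    name: str
    metadata: dict = field(default_factory=dict)
    files: list = field(default_factory=list)


@dataclass
class Dataset:
    """A larger set of data items, grouped together."""
    name: str
    items: list = field(default_factory=list)

    def query(self, key: str, value: str) -> list:
        """Return the data items whose metadata[key] equals value."""
        return [item for item in self.items if item.metadata.get(key) == value]
```

Attaching metadata at the data item level is what makes a clinical dataset queryable: a filter such as `dataset.query("activity", "walking")` selects items without opening any of their files.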
Once you have successfully linked your storage bucket to your organization, head to the Datasets tab and click on + Add new dataset:
Fill out the following form:
Click on Create dataset
With your datasets imported, you can now navigate into your dataset, create folders, query your dataset, add data items and import your data to an Edge Impulse project.
Default view
The default view lets you navigate in your bucket following the directory structure. You can create folders using the "+ New folder" button. To add new data, use the right panel: drag and drop your files and folders, and they will be uploaded to your bucket automatically.
Clinical view
The clinical view is slightly different; see synchronizing clinical data with a bucket for more information. This view lets you easily query your clinical dataset, but to import data you will need to set up an upload portal or upload the files directly to your bucket.
Tip: You can add two distinct datasets in Edge Impulse that point to the same bucket path, one generic and one clinical. This way you can leverage both the easy upload and the ability to query your datasets.
Go to Actions... -> Import data into a project, select the project you wish to import into, and click "Next, Configure how to label this data":
This will import the data into the project and optionally create a new label for each file in the dataset. This labeling step helps you keep track of different classes or categories within your data.
After the data is imported into the project, the "Next, post-sync actions" step lets you configure a data pipeline that automatically retrieves data and triggers actions in your project:
We have also added a data preview feature, allowing you to visualize certain types of data directly within the organization data tab.
Supported data types include tables (CSV/Parquet), images, PDFs, audio files (WAV/MP3), and text files (TXT/JSON). This feature gives you a quick overview of your data and helps ensure its integrity and correctness.
Any questions, or interested in the enterprise version of Edge Impulse? Contact us for more information.
Edge Impulse makes it easy to access data that you have stored in the cloud by offering integrations with several storage providers and the flexibility to connect a storage solution to an organization or directly to a project.
There are two locations within Edge Impulse where you can connect to cloud storage: within an organization or within a project. These options are described below. For details on the specific storage provider integrations available, please see the Storage provider integrations section of this document.
Cloud storage can be connected to an organization. By connecting your data to an organization, you are offered the flexibility to pre-process your datasets through the use of transformation blocks and to feed your datasets into multiple projects.
To connect, access your organization, select Data in the left sidebar menu, select the Buckets tab at the top of the page, then click the + Add new bucket button. Follow the instructions in the modal window that pops up.
Cloud storage can be connected directly to a project. To connect, access your project, select Data acquisition in the left sidebar menu, select the Data sources tab at the top of the page, then click the + Add new data source button. Follow the instructions in the modal window that pops up.
Note that some options in the modal will be greyed out if your project is not on the Enterprise plan.
Edge Impulse allows you to integrate with several cloud storage options. These include:
To connect to an Amazon S3 bucket, you will need to provide:
The bucket name
The bucket region
An access key
A secret key
A path prefix (optional)
If the credentials provided do not have access to the root of the bucket, the prefix is used to specify the path for which the credentials are valid.
Currently only long-term credentials from AWS IAM users are supported; temporary credentials provided to AWS SSO users are not supported.
For Amazon S3 buckets, you will also need to enable CORS headers for the bucket. You can do this in the S3 console by going to your bucket, going to the permissions tab, and then adding the policy defined below to the cross-origin resource sharing section.
CORS policy (console):
Alternatively, you can save the CORS policy below as a cors.json file (note there are some differences in the structure compared to the JSON above) and add it to your bucket using the AWS S3 CLI.
CORS policy (CLI):
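The policy listings themselves are not reproduced in this text. As a hedged sketch, the Python below builds both shapes from the origin, method, and header values this document lists for S3-compatible buckets, on the assumption that the same values apply to Amazon S3. The function names are hypothetical; the key names follow the standard S3 CORS schema.

```python
import json

# Values assumed from the CORS items listed for S3-compatible buckets.
ORIGINS = ["https://studio.edgeimpulse.com"]
METHODS = ["PUT", "POST"]
HEADERS = ["*"]


def s3_cors_rule() -> dict:
    """A single CORS rule using the standard S3 key names."""
    return {
        "AllowedHeaders": list(HEADERS),
        "AllowedMethods": list(METHODS),
        "AllowedOrigins": list(ORIGINS),
    }


def s3_console_policy() -> list:
    """Console shape: a bare JSON array of rules."""
    return [s3_cors_rule()]


def s3_cli_policy() -> dict:
    """CLI shape: the same rules wrapped in a "CORSRules" object."""
    return {"CORSRules": [s3_cors_rule()]}


def write_cli_policy(path: str) -> None:
    """Write the CLI-shaped policy to a JSON file such as cors.json."""
    with open(path, "w") as f:
        json.dump(s3_cli_policy(), f, indent=2)
```

The structural difference is only the `CORSRules` wrapper. A file written by `write_cli_policy("cors.json")` could then be applied with `aws s3api put-bucket-cors --bucket <your-bucket> --cors-configuration file://cors.json`, where `<your-bucket>` is a placeholder.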
To connect to a Google Cloud Storage bucket, you will need to provide:
The bucket name
The bucket region
An access key
A secret key
A path prefix (optional)
If the credentials provided do not have access to the root of the bucket, the prefix is used to specify the path for which the credentials are valid.
For Google Cloud Storage buckets, you will also need to enable CORS headers for the bucket. You cannot manage CORS policies using the Google Cloud console; you must use the gcloud CLI instead.
CORS policy:
Save the above CORS policy as a cors.json file and add it to your bucket with the gcloud CLI using the following command:
gsutil is not the recommended CLI tool for Cloud Storage. You may have used this tool before; however, Google now recommends using gcloud storage commands in the Google Cloud CLI instead.
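Since the policy listing and command are not reproduced in this text, the following Python sketch shows one way a Cloud Storage CORS file could be generated. It assumes the same origin, method, and header values this document lists for S3-compatible buckets; the `maxAgeSeconds` value of 3600 and the function name are also assumptions. The key names follow the Cloud Storage CORS JSON schema.

```python
import json


def gcs_cors_policy(max_age_seconds: int = 3600) -> list:
    """Build a Cloud Storage CORS configuration as a list of rules.

    The values are assumptions taken from the S3-compatible CORS items in
    this document, not a confirmed Edge Impulse policy.
    """
    return [
        {
            "origin": ["https://studio.edgeimpulse.com"],
            "method": ["PUT", "POST"],
            "responseHeader": ["*"],
            "maxAgeSeconds": max_age_seconds,
        }
    ]


def write_gcs_cors_file(path: str) -> None:
    """Write the configuration to a JSON file such as cors.json."""
    with open(path, "w") as f:
        json.dump(gcs_cors_policy(), f, indent=2)
```

A file produced this way is typically applied with `gcloud storage buckets update gs://<your-bucket> --cors-file=cors.json`, where `<your-bucket>` is a placeholder.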
To connect to a Microsoft Azure Blob Storage blob container, you will need to provide:
The blob container name
The storage account name
A secret key
A path prefix (optional)
If the credentials provided do not have access to the root of the blob container, the prefix is used to specify the path for which the credentials are valid.
A CORS policy is not required with Microsoft Azure Blob Storage.
To connect to another (S3-compatible) type of bucket, you will need to provide:
The bucket name
The bucket region
The bucket endpoint
An access key
A secret key
A path prefix (optional)
If the credentials provided do not have access to the root of the bucket, the prefix is used to specify the path for which the credentials are valid.
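The connection details listed for the four providers can be summarized in a small lookup table. The Python sketch below does this and reports which required values are missing from a configuration; the provider keys and field names are hypothetical labels for the items listed above, not identifiers used by Edge Impulse.

```python
# Required connection fields per provider, as listed in this document.
# The path prefix is optional for every provider, so it is not listed here.
REQUIRED_FIELDS = {
    "s3": ["bucket", "region", "access_key", "secret_key"],
    "google": ["bucket", "region", "access_key", "secret_key"],
    "azure": ["container", "storage_account", "secret_key"],
    "other": ["bucket", "region", "endpoint", "access_key", "secret_key"],
}


def missing_fields(provider: str, config: dict) -> list:
    """Return the required fields that are absent or empty, in listed order."""
    return [name for name in REQUIRED_FIELDS[provider] if not config.get(name)]
```

For example, an S3-compatible configuration that omits the endpoint would be reported as missing `endpoint`, which is the one field that distinguishes it from an Amazon S3 connection.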
For other (S3-compatible) buckets, you will also need to enable CORS headers for the bucket. Please refer to your provider documentation for instructions on how to do so.
The items that you will need to set are the following:
Origin: ["https://studio.edgeimpulse.com"]
Method: ["PUT", "POST"]
Header: ["*"]
For cloud storage integration to work as expected, Edge Impulse needs to be provided with credentials that allow read, write, and delete operations. Please refer to your storage provider documentation for specifics.
In order to verify the connection to the cloud storage provider, Edge Impulse will write an .ei-portal-check file that will be subsequently deleted. Once a bucket is successfully connected to your organization, a green dot will appear in the connected column on the buckets overview page.
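The write-then-delete check can be illustrated with a short Python sketch. It uses a local directory as a stand-in for a bucket and is not how Edge Impulse implements the check; only the probe file name comes from this document, and the function name is hypothetical.

```python
from pathlib import Path

PROBE_NAME = ".ei-portal-check"


def verify_read_write_delete(root: Path) -> bool:
    """Write, read back, and delete a probe file under root.

    Returns True only if all three operations succeed, which mirrors the
    read, write, and delete permissions the credentials need.
    """
    probe = root / PROBE_NAME
    try:
        probe.write_text("ok")                # write
        readable = probe.read_text() == "ok"  # read
        probe.unlink()                        # delete
    except OSError:
        return False
    return readable and not probe.exists()
```

If any of the three operations is denied, the check fails, which is why credentials limited to read-only access would not be enough for the connection to be marked as successful.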
If you need to get data into your organization, you can now do this in a few simple steps. To go further and use advanced features, such as querying or transforming your datasets, please have a look at the health reference design tutorial.
No common issues have been identified thus far. If you encounter an issue, please reach out on the forum or, if you are on the Enterprise plan, through your support channels.