Building your first dataset

Organizational datasets let you build a large collection of organized sensor data that is internal to your organization. This data can then be used to create new Edge Impulse projects, imported into Pandas or MATLAB for internal exploration by your data scientists, or processed and shared with partners. Data files within the datasets can be stored on-premise or in your own cloud infrastructure.

In this tutorial we'll set up a first dataset, explore the powerful query tool, and show how to create new Edge Impulse projects from raw data.

Only available for enterprise customers

Organizational features are only available for enterprise customers. View our pricing for more information.

1. Configuring a storage bucket

Data is stored in storage buckets, which can be hosted either by Edge Impulse or in your own infrastructure. If you choose to host the data yourself, your infrastructure must be accessible through the S3 API, and you are responsible for setting up proper backups. To configure a new storage bucket, head to your organization, choose Data > Buckets, click Add new bucket, and fill in your access credentials. Make sure to name your storage bucket Internal datasets, as we'll need it to upload data later.

2. Uploading your first dataset

2.1 About datasets

With the storage bucket in place you can upload your first dataset. Datasets in Edge Impulse have three layers: 1) the dataset, a larger set of data items grouped together; 2) the data item, an item with metadata and files attached; and 3) the data file, the actual file. For example, if we're collecting data on physical activities from many subjects, we can have:

  • Dataset: 'Activities Field Study September 1994'.

    • Data item: 'Forrest Gump Running', with metadata fields "name=Forrest Gump" and "activity=running".

      • Data file: 'running01.parquet', with raw sensor data.

      • Data file: 'running02.parquet', with raw sensor data.

From here you can query and group the data. For example, you can retrieve all data from the 'Activities Field Study September 1994' dataset that was tagged with the 'running' activity. Or, you can select all the files that are smaller than 1MB and were generated by 'Forrest Gump' over all datasets.
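
To make the three layers concrete, here is a minimal Python sketch of the same hierarchy. The class names, the helper functions, and the file sizes are assumptions made for illustration; they are not part of any Edge Impulse API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    name: str
    size_bytes: int


@dataclass
class DataItem:
    name: str
    metadata: Dict[str, str]
    files: List[DataFile] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    items: List[DataItem] = field(default_factory=list)


study = Dataset(
    name="Activities Field Study September 1994",
    items=[
        DataItem(
            name="Forrest Gump Running",
            metadata={"name": "Forrest Gump", "activity": "running"},
            files=[
                DataFile("running01.parquet", size_bytes=400_000),    # hypothetical size
                DataFile("running02.parquet", size_bytes=2_500_000),  # hypothetical size
            ],
        )
    ],
)


def items_with_activity(dataset: Dataset, activity: str) -> List[DataItem]:
    """All data items in one dataset tagged with the given activity."""
    return [i for i in dataset.items if i.metadata.get("activity") == activity]


def small_files_by(datasets: List[Dataset], person: str,
                   max_bytes: int = 1_000_000) -> List[str]:
    """Names of files under max_bytes generated by one person, over all datasets."""
    return [
        f.name
        for ds in datasets
        for item in ds.items
        if item.metadata.get("name") == person
        for f in item.files
        if f.size_bytes < max_bytes
    ]
```

The two helpers mirror the two example questions above: one filters data items by a metadata field within a single dataset, the other filters individual files by size across every dataset.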

2.2 Importing the continuous gestures dataset

For this tutorial we'll use a dataset containing 9 minutes of accelerometer data for a gesture recognition system. Download the dataset and unzip it in a convenient location.

No required format for data files

There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.

There are three ways of uploading data to your dataset. You can either:

  1. Upload the files directly with the UI (we'll do this in this tutorial).

  2. Upload data through the Edge Impulse API.

  3. Or, upload data directly to the storage bucket (recommended for large datasets). In this case use Add data... > Add dataset from bucket and the data will be discovered automatically.

For this dataset we want to create four data items, one for every class ('idle', 'snake', 'updown', 'wave'). On the Data page, select Add data... > Add data item, set the name to 'Idle', the dataset to 'Gestures study', the metadata to { "gesture": "idle" }, and select all 'idle' files.

Do the same for the 'snake', 'updown' and 'wave' data, so you end up with four data items with 70 files in total.
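
If you prefer to plan the four data items before clicking through the UI, the grouping can be sketched in a few lines of Python. This assumes the unzipped files are named with the gesture as a prefix (for example 'idle.0.cbor'); the function name and the dictionary layout are hypothetical and only mirror the fields of the Add data item form.

```python
from collections import defaultdict


def plan_data_items(file_names, dataset="Gestures study"):
    """Group files by their gesture prefix and describe one data item per class."""
    grouped = defaultdict(list)
    for file_name in file_names:
        # Assumption: everything before the first dot is the gesture label.
        gesture = file_name.split(".", 1)[0]
        grouped[gesture].append(file_name)

    return [
        {
            "name": gesture.capitalize(),
            "dataset": dataset,
            "metadata": {"gesture": gesture},
            "files": sorted(files),
        }
        for gesture, files in sorted(grouped.items())
    ]


# Hypothetical file names, for illustration only.
example_files = ["idle.1.cbor", "wave.2.cbor", "idle.0.cbor",
                 "snake.0.cbor", "updown.0.cbor"]
plan = plan_data_items(example_files)
```

Each entry in 'plan' then corresponds to one pass through the Add data item dialog.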

3. Querying and downloading data

Organizational datasets contain a powerful query system which lets you explore and slice data. You control the query system through the 'Filter' text box, and you use a language which is very similar to SQL (documentation). For example, here are some queries that you can make:

  • dataset = 'Gestures study' - returns all items and files from the study.

  • bucket_name = 'Internal datasets' AND name IN ('Updown', 'Snake') - returns data whose name is either 'Updown' or 'Snake', and that is stored in the 'Internal datasets' bucket.

  • metadata->gesture = 'updown' - returns data that has a metadata field 'gesture' set to 'updown'.

  • created > DATE('2020-03-01') - returns all data that was created after March 1, 2020.
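
For readers more at home in Python than in SQL, the four filters above can be read as the following predicates over a list of records. This is only a sketch of the semantics, not how the query engine is implemented; the records and their creation dates are invented for illustration.

```python
from datetime import date

# Hypothetical records that mirror the fields used in the example filters.
items = [
    {"dataset": "Gestures study", "bucket_name": "Internal datasets", "name": "Idle",
     "metadata": {"gesture": "idle"}, "created": date(2020, 2, 20)},
    {"dataset": "Gestures study", "bucket_name": "Internal datasets", "name": "Snake",
     "metadata": {"gesture": "snake"}, "created": date(2020, 3, 5)},
    {"dataset": "Gestures study", "bucket_name": "Internal datasets", "name": "Updown",
     "metadata": {"gesture": "updown"}, "created": date(2020, 3, 5)},
    {"dataset": "Other study", "bucket_name": "Archive", "name": "Wave",
     "metadata": {"gesture": "wave"}, "created": date(2019, 11, 1)},
]


def names(predicate):
    """Names of all items for which the predicate holds."""
    return [item["name"] for item in items if predicate(item)]


# dataset = 'Gestures study'
in_study = names(lambda i: i["dataset"] == "Gestures study")

# bucket_name = 'Internal datasets' AND name IN ('Updown', 'Snake')
in_bucket_and_named = names(
    lambda i: i["bucket_name"] == "Internal datasets" and i["name"] in ("Updown", "Snake")
)

# metadata->gesture = 'updown'
updown_only = names(lambda i: i["metadata"].get("gesture") == "updown")

# created > DATE('2020-03-01')
created_after = names(lambda i: i["created"] > date(2020, 3, 1))
```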

After you've created a filter, you can select one or more data items, and select Download selected to create a ZIP file with the data files. The file count reflects the number of files returned by the filter.

The previous queries returned every file of each matching data item. You can also filter on files through the same text box. In that case the data item is still returned, but only with the matching files. For example:

  • file_name LIKE '%.0.cbor' - returns all files that end with .0.cbor.
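
The behaviour of a file-level filter can be sketched in Python as follows. The sketch assumes standard SQL LIKE semantics ('%' matches any run of characters, '_' matches exactly one) and case-sensitive matching; neither is confirmed by this tutorial, and the function names are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def filter_files(item, pattern):
    """Return a copy of the item holding only matching files, or None if none match."""
    regex = like_to_regex(pattern)
    matching = [f for f in item["files"] if regex.fullmatch(f)]
    return {**item, "files": matching} if matching else None


# Hypothetical data item, for illustration only.
idle = {"name": "Idle", "files": ["idle.0.cbor", "idle.1.cbor", "idle.10.cbor"]}
selected = filter_files(idle, "%.0.cbor")
```

Note that 'idle.10.cbor' is not selected: the pattern requires the name to end in a dot, a zero, and '.cbor'.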

If you have an interesting query that you'd like to share with your colleagues, you can just share the URL. The query is already added to it automatically.

3.1 All available fields

These are all the available fields in the query interface:

  • dataset - Dataset.

  • bucket_id - Bucket ID.

  • bucket_name - Bucket name.

  • bucket_path - Path of the data item within the bucket.

  • id - Data item ID.

  • name - Data item name.

  • total_file_count - Number of files for the data item.

  • total_file_size - Total size of all files for the data item.

  • created - When the data item was created.

  • metadata->key - Any item listed under 'metadata'.

  • file_name - Name of a file.

  • file_names - All file names in the data item, which you can use together with CONTAINS. For example, to find all items that have file X but not file Y: file_names CONTAINS 'x' AND not file_names CONTAINS 'y'.
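
One practical use of file_names is spotting incomplete data items. The Python sketch below shows the intended logic, assuming that CONTAINS tests whether the list includes an exact file name (this tutorial does not spell that out). The records and file names are hypothetical.

```python
def has_x_but_not_y(item, x, y):
    """True when the item's file list includes x and does not include y."""
    names = item["file_names"]
    return x in names and y not in names


# Hypothetical data items: the second one is missing its label file.
data_items = [
    {"name": "Session 1", "file_names": ["accel.cbor", "labels.csv"]},
    {"name": "Session 2", "file_names": ["accel.cbor"]},
]

incomplete = [i["name"] for i in data_items
              if has_x_but_not_y(i, "accel.cbor", "labels.csv")]
```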

4. Importing data in an Edge Impulse project

If you have an interesting subset of data and want to train a machine learning model on it, you can export the data into a new Edge Impulse project. This makes a copy of the data that you can then manipulate and explore like any other project, or share with outside researchers without any risk of leaking the rest of your dataset. The copy is also stripped of any metadata, such as the name of the data item or any metadata that you attached to the files.

Edge Impulse data acquisition format

This section only applies if your data is already in either the Edge Impulse Data acquisition format (CBOR and JSON both work), or in WAV, JPG or PNG format. For other data you'll need to use a transformation block before being able to create a new project.

Let's put this in practice. You need to select some data for the new project. Go to the Data page and set the filter to:

dataset = 'Gestures study'

Then, select all items and click Transform selected (70 files).

This redirects you to the 'Transformation job' page. Under 'Import data into', select 'Project'. Under 'Project' select '+ Create new project', and enter a name. Next, select the category. This determines whether this is 'training' or 'testing' data, or that the data should be split up between these two categories. For now, select 'Split'. Then, click Create project to import the data.
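
To illustrate what a 'Split' category means in practice, here is a minimal Python sketch that assigns each file to 'training' or 'testing' deterministically, based on a hash of its name. The 80/20 ratio and the hashing approach are assumptions chosen for the example; this tutorial does not state how Edge Impulse performs the split.

```python
import hashlib


def assign_category(file_name, test_fraction=0.2):
    """Map a file name to 'training' or 'testing' in a repeatable way."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    # Reduce the hash to a value in [0, 1) with a resolution of 1/100.
    bucket = (int(digest, 16) % 100) / 100
    return "testing" if bucket < test_fraction else "training"


# Hypothetical file names, for illustration only.
assignments = {name: assign_category(name)
               for name in ["idle.0.cbor", "snake.0.cbor", "wave.0.cbor"]}
```

Because the category depends only on the file name, the same file lands in the same category every time, which keeps repeated experiments comparable.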

This pulls down the gesture data from the bucket and then imports it into the project. You don't need to stay on the page; the job continues running in the background.

If you now go back to your project, you have a copy of the organizational dataset at your disposal, ready to build your next machine learning model. You can also add colleagues or outside collaborators to this specific project by going to Dashboard and selecting the "Collaborators" widget. And if you want to do another experiment with the same data, you can easily create a new project with the same flow, without any fear of changing the source data. 🚀

Any questions, or interested in the enterprise version of Edge Impulse? Contact us for more information.

Appendix: advanced features

Checklists

You can optionally show a check mark in the list of data items, and show a checklist for each data item. This lets you quickly see which data items are complete (if you need to capture data from multiple sources) or whether items are in the right format.

Checklists are driven by the metadata for a data item. Set the ei_check metadata item to either 0 or 1 to show a check mark in the list. Set an ei_check_KEYNAME metadata item to 0 or 1 to show the item in the checklist.

To query for items with or without a check mark, use a filter in the form of:

metadata->ei_check = 1

To make it easy to create these lists on the fly you can set these metadata items directly from a transformation block.
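
As a rough sketch of how such metadata could be assembled, for example inside a transformation block, the Python function below turns a set of named checks into ei_check_KEYNAME entries plus an overall ei_check. Deriving ei_check as "all checks passed" is an assumption for this example, as are the function name and the use of integer values; how the metadata is written back to the data item is not shown.

```python
def checklist_metadata(checks):
    """Build checklist metadata from a mapping of check name to pass/fail."""
    metadata = {f"ei_check_{key}": int(passed) for key, passed in checks.items()}
    # Assumption: the overall check mark is shown only when every check passed.
    metadata["ei_check"] = int(all(checks.values()))
    return metadata


# Hypothetical checks for a data item captured from two sources.
example = checklist_metadata({"accelerometer": True, "audio": False})
```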

Last updated

Revision created on 6/13/2022