1 of 5

Health reference design

In this section, you will find a health reference design that describes an end-to-end machine learning workflow for building a wearable health product using Edge Impulse.

We will utilize a publicly available dataset of PPG and accelerometer data that was collected from 15 subjects performing various activities, and emulate a clinical study to train a machine learning model that can classify activities.

Overview

The dataset selected to use in this example is the PPG-DaLiA dataset, which includes 15 subjects performing 9 activities, resulting in a total of 15 recordings. See the CSV file summary here, and read more about it in the publishers website here,

This dataset covers an activity study where data is recorded from a wearable end device (PPG + accelerometer), along with labels such as Stairs, Soccer, Cycling, Driving, Lunch, Walking, Working, Clean Baseline, and No Activity. The data is collected and validated, then written to a clinical dataset in an Edge Impulse organization, and finally imported into an Edge Impulse project where we train a classifier.

The health reference design builds transformation blocks that sync clinical data, validate the dataset, query the dataset, and transform the data to process raw data files into a unified dataset.

The design culminates in a data pipeline that handles data coming from multiple sources, data alignment, and a multi-stage pipeline before the data is imported into an Edge Impulse project.

We won't cover in detail all the code snippets, it should be straightforward to follow through, if issues are encountered our solution engineers can help you set this end-to-end ML workflow.

Health Reference Design Sections

This health reference design section helps you understand how to create a full clinical data pipeline by:

Synchronizing clinical data with a bucket: Collect and organize data from multiple sources into a sorted dataset.
Validating clinical data: Ensure the integrity and consistency of the dataset by applying checklists.
Querying clinical data: Explore and slice data using a query system.
Transforming clinical data: Process and transform raw data into a format suitable for machine learning.

Bringing it all together

After you have completed the health reference design, you can go further by combining the individual transformation steps into a data pipeline.

Refer to the following guide to learn how to build data pipelines:

Building data pipelines: Build pipelines to automate data processing steps.

The Health Reference Design pipeline consists of the following steps:

DataProcessor: Processes raw data files for each subject.
MetadataGenerator: Extracts and attaches metadata to each subject's data.
DataCombiner: Merges all processed data into a unified dataset.

Repository containing the blocks used in this health reference design:

https://github.com/edgeimpulse/health-reference-design-public-data

Data Pipeline Workflow

The data pipeline workflow for the Health Reference Design is as follows:

Creating and Running the Pipeline in Edge Impulse

Now that all transformation blocks are pushed to Edge Impulse, you can create a pipeline to chain them together.

Steps:

Access Pipelines:

In Edge Impulse Studio, navigate to your organization.
Go to Data > Pipelines.

Add a New Pipeline:

Click on + Add a new pipeline.
Name: PPG-DaLiA Data Processing Pipeline
Description: Processes PPG-DaLiA data from raw files to a combined dataset.

Configure Pipeline Steps:

Paste the following JSON configuration into the pipeline steps:

[
  {
   "name": "Process Subject Data",
   "filter": "name LIKE '%S%_E4%'",
   "uploadType": "dataset",
   "inputDatasetId": "raw-dataset",
   "outputDatasetId": "processed-dataset",
   "transformationBlockId": 1234, // Replace 1234 with your DataProcessor block ID
   "transformationParallel": 3,
   "parameters": {
    "in-directory": "."
   }
  },
  {
   "name": "Generate Metadata",
   "filter": "name LIKE '%S%_E4%'",
   "uploadType": "dataset",
   "inputDatasetId": "processed-dataset",
   "outputDatasetId": "processed-dataset",
   "transformationBlockId": 5678, // Replace 5678 with your MetadataGenerator block ID
   "transformationParallel": 3,
   "parameters": {
    "in-directory": "."
   }
  },
  {
   "name": "Combine Processed Data",
   "filter": "name LIKE '%'",
   "uploadType": "dataset",
   "inputDatasetId": "processed-dataset",
   "outputDatasetId": "combined-dataset",
   "transformationBlockId": 9101, // Replace 9101 with your DataCombiner block ID
   "transformationParallel": 1,
   "parameters": {
    "dataset-name": "ppg_dalia_combined.parquet"
   }
  }
]

Replace the transformationBlockId values with the actual IDs of your transformation blocks.

Save the Pipeline.
Run the Pipeline:

Click on the ⋮ (ellipsis) next to your pipeline.
Select Run pipeline now.

Monitor Execution:

Check the pipeline logs to ensure each step runs successfully.
Address any errors that may occur.

Verify Output:

After completion, verify that the datasets (processed-dataset and combined-dataset) have been created and populated.

Next Steps

After the pipeline has successfully run, you can import the combined dataset into an Edge Impulse project to train a machine learning model.

If you didn't complete the pipeline, don't worry, this is just a demonstration. However, you can still import the processed dataset from our HRV Analysis tutorial to train a model.

Refer to the following guides to learn how to import datasets into Edge Impulse:

HRV Analysis: Analyze Heart Rate Variability (HRV) data.
Activity Recognition: Classify activities using accelerometer data.
MLOps: Implement MLOps practices in your workflow.

Conclusion

The Health Reference Design provides a comprehensive overview of building a wearable health product using Edge Impulse. By following the steps outlined in this guide, you will gain a practical understanding of a real-world machine learning workflow for processing and analyzing clinical data.

If you have any questions or need assistance with implementing the Health Reference Design, feel free to reach out to our sales team for product development, or share your work on our forum.

Synchronizing clinical data with a bucket

In this section, we will show how to synchronize research data with a bucket in your organizational dataset. The goal of this step is to gather data from different sources and sort them to obtain a sorted dataset. We will then validate this dataset in the next section.

Only available with Edge Impulse Enterprise Plan

Try our FREE Enterprise Trial today.

The reference design described in the health reference design PPG-DaLiA DOI 10.24432/C53890 is a publicly available dataset for PPG-based heart rate estimation. This multimodal dataset features physiological and motion data, recorded from both a wrist- and a chest-worn device, of 15 subjects while performing a wide range of activities under close to real-life conditions. The included ECG data provides heart rate ground truth. The included PPG- and 3D-accelerometer data can be used for heart rate estimation, while compensating for motion artefacts. Details can be found in the dataset's readme-file.

File Name

Description

S1_activity.csv

Data containing labels of the activities.

S1_quest.csv

Data from the questionnaire, detailing the subjects' attributes.

ACC.csv

Data from 3-axis accelerometer sensor. The accelerometer is configured to measure acceleration in the range [-2g, 2g]. Therefore, the unit in this file is 1/64g. Data from x, y, and z axis are respectively in the first, second, and third column.

BVP.csv

Blood Volume Pulse (BVP) signal data from photoplethysmograph.

EDA.csv

Electrodermal Activity (EDA) data expressed as microsiemens (μS).

tags.csv

Tags for the data, e.g., Stairs, Soccer, Cycling, Driving, Lunch, Walking, Working, Clean Baseline, No Activity.

HR.csv

Heart Rate (HR) data, as measured by the wearable device. Average heart rate extracted from the BVP signal. The first row is the initial time of the session expressed as a Unix timestamp in UTC. The second row is the sample rate expressed in Hz.

IBI.csv

Inter-beat Interval (IBI) data. Time between individual heartbeats extracted from the BVP signal. No sample rate is needed for this file. The first column is the time (relative to the initial time) of the detected inter-beat interval expressed in seconds (s). The second column is the duration in seconds (s) of the detected inter-beat interval (i.e., the distance in seconds from the previous beat).

TEMP.csv

Data from temperature sensor expressed in degrees on the Celsius (°C) scale.

info.txt

Metadata about the participant, e.g., Age, Gender, Height, Weight, BMI.

You can download the complete set of subject 1 files here:

Download zip

We've mimicked a proper research study, and have split the data up into two locations.

Initial subject files (ACC.csv, BVP.csv, EDA.csv, HR.csv, IBI.csv, TEMP.csv, info.txt, S1_activity.csv, tags.csv) live in the company data lake in S3. The data lake uses an internal structure with non-human readable IDs for each participant (e.g. Subject 1 as S1_E4 for anonymized data):

Clinical_Dataset/
├── S1_E4/
│   ├── ACC.csv
│   ├── BVP.csv
│   ├── EDA.csv
│   ├── HR.csv
│   ├── IBI.csv
│   ├── TEMP.csv
│   ├── info.txt
│   ├── S1_activity.csv
│   ├── tags.csv

Other files are uploaded by the research partner to an upload portal. The files are prefixed with the subject ID:

├── S2_E4/
│   ├── ACC.csv
│   ├── BVP.csv
│   ├── EDA.csv
│   ├── HR.csv
│   ├── IBI.csv
│   ├── TEMP.csv
│   ├── info.txt
│   ├── S2_activity.csv
│   ├── tags.csv

with the directory S2_E4 indicating that this data is from the second subject in the study, or prefixing the files with S2_ (e.g. S2_activity.csv).

Anonymizing your data (optional)

This is a manual step that some countries regulations may require, this example is for reference, but not needed or used in this example.

To create the mapping between the study ID, subjects name, and the internal data lake ID we can use a study master sheet. It contains information about all participants, ID mapping, and metadata. E.g.:

Subject     Internal ID     Study date     Age     BMI
Subject_1 S1_E4         2022-01-01     25     22.5
Subject_2 S2_E4         2022-01-02     30     23.5

Notes: This master sheet was made using a Google Sheet but can be anything. All data (data lake, portal, output) are hosted in an Edge Impulse S3 bucket but can be stored anywhere (see below).

Configuring a storage bucket for your dataset

Data is stored in cloud storage buckets that are hosted in your own infrastructure. To configure a new storage bucket, head to your organization, choose Data > Buckets, click Add new bucket, and fill in your access credentials. For additional details, refer to Cloud data storage. Our solution engineers are also here to help you set up your buckets.

About datasets

With the storage bucket in place you can create your first dataset. Datasets in Edge Impulse have three layers:

Datasets in Edge Impulse have three layers:

Dataset: A larger set of data items grouped together.
Data item: An item with metadata and Data file attached.
Data file: The actual files.

Adding research data to your organization

There are three ways of uploading data into your organization. You can either:

Upload data directly to the storage bucket (recommended method). In this case use Add data... > Add dataset from bucket and the data will be discovered automatically.
Upload data through the Edge Impulse API.
Upload the files through the Upload Portals.

Sorter and combiner

Sorter

The sorter is the first step of the research pipeline. Its job is to fetch the data from all locations (here: internal data lake, portal, metadata from study master sheet) and create a research dataset in Edge Impulse. It does this by:

Creating a new structure in S3 like this:

S1_E4
|_ info.txt
|_ s1_activity.csv
|_ acc.csv
|_ bvp.csv
|_ eda.csv
|_ hr.csv
|_ ibi.csv
|_ temp.csv
|_ tags.csv
S2_E4
|_ info.txt
|_ s2_activity.csv
|_ acc.csv
|_ bvp.csv
|_ eda.csv
|_ hr.csv
|_ ibi.csv
|_ temp.csv
|_ tags.csv

Syncing the S3 folder with a research dataset in your Edge Impulse organization (like PPG-DaLiA Activity Study 2024).
Updating the metadata with the metadata from the master sheet (Age, BMI, etc...). Read on how to add and sync S3 data

Combiner

With the data sorted we then:

Need to verify that the data is correct (see validate your research data)
Combine the data into a single Parquet file. This is essentially the contract we have for our dataset. By settling on a standard format (strong typed, same column names everywhere) this data is now ready to be used for ML, new algorithm development, etc. Because we also add metadata for each file here we're very quickly building up a valuable R&D datastore.

No required format for data files

There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.

Parquet is a columnar storage format that is optimized for reading and writing large datasets. It is particularly useful for data that is stored in S3 buckets, as it can be read in parallel and is highly compressed. That is why we are converting the data to Parquet in the transform block code.

See Parquet for more information. or an example in our Create a Transform Block Doc

All these steps can be run through different transformation blocks and executed one after the other using data pipelines.

Validating clinical data

Only available with Edge Impulse Enterprise Plan

Try our FREE Enterprise Trial today.

Using Checklists

You can optionally show a check mark in the list of data items, and show a check list for data items. This can be used to quickly view which data items are complete (if you need to capture data from multiple sources) or whether items are in the right format.

Checklists look trivial, but are actually very powerful as they give quick insights in dataset issues. Missing these issues until after the study is done can be very expensive.

Checklists are written to ei-metadata.json and are automatically being picked up by the UI.

Checklists are driven by the metadata for a data item. Set the ei_check metadata item to 0 or 1 to show a check mark in the list. To show an item in the checklist, set an ei_check_KEYNAME metadata item to 0 or 1.

To query for items with or without a check mark, use a filter in the form of:

metadata->ei_check = 1

To make it easy to create these lists on the fly you can set these metadata items directly from a transformation block

Example

For the reference design described and used in the previous pages, the combiner takes in a data item, and writes out:

A checklist, e.g.:
- ✔ - PPG file present
- ✔ - Accelerometer file present
- ✘ - Correlation between HR/PPG HR is at least 0.5
If the checklist is OK, a combined.parquet file.
A hr.png file with the correlation between HR found from PPG, and HR from the reference device. This is useful for two reasons:
- If the correlation is too low we're looking at the wrong file, or data is missing.
- Verify if the PPG => HR algorithm actually works.

This makes it easy to quickly see if the data is in the right format, and if the data is complete. If the checklist is not OK, the data item is not used in the training set.

Querying clinical data

Organizational datasets contain a powerful query system which lets you explore and slice data. You control the query system through the 'Filter' text box, and you use a language which is very similar to SQL (documentation).

Only available with Edge Impulse Enterprise Plan

Try our FREE Enterprise Trial today.

For example, here are some queries that you can make:

dataset like '%PPG%' - returns all items and files from the study.
bucket_name = 'edge-impulse-health-reference-design' AND -- labels sitting,walking - returns data whose label is 'sitting' and 'walking, and that is stored in the 'edge-impulse-health-reference-design' bucket.
metadata->>'ei_check' = 0 - return data that have a metadata field 'ei_check' which is '0'.
created > DATE('2022-08-01') - returns all data that was created after Aug 1, 2022.

After you've created a filter, you can select one or more data items, and select Actions...>Download selected to create a ZIP file with the data files. The file count reflects the number of files returned by the filter.

The previous queries all returned all files for a data item. But you can also query files through the same filter. In that case the data item will be returned, but only with the files selected. For example:

file_name LIKE '%.png' - returns all files that end with .png.

If you have an interesting query that you'd like to share with your colleagues, you can just share the URL. The query is already added to it automatically.

All available fields

These are all the available fields in the query interface:

dataset - Dataset.
bucket_id - Bucket ID.
bucket_name - Bucket name.
bucket_path - Path of the data item within the bucket.
id - Data item ID.
name - Data item name.
total_file_count - Number of files for the data item.
total_file_size - Total size of all files for the data item.
created - When the data item was created.
metadata->key - Any item listed under 'metadata'.
file_name - Name of a file.
file_names - All filenames in the data item, that you can use in conjunction with CONTAINS. E.g. find all items with file X, but not file Y: file_names CONTAINS 'x' AND not file_names CONTAINS 'y'.

Transforming clinical data

Transformation blocks take raw data from your organizational datasets and convert the data into a different dataset or files that can be loaded in an Edge Impulse project. You can use transformation blocks to only include certain parts of individual data files, calculate long-running features like a running mean or derivatives, or efficiently generate features with different window lengths. Transformation blocks can be written in any language, and run on the Edge Impulse infrastructure.

No required format for data files

There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.

The PPG-DaLiA dataset is a multimodal collection featuring physiological and motion data recorded from 15 subjects.

In this tutorial we build a Python-based transformation block that loads Parquet files, we process the dataset by calculating features and transforming it into a unified schema suitable for machine learning. If you haven't done so, go through synchronizing clinical data with a bucket first.

Only available with Edge Impulse Enterprise Plan

Try our FREE Enterprise Trial today.

1. Prerequisites

You'll need:

The Edge Impulse CLI.
- If you receive any warnings that's fine. Run edge-impulse-blocks afterwards to verify that the CLI was installed correctly.
PPG-DaLiA CSV files: Download files like ACC.csv, HR.csv, EDA.csv, etc., which contain sensor data.

Transformation blocks use Docker containers, a virtualization technique that lets developers package up an application with all dependencies in a single package. If you want to test your blocks locally you'll also need (this is not a requirement):

Docker desktop installed on your machine.

2. Building your first transformation block

To build a transformation block open a command prompt or terminal window, create a new folder, and run:

$ edge-impulse-blocks init

This will prompt you to log in, and enter the details for your block. E.g.:

Edge Impulse Blocks v1.9.0
? What is your user name or e-mail address (edgeimpulse.com)? user@example.com
? What is your password? [hidden]
Attaching block to organization 'Demo Organization'
? Choose a type of block Transformation block
? Choose an option Create a new block
? Enter the name of your block DaLiA Transformation
? Enter the description of your block Process DaLiA data and extract accelerometer features
Creating block with config: {
  name: 'DaLiA Transformation',
  type: 'transform',
  description: 'Processes accelerometer and activity data from the PPG-DaLiA dataset',
  organizationId: 34
}

Then, create the following files in this directory:

2.1 - Dockerfile

We're building a Python based transformation block. The Dockerfile describes our base image (Python 3.7.5), our dependencies (in requirements.txt) and which script to run (transform.py).

FROM python:3.7.5-stretch

WORKDIR /app

# Python dependencies
COPY requirements.txt ./
RUN pip3 --no-cache-dir install -r requirements.txt

COPY . ./

ENTRYPOINT [ "python3",  "transform.py" ]

Note: Do not use a WORKDIR under /home! The /home path will be mounted in by Edge Impulse, making your files inaccessible.

ENTRYPOINT vs RUN / CMD

If you use a different programming language, make sure to use ENTRYPOINT to specify the application to execute, rather than RUN or CMD.

2.2 - requirements.txt

This file describes the dependencies for the block. We'll be using pandas and pyarrow to parse the Parquet file, and numpy to do some calculations.

numpy==1.16.4
pandas==0.23.4
pyarrow==0.16.0

2.3 - transform.py

This file includes the actual application. Transformation blocks are invoked with three parameters (as command line arguments):

--in-file or --in-directory - A file (if the block operates on a file), or a directory (if the block operates on a data item) from the organizational dataset. In this case the unified_data.parquet file.
--out-directory - Directory to write files to.
--hmac-key - You can use this HMAC key to sign the output files. This is not used in this tutorial.
--metadata - Key/value pairs containing the metadata for the data item, plus additional metadata about the data item in the dataItemInfo key. E.g.: { "subject": "AAA001", "ei_check": "1", "dataItemInfo": { "id": 101, "dataset": "Human Activity 2022", "bucketName": "edge-impulse-tutorial", "bucketPath": "janjongboom/human_activity/AAA001/", "created": "2022-03-07T09:20:59.772Z", "totalFileCount": 14, "totalFileSize": 6347421 } }

Add the following content. This takes in the Parquet file, groups data by their label, and then calculates the RMS over the X, Y and Z axes of the accelerometer.

toggle to expand code

transform.py

import numpy as np
import os
import argparse
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Parse arguments
parser = argparse.ArgumentParser(description='Simple transformation block for single directory PPG-DaLiA data processing')
parser.add_argument('--in-directory', type=str, required=True, help="Path to the directory containing the CSV files")
parser.add_argument('--out-directory', type=str, required=True, help="Path to save the transformed Parquet file")
args = parser.parse_args()

# Check input and output directories
if not os.path.exists(args.in_directory):
    print(f"Data directory {args.in_directory} does not exist.", flush=True)
    exit(1)

if not os.path.exists(args.out_directory):
    os.makedirs(args.out_directory)

# Define paths to the necessary CSV files
acc_file = os.path.join(args.in_directory, 'ACC.csv')
hr_file = os.path.join(args.in_directory, 'HR.csv')
eda_file = os.path.join(args.in_directory, 'EDA.csv')
bvp_file = os.path.join(args.in_directory, 'BVP.csv')
temp_file = os.path.join(args.in_directory, 'TEMP.csv')
activity_file = os.path.join(args.in_directory, 'S1_activity.csv')

# Check if all required files are available
for file_path in [acc_file, hr_file, eda_file, bvp_file, temp_file, activity_file]:
    if not os.path.exists(file_path):
        print(f"Missing file {file_path}. Skipping processing.", flush=True)
        exit(1)

# Load data from CSV files
acc_data = pd.read_csv(acc_file, header=None, skiprows=2, names=['accX', 'accY', 'accZ'])
hr_data = pd.read_csv(hr_file, header=None, skiprows=2, names=['heart_rate'])
eda_data = pd.read_csv(eda_file, header=None, skiprows=2, names=['eda'])
bvp_data = pd.read_csv(bvp_file, header=None, skiprows=2, names=['bvp'])
temp_data = pd.read_csv(temp_file, header=None, skiprows=2, names=['temperature'])

# Load and clean activity labels
activity_labels = pd.read_csv(activity_file, header=None, skiprows=1, names=['activity', 'start_row'])
activity_labels['activity'] = activity_labels['activity'].str.strip()  # Remove leading/trailing whitespace
activity_labels['start_row'] = pd.to_numeric(activity_labels['start_row'], errors='coerce')  # Convert to numeric, setting invalid parsing to NaN
activity_labels = activity_labels.dropna(subset=['start_row']).reset_index(drop=True)  # Remove invalid rows and reset index
activity_labels['start_row'] = activity_labels['start_row'].astype(int)

# Set default activity and map activities to rows based on start_row intervals
acc_data['activity'] = 'NO_ACTIVITY'  # Default activity
for i in range(len(activity_labels) - 1):
    activity = activity_labels.loc[i, 'activity']
    start_row = activity_labels.loc[i, 'start_row']
    end_row = activity_labels.loc[i + 1, 'start_row']
    acc_data.loc[start_row:end_row - 1, 'activity'] = activity

# Handle the last activity to the end of the dataset
last_activity = activity_labels.iloc[-1]['activity']
last_start_row = activity_labels.iloc[-1]['start_row']
acc_data.loc[last_start_row:, 'activity'] = last_activity

# Calculate features
acc_features = {
    'accX_rms': np.sqrt(np.mean(acc_data['accX']**2)),
    'accY_rms': np.sqrt(np.mean(acc_data['accY']**2)),
    'accZ_rms': np.sqrt(np.mean(acc_data['accZ']**2)),
}
hr_mean = hr_data['heart_rate'].mean()
eda_mean = eda_data['eda'].mean()
bvp_mean = bvp_data['bvp'].mean()
temp_mean = temp_data['temperature'].mean()

# Combine features and unique activity labels
features = {
    **acc_features,
    'heart_rate_mean': hr_mean,
    'eda_mean': eda_mean,
    'bvp_mean': bvp_mean,
    'temperature_mean': temp_mean,
    'activity_labels': [activity_labels['activity'].tolist()]  # Nest the list of activities
}

# Convert features to DataFrame and save as Parquet file
features_df = pd.DataFrame([features])
out_file = os.path.join(args.out_directory, 'unified_data.parquet')
table = pa.Table.from_pandas(features_df)
pq.write_table(table, out_file)

print(f'Written features Parquet file: {out_file}', flush=True)

Docker

You can also build the container locally via Docker, and test the block. The added benefit is that you don't need any dependencies installed on your local computer, and can thus test that you've included everything that's needed for the block. This requires Docker desktop to be installed.

To build the container and test the block, open a command prompt or terminal window and navigate to the source directory. First, build the container:

docker build -t ppg-dalia-transform .

[+] Building 1.8s (13/13) FINISHED                                                                                                                                          docker:desktop-linux
 => [internal] load build definition from Dockerfile                                                                                                                                        0.0s
 => => transferring dockerfile: 402B                                                                                                                                                        0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04                                                                                                                             0.9s
 => [auth] library/ubuntu:pull token for registry-1.docker.io                                                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                                                                           0.0s
 => => transferring context: 2B                                                                                                                                                             0.0s
 => [1/7] FROM docker.io/library/ubuntu:20.04@sha256:8e5c......7555141b                                                                       0.0s
 => [internal] load build context                                                                                                                                                           0.0s
 => => transferring context: 4.75MB                                                                                                                                                         0.0s
 => CACHED [2/7] WORKDIR /app                                                                                                                                                               0.0s
 => CACHED [3/7] RUN apt update && apt install -y python3 python3-distutils wget                                                                                                            0.0s
 => CACHED [4/7] RUN wget https://bootstrap.pypa.io/get-pip.py &&     python3.8 get-pip.py "pip==21.3.1" &&     rm get-pip.py                                                               0.0s
 => CACHED [5/7] COPY requirements.txt ./                                                                                                                                                   0.0s
 => CACHED [6/7] RUN pip3 --no-cache-dir install -r requirements.txt                                                                                                                        0.0s
 => [7/7] COPY . ./                                                                                                                                                                         0.5s
 => exporting to image                                                                                                                                                                      0.3s
 => => exporting layers                                                                                                                                                                     0.3s
 => => writing image sha256:856a1ec5eb879c904.......88e5e48899a                                                                                                0.0s
 => => naming to docker.io/library/ppg-dalia-transform                                                                                                                                      0.0s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/rnpnsjzniokmbvx29fj4cs0x3

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview

Then, run the container (make sure unified_data.parquet is in the same directory):

$ docker run --rm -v $PWD:/data test-org-transform-parquet-dataset --in-file /data/unified_data.parquet --out-directory /data/out

Seeing the output

This process has generated a new Parquet file in the out/ directory containing the RMS of the X, Y and Z axes. If you inspect the content of the file (e.g. using parquet-tools) you'll see the output:

If you don't have parquet-tools installed, you can install it via:

$ pip install parquet-tools

Then, run:

$  parquet-tools inspect out/unified_data.parquet

This will show you the metadata and the columns in the file:

code output block:

$
a############ file meta data ############
created_by: parquet-cpp-arrow version 18.0.0-SNAPSHOT
num_columns: 7
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 4373


############ Columns ############
accX_rms
accY_rms
accZ_rms
heart_rate_mean
eda_mean
bvp_mean
temperature_mean

############ Column(accX_rms) ############
name: accX_rms
path: accX_rms
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(accY_rms) ############
name: accY_rms
path: accY_rms
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(accZ_rms) ############
name: accZ_rms
path: accZ_rms
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)


Success!

3. Pushing the transformation block to Edge Impulse

With the block ready we can push it to your organization. Open a command prompt or terminal window, navigate to the folder you created earlier, and run:

$ edge-impulse-blocks push

This packages up your folder, sends it to Edge Impulse where it'll be built, and finally is added to your organization.

Edge Impulse Blocks v1.9.0
Archiving 'tutorial-processing-block'...
Archiving 'tutorial-processing-block' OK (2 KB) /var/folders/3r/fds0qzv914ng4t17nhh5xs5c0000gn/T/ei-transform-block-7812190951a6038c2f442ca02d428c59.tar.gz

Uploading block 'Demo dalia-ppg transformation' to organization 'Moe's Demo Org'...
Uploading block 'Demo dalia-ppg transformation' to organization 'Demo org Inc.' OK

Building transformation block 'Demo dalia-ppg transformation'...
Job started
...
Building transformation block 'Demo dalia-ppg transformation' OK

Your block has been updated, go to https://studio.edgeimpulse.com/organization/34/data to run a new transformation

The transformation block is now available in Edge Impulse under Data transformation > Transformation blocks.

If you make any changes to the block, just re-run edge-impulse-blocks push and the block will be updated.

4. Uploading unified_data.parquet to Edge Impulse

Next, upload the unified_data.parquet file, by going to Data > Add data... > Add data item, setting name as 'Gestures', dataset to 'Transform tutorial', and selecting the Parquet file.

This makes the unified_data.parquet file available from the Data page.

5. Starting the transformation

With the Parquet file in Edge Impulse and the transformation block configured you can now create a new job. Go to Data, and select the Parquet file by setting the filter to dataset = 'Transform tutorial'.

Click the checkbox next to the data item, and select Transform selected (1 file). On the 'Create transformation job' page select 'Import data into Dataset'. Under 'output dataset', select 'Same dataset as source', and under 'Transformation block' select the new transformation block.

Click Start transformation job to start the job. This pulls the data in, starts a transformation job and finally uploads the data back to your dataset. If you have multiple files selected the transformations will also run in parallel.

You can now find the transformed file back in your dataset.

6. Next steps

Transformation blocks are a powerful feature which let you set up a data pipeline to turn raw data into actionable machine learning features. It also gives you a reproducible way of transforming many files at once, and is programmable through the Edge Impulse API so you can automatically convert new incoming data. If you're interested in transformation blocks or any of the other enterprise features, let us know!

🚀

Appendix: Advanced features

Updating metadata from a transformation block

You can update the metadata of blocks directly from a transformation block by creating a ei-metadata.json file in the output directory. The metadata is then applied to the new data item automatically when the transform job finishes. The ei-metadata.json file has the following structure:

{
    "version": 1,
    "action": "add",
    "metadata": {
        "some-key": "some-value"
    }
}

Some notes:

If action is set to add the metadata keys are added to the data item. If action is set to replace all existing metadata keys are removed.

Environmental variables

Transformation blocks get access to the following environmental variables, which let you authenticate with the Edge Impulse API. This way you don't have to inject these credentials into the block. The variables are:

EI_API_KEY - an API key with 'member' privileges for the organization.
EI_ORGANIZATION_ID - the organization ID that the block runs in.
EI_API_ENDPOINT - the API endpoint (default: https://studio.edgeimpulse.com/v1).

Custom parameters

You can specify custom arguments or parameters to your block by adding a parameters.json file in the root of your block directory. This file describes all arguments for your training pipeline, and is used to render custom UI elements for each parameter. For example, this parameters file:

[{
    "name": "Bucket",
    "type": "bucket",
    "param": "bucket-name",
    "value": "",
    "help": "The bucket where you're hosting all data"
},
{
    "name": "Bucket prefix",
    "value": "my-test-prefix/",
    "type": "string",
    "param": "bucket-prefix",
    "help": "The prefix in the bucket, where you're hosting the data"
}]

Renders the following UI when you run the transformation block:

And the options are passed in as command line arguments to your block:

--bucket-name "ei-data-dev" --bucket-prefix "my-test-prefix/"

For more information, and all options see parameters.json.