In this section, we will show how to synchronize research data with a bucket in your organizational dataset. The goal of this step is to gather data from different sources and sort it into a clean, structured dataset. We will then validate this dataset in the next section.
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
This health reference design is built around PPG-DaLiA (DOI: 10.24432/C53890), a publicly available dataset for PPG-based heart rate estimation. This multimodal dataset features physiological and motion data, recorded from both a wrist- and a chest-worn device, of 15 subjects performing a wide range of activities under close-to-real-life conditions. The included ECG data provides the heart rate ground truth. The included PPG and 3D-accelerometer data can be used for heart rate estimation while compensating for motion artefacts. Details can be found in the dataset's README file.
| File Name | Description |
| --- | --- |
| S1_activity.csv | Data containing labels of the activities. |
| S1_quest.csv | Data from the questionnaire, detailing the subjects' attributes. |
| ACC.csv | Data from the 3-axis accelerometer sensor. The accelerometer is configured to measure acceleration in the range [-2g, 2g], therefore the unit in this file is 1/64g. Data from the x, y, and z axes are respectively in the first, second, and third columns. |
| BVP.csv | Blood Volume Pulse (BVP) signal data from the photoplethysmograph. |
| EDA.csv | Electrodermal Activity (EDA) data expressed in microsiemens (μS). |
| tags.csv | Tags for the data, e.g. Stairs, Soccer, Cycling, Driving, Lunch, Walking, Working, Clean Baseline, No Activity. |
| HR.csv | Heart Rate (HR) data, as measured by the wearable device: the average heart rate extracted from the BVP signal. The first row is the initial time of the session expressed as a Unix timestamp in UTC; the second row is the sample rate expressed in Hz. |
| IBI.csv | Inter-beat Interval (IBI) data: the time between individual heartbeats extracted from the BVP signal. No sample rate is needed for this file. The first column is the time (relative to the initial time) of the detected inter-beat interval expressed in seconds; the second column is the duration in seconds of the detected inter-beat interval (i.e. the distance in seconds from the previous beat). |
| TEMP.csv | Data from the temperature sensor expressed in degrees Celsius (°C). |
| info.txt | Metadata about the participant, e.g. Age, Gender, Height, Weight, BMI. |
You can download the complete set of subject 1 files here:
We've mimicked a real research study and split the data across two locations.
Initial subject files (ACC.csv, BVP.csv, EDA.csv, HR.csv, IBI.csv, TEMP.csv, info.txt, S1_activity.csv, tags.csv) live in the company data lake in S3. The data lake uses an internal structure with non-human-readable IDs for each participant (e.g. Subject 1 is stored as S1_E4 for anonymized data):
Other files are uploaded by the research partner to an upload portal. The files are prefixed with the subject ID: the directory S2_E4 indicates that this data is from the second subject in the study, and the files themselves are prefixed with S2_ (e.g. S2_activity.csv).
This is a manual step that some countries' regulations may require. It is shown here for reference, but it is not needed or used in the rest of this example.
To create the mapping between the study ID, the subject's name, and the internal data lake ID, we can use a study master sheet. It contains information about all participants, the ID mapping, and metadata. For example:
Note: this master sheet was made using a Google Sheet, but it can be anything. All data (data lake, portal, output) is hosted in an Edge Impulse S3 bucket but can be stored anywhere (see below).
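As an illustration, here is a minimal sketch of how such a mapping could be loaded in Python. The file name and the column names (Data lake ID, Study ID, Age, BMI) are assumptions for this example, not part of the reference design; adapt them to your own sheet.

```python
# Hypothetical sketch: build a lookup from the anonymized data lake ID (e.g. "S1_E4")
# to the study ID and the metadata we want to attach to each data item later on.
import pandas as pd

# A CSV export of the study master sheet; the column names are assumptions.
master = pd.read_csv("study_master_sheet.csv")

id_to_metadata = {
    row["Data lake ID"]: {
        "study_id": row["Study ID"],
        "age": row["Age"],
        "bmi": row["BMI"],
    }
    for _, row in master.iterrows()
}

print(id_to_metadata.get("S1_E4"))
```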
Data is stored in storage buckets, which can either be hosted by Edge Impulse or in your own infrastructure. If you choose to host the data yourself, your infrastructure should be accessible through the S3 API, and you are responsible for setting up proper backups. To configure a new storage bucket, head to your organization, choose Data > Buckets, click Add new bucket, and fill in your access credentials. Our solutions engineers are also available to help you set up the buckets.
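If you host the bucket yourself, you can quickly check that it is reachable through the S3 API before adding it to Edge Impulse. The sketch below uses boto3; the endpoint URL, bucket name, and credentials are placeholders for your own infrastructure.

```python
# Minimal connectivity check against an S3-compatible bucket (placeholder values).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.your-infrastructure.example",  # omit for AWS S3
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

bucket = "your-research-bucket"
s3.head_bucket(Bucket=bucket)  # raises an error if the bucket is not reachable

# List a few objects to confirm read access
response = s3.list_objects_v2(Bucket=bucket, MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```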
With the storage bucket in place you can create your first dataset. Datasets in Edge Impulse have three layers:
Dataset: A larger set of data items grouped together.
Data item: An item with metadata and data files attached.
Data file: The actual files.
There are three ways of uploading data into your organization. You can either:
Upload data directly to the storage bucket (recommended method; see the sketch after this list). In this case use Add data... > Add dataset from bucket and the data will be discovered automatically.
Upload data through the Edge Impulse API.
Upload the files through the Upload Portals.
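For the first (recommended) option, the following sketch copies one subject's files straight into the bucket with boto3. The bucket name, prefix, and local paths are placeholders; once the files are in place, Add data... > Add dataset from bucket will discover them automatically.

```python
# Minimal sketch: push a subject's raw files into the storage bucket (placeholder
# names), so they can be discovered with "Add data... > Add dataset from bucket".
import os
import boto3

s3 = boto3.client("s3")  # credentials come from your environment or AWS config
bucket = "edge-impulse-health-reference-design"  # placeholder bucket name
prefix = "data-lake/S1_E4/"                      # anonymized subject folder

for file_name in ["ACC.csv", "BVP.csv", "EDA.csv", "HR.csv", "IBI.csv", "TEMP.csv", "info.txt"]:
    local_path = os.path.join("subject-1", file_name)
    s3.upload_file(local_path, bucket, prefix + file_name)
    print(f"Uploaded {local_path} to s3://{bucket}/{prefix}{file_name}")
```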
The sorter is the first step of the research pipeline. Its job is to fetch the data from all locations (here: internal data lake, portal, metadata from study master sheet) and create a research dataset in Edge Impulse. It does this by:
Creating a new structure in S3 like this:
Syncing the S3 folder with a research dataset in your Edge Impulse organization (like PPG-DaLiA Activity Study 2024).
Updating the metadata with the metadata from the master sheet (Age, BMI, etc.). Read on how to add and sync S3 data.
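A minimal sketch of such a sorter step is shown below. It only illustrates the idea: the bucket name, the source prefixes for the data lake and the upload portal, and the target structure are all assumptions, and the metadata update would reuse the master sheet lookup shown earlier.

```python
# Hypothetical sorter sketch: gather one subject's files from the data lake and the
# upload portal prefixes, and copy them into one per-subject folder that will be
# synced as a research dataset in your Edge Impulse organization.
import boto3

s3 = boto3.client("s3")
bucket = "edge-impulse-health-reference-design"        # placeholder bucket
sources = ["data-lake/S1_E4/", "portal-uploads/S1_"]   # assumed source prefixes
target_prefix = "ppg-dalia-activity-study-2024/S1/"    # assumed output structure

paginator = s3.get_paginator("list_objects_v2")
for source_prefix in sources:
    for page in paginator.paginate(Bucket=bucket, Prefix=source_prefix):
        for obj in page.get("Contents", []):
            file_name = obj["Key"].split("/")[-1]
            s3.copy_object(
                Bucket=bucket,
                Key=target_prefix + file_name,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
            print(f"Copied {obj['Key']} -> {target_prefix}{file_name}")
```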
With the data sorted we then:
Verify that the data is correct (see validate your research data).
Combine the data into a single Parquet file. This is essentially the contract we have for our dataset. By settling on a standard format (strongly typed, with the same column names everywhere), this data is ready to be used for ML, new algorithm development, etc. Because we also add metadata for each file here, we're very quickly building up a valuable R&D data store.
No required format for data files
There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.
Parquet is a columnar storage format that is optimized for reading and writing large datasets. It is particularly useful for data that is stored in S3 buckets, as it can be read in parallel and is highly compressed. That is why we are converting the data to Parquet in the transform block code.
See Parquet for more information, or see an example in our Create a Transform Block doc.
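As a minimal sketch of that conversion, the snippet below turns one of the raw files (HR.csv, whose format is described in the table above) into a typed Parquet file with pandas; it requires pyarrow or fastparquet to be installed. The output file name, the subject column, and the way the other signals would be merged are assumptions for illustration only.

```python
# Minimal sketch: convert HR.csv (first row = Unix start time, second row = sample
# rate in Hz, remaining rows = heart rate values) into a strongly typed Parquet file.
import pandas as pd

raw = pd.read_csv("HR.csv", header=None)
start_time = float(raw.iloc[0, 0])    # Unix timestamp (UTC)
sample_rate = float(raw.iloc[1, 0])   # in Hz
values = raw.iloc[2:, 0].astype("float32").reset_index(drop=True)

combined = pd.DataFrame({
    "timestamp": start_time + values.index.to_numpy() / sample_rate,
    "heart_rate": values,
})

# Metadata from the master sheet becomes ordinary columns, so the file is self-describing.
combined["subject"] = "S1"

# The other signals (ACC, BVP, EDA, TEMP, ...) would be aligned on the timestamp
# column in the same way before writing everything to a single file.
combined.to_parquet("S1.combined.parquet", index=False)
```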
All these steps can be run through different transformation blocks and executed one after the other using data pipelines.
Organizational datasets contain a powerful query system which lets you explore and slice data. You control the query system through the 'Filter' text box, using a language that is very similar to SQL.
Only available with Edge Impulse Enterprise Plan
Try our FREE Enterprise Trial today.
For example, here are some queries that you can make:
dataset like '%PPG%' - returns all items and files from the study.
bucket_name = 'edge-impulse-health-reference-design' AND -- labels sitting,walking - returns data that is labeled 'sitting' or 'walking' and that is stored in the 'edge-impulse-health-reference-design' bucket.
metadata->>'ei_check' = 0 - returns data that has a metadata field 'ei_check' equal to '0'.
created > DATE('2022-08-01') - returns all data that was created after Aug 1, 2022.
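You can also combine several of these fields in one filter. For example (an illustration built only from the fields and operators listed at the end of this section):
dataset LIKE '%PPG%' AND metadata->>'ei_check' = 0 AND created > DATE('2024-01-01') - returns data from the study that has a metadata field 'ei_check' of '0' and was created after Jan 1, 2024.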
After you've created a filter, you can select one or more data items, and select Actions...>Download selected to create a ZIP file with the data files. The file count reflects the number of files returned by the filter.
The previous queries all returned all files for a data item. But you can also query files through the same filter. In that case the data item will be returned, but only with the files selected. For example:
file_name LIKE '%.png' - returns all files that end with .png.
If you have an interesting query that you'd like to share with your colleagues, you can just share the URL. The query is already added to it automatically.
These are all the available fields in the query interface:
dataset - Dataset.
bucket_id - Bucket ID.
bucket_name - Bucket name.
bucket_path - Path of the data item within the bucket.
id - Data item ID.
name - Data item name.
total_file_count - Number of files for the data item.
total_file_size - Total size of all files for the data item.
created - When the data item was created.
metadata->key - Any item listed under 'metadata'.
file_name - Name of a file.
file_names - All filenames in the data item, which you can use in conjunction with CONTAINS. E.g. find all items with file X, but not file Y: file_names CONTAINS 'x' AND not file_names CONTAINS 'y'.