Split dataset
Performs a deterministic, in-place split of the project's dataset into "training", "testing", and optional "validation" sets. Assignment is based on a hash of each sample's name, so the same data is always split the same way. Split balancing can use the label, one or more metadata keys, or both as a composite grouping signal. Related samples can also be kept together in the same split by metadata key. On small datasets the call returns immediately; on larger datasets it starts a job.
For example:
{
  "trainingSplitRatio": 0.8,
  "testingSplitRatio": 0.1,
  "validationSplitRatio": 0.1,
  "excludeDisabledSamples": false,
  "stratifyBy": {
    "label": true,
    "metadataKeys": ["site", "scanner"]
  },
  "keepTogetherMetadataKeys": ["capture_group"]
}
With these options, the label and the "site" and "scanner" metadata keys are used together to balance the split, while samples that share the same "capture_group" value stay in the same split bucket.
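To illustrate how a hash-based assignment can be both deterministic and group-preserving, here is a minimal Python sketch. It is not the service's actual algorithm: the function name `assign_split`, the use of SHA-256, and the rule of hashing the keep-together values instead of the sample name are all assumptions made for illustration, and the sketch does not attempt the balancing described above.

```python
import hashlib


def assign_split(sample_name, metadata, keep_together_keys, ratios):
    """Map a sample to a split bucket using a stable hash (illustrative only).

    ratios is an ordered list of (split_name, ratio) pairs.
    """
    # Assumption: when every keep-together key is present, hash the group
    # values so related samples share a bucket; otherwise hash the name.
    values = [metadata.get(key) for key in keep_together_keys]
    if keep_together_keys and all(value is not None for value in values):
        pairs = zip(keep_together_keys, values)
        hash_key = "group:" + "|".join(f"{k}={v}" for k, v in pairs)
    else:
        hash_key = "name:" + sample_name

    # Turn the first 8 bytes of the digest into a fraction between 0 and 1.
    digest = hashlib.sha256(hash_key.encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64

    # Walk the cumulative ratios and pick the first bucket the fraction fits.
    cumulative = 0.0
    for split_name, ratio in ratios:
        cumulative += ratio
        if fraction < cumulative:
            return split_name
    # Guard against floating-point rounding at the upper edge.
    return ratios[-1][0]


ratios = [("training", 0.8), ("testing", 0.1), ("validation", 0.1)]
first = assign_split("a.wav", {"capture_group": "g1"}, ["capture_group"], ratios)
second = assign_split("b.wav", {"capture_group": "g1"}, ["capture_group"], ratios)
print(first == second)  # both samples share capture_group "g1"
```

Because the bucket depends only on the hashed key, repeated calls give the same answer, and two samples with the same `capture_group` value always land together.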
POST
Authorizations
Path Parameters
Project ID
Body
application/json
trainingSplitRatio: Proportion of the dataset to use for training.
testingSplitRatio: Proportion of the dataset to use for testing.
validationSplitRatio: Proportion of the dataset to use for validation. This is experimental and may change in the future.
excludeDisabledSamples: Whether to exclude samples that are marked as disabled.
stratifyBy: Optional balancing targets for the split.
keepTogetherMetadataKeys: List of metadata keys whose matching values must stay together in a single split. This is useful for preventing leakage across the training, validation, and testing splits.
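The body fields above can be assembled with a small helper before sending the request. The sketch below is a hypothetical client-side convenience, not part of the API: the function name `build_split_request` is made up, and the checks that each ratio lies between 0 and 1 and that the ratios sum to 1 are assumptions, since the reference does not state how the server validates them. It also assumes `validationSplitRatio` may be left out when no validation set is wanted.

```python
import json
import math


def build_split_request(training, testing, validation=None,
                        exclude_disabled=False, stratify_label=True,
                        stratify_metadata_keys=(), keep_together_keys=()):
    """Build a JSON-serializable body for the split request (hypothetical helper)."""
    ratios = [training, testing] + ([validation] if validation is not None else [])
    # Assumed sanity checks; the server's own validation rules are not documented here.
    if any(ratio < 0 or ratio > 1 for ratio in ratios):
        raise ValueError("each split ratio must be between 0 and 1")
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        raise ValueError("split ratios are expected to sum to 1")

    body = {
        "trainingSplitRatio": training,
        "testingSplitRatio": testing,
        "excludeDisabledSamples": exclude_disabled,
        "stratifyBy": {
            "label": stratify_label,
            "metadataKeys": list(stratify_metadata_keys),
        },
        "keepTogetherMetadataKeys": list(keep_together_keys),
    }
    # Assumption: the validation ratio is omitted when no validation set is requested.
    if validation is not None:
        body["validationSplitRatio"] = validation
    return body


body = build_split_request(
    0.8, 0.1, 0.1,
    stratify_metadata_keys=["site", "scanner"],
    keep_together_keys=["capture_group"],
)
print(json.dumps(body, indent=2))
```

The returned dictionary can be passed as the JSON body of the POST request with whichever HTTP client and authorization scheme the project already uses.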