Overview
This document describes the configuration schema used to synthesize a dataset in Oumi. The schema has three sections:
- Model identity
- Inference configuration
- Dataset synthesis configuration
Sections marked with an asterisk (*) are required. In the property tables, ✓ marks a required field.
Schema structure
{
  "modelIdentifier": {},
  "inferenceConfig": {},
  "synthesisConfig": {}
}
modelIdentifier *
Defines the identity and version of the model.
Properties
| Field | Type | Required | Description |
|---|---|---|---|
| modelType | string | ✓ | Category of model |
| modelName | string | ✓ | Human readable model name |
| modelId | string | ✓ | Unique identifier for the model |
| modelVersionId | string | ✓ | Version identifier for the model |
Constraints
| Field | Constraint |
|---|---|
| modelType | Minimum length: 1 |
Example
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Example Model",
    "modelId": "model_123",
    "modelVersionId": "v1"
  }
}
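The properties and constraint above could be checked along these lines. This is a minimal sketch, not Oumi's own validation code: the function name `validate_model_identifier` and the error wording are hypothetical, and it assumes the section has already been parsed into a Python dict.

```python
# Hypothetical helper; field names come from the properties table above.
REQUIRED_MODEL_FIELDS = ("modelType", "modelName", "modelId", "modelVersionId")


def validate_model_identifier(model_identifier: dict) -> list:
    """Return a list of problems; an empty list means the section passed."""
    errors = []
    for field in REQUIRED_MODEL_FIELDS:
        # Every field in this section is a required string.
        if not isinstance(model_identifier.get(field), str):
            errors.append(f"{field} is required and must be a string")

    # Only modelType has a documented minimum length.
    model_type = model_identifier.get("modelType")
    if isinstance(model_type, str) and len(model_type) < 1:
        errors.append("modelType must have a minimum length of 1")
    return errors


example = {
    "modelType": "llm",
    "modelName": "Example Model",
    "modelId": "model_123",
    "modelVersionId": "v1",
}
print(validate_model_identifier(example))
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in one pass.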
inferenceConfig *
Controls runtime inference behavior.
Properties
| Field | Type | Required | Range | Description |
|---|---|---|---|---|
| inferenceTemperature | number | optional | 0–1 | Controls output randomness |
| inferenceMaxNewTokens | integer | optional | (0..9007199254740991] | Maximum tokens generated |
| inferenceSeed | integer | optional | [-9007199254740991..9007199254740991] | Seed for deterministic outputs |
| requestsPerMinute | integer | optional | (0..9007199254740991] | Rate limit for inference requests |
Example
{
  "inferenceConfig": {
    "inferenceTemperature": 0.7,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 42,
    "requestsPerMinute": 100
  }
}
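The interval notation in the Range column can be read as follows: a round bracket excludes the bound and a square bracket includes it, so `(0..9007199254740991]` means greater than 0 and at most 9007199254740991. The sketch below applies that reading. It is illustrative only: `validate_inference_config` is a hypothetical name, and treating both endpoints of the temperature range as allowed is an assumption, since the table does not say whether 0 and 1 are inclusive.

```python
MAX_SAFE_INTEGER = 2**53 - 1  # 9007199254740991, the JavaScript safe integer limit


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_inference_config(config: dict) -> list:
    """Return a list of problems; all fields are optional, so {} passes."""
    errors = []

    if "inferenceTemperature" in config:
        value = config["inferenceTemperature"]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # Assumption: both 0 and 1 are allowed.
        if not is_number or not 0 <= value <= 1:
            errors.append("inferenceTemperature must be a number between 0 and 1")

    # Ranges written as (0..MAX] exclude zero and include MAX.
    for field in ("inferenceMaxNewTokens", "requestsPerMinute"):
        if field in config:
            value = config[field]
            if not (_is_int(value) and 0 < value <= MAX_SAFE_INTEGER):
                errors.append(f"{field} must be an integer in (0..{MAX_SAFE_INTEGER}]")

    # The seed range [-MAX..MAX] includes both bounds.
    if "inferenceSeed" in config:
        value = config["inferenceSeed"]
        if not (_is_int(value) and -MAX_SAFE_INTEGER <= value <= MAX_SAFE_INTEGER):
            errors.append("inferenceSeed must be a safe integer")

    return errors


print(validate_inference_config({"inferenceTemperature": 0.7, "inferenceSeed": 42}))
```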
synthesisConfig *
Defines dataset synthesis configuration.
Properties
| Field | Type | Required | Description |
|---|---|---|---|
| synthesisType | string | ✓ | Type of synthesis configuration |
| synthesisConfig | object | ✓ | Dataset configuration |
Allowed values
| Field | Value |
|---|---|
| synthesisType | "general" |
synthesisConfig Object
This nested object shares its name with its parent section. It defines the dataset used and how it is split.
| Field | Type | Required | Description |
|---|---|---|---|
| datasetId | string | ✓ | Dataset identifier |
| datasetDistribution | object | ✓ | Dataset split configuration |
datasetDistribution
Defines the allocation of dataset samples.
| Field | Type | Required | Range | Description |
|---|---|---|---|---|
| train | number | ✓ | 0–1 | Training split fraction |
| validation | number | ✓ | 0–1 | Validation split fraction |
| test | number | ✓ | 0–1 | Test split fraction |
The train, validation, and test fractions should sum to 1.0.
Example
{
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_abc",
      "datasetDistribution": {
        "train": 0.8,
        "validation": 0.1,
        "test": 0.1
      }
    }
  }
}
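Because the split fractions are floating-point numbers, comparing their sum to 1.0 with `==` can fail on rounding error alone. One way to check the sum is with a small tolerance, as in the sketch below. The function name `validate_distribution` and the tolerance value are assumptions made for illustration. The sketch also treats a sum other than 1.0 as an error, although the schema only says the fractions "should" sum to 1.0.

```python
import math

SPLITS = ("train", "validation", "test")


def validate_distribution(distribution: dict, tolerance: float = 1e-9) -> list:
    """Return a list of problems found in a datasetDistribution object."""
    errors = []
    for split in SPLITS:
        value = distribution.get(split)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # Assumption: both 0 and 1 are allowed for each fraction.
        if not is_number or not 0 <= value <= 1:
            errors.append(f"{split} is required and must be a number between 0 and 1")

    # Only check the sum once every individual fraction is valid.
    if not errors:
        total = sum(distribution[split] for split in SPLITS)
        if not math.isclose(total, 1.0, abs_tol=tolerance):
            errors.append(f"train + validation + test is {total}, expected 1.0")
    return errors


print(validate_distribution({"train": 0.8, "validation": 0.1, "test": 0.1}))
```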
Complete example
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Example Model",
    "modelId": "model_123",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 0.7,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 42,
    "requestsPerMinute": 100
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_abc",
      "datasetDistribution": {
        "train": 0.8,
        "validation": 0.1,
        "test": 0.1
      }
    }
  }
}
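Assuming a configuration like the one above is stored as JSON text, a loader could parse it and confirm that the three required sections are present before any field-level checks run. The sketch below is one possible shape for that step. `load_config` is a hypothetical helper, not part of any Oumi API, and the inline JSON is a shortened copy of the complete example.

```python
import json

REQUIRED_SECTIONS = ("modelIdentifier", "inferenceConfig", "synthesisConfig")


def load_config(text: str) -> dict:
    """Parse JSON text and check that all required top-level sections exist."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise ValueError("missing required sections: " + ", ".join(missing))
    return config


raw = """
{
  "modelIdentifier": {"modelType": "llm", "modelName": "Example Model",
                      "modelId": "model_123", "modelVersionId": "v1"},
  "inferenceConfig": {"inferenceTemperature": 0.7},
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_abc",
      "datasetDistribution": {"train": 0.8, "validation": 0.1, "test": 0.1}
    }
  }
}
"""
config = load_config(raw)
# The dataset settings sit one level down, under the nested synthesisConfig key.
dataset = config["synthesisConfig"]["synthesisConfig"]
print(dataset["datasetId"])
```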
Validation rules
- modelType must have a minimum length of 1
- inferenceTemperature must be between 0 and 1
- Each datasetDistribution value must be between 0 and 1
- train + validation + test should equal 1.0
- Integer values must fall within the JavaScript safe integer range (-9007199254740991 to 9007199254740991)