Overview

This page describes the configuration schema used to synthesize a dataset in Oumi. The schema has three sections:
  • Model identity
  • Inference configuration
  • Dataset synthesis configuration
Fields marked with (*) are required.

Schema structure

{
  "modelIdentifier": {},
  "inferenceConfig": {},
  "synthesisConfig": {}
}

modelIdentifier *

Defines the identity and version of the model.

Properties

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| modelType | string | | Category of model |
| modelName | string | | Human-readable model name |
| modelId | string | | Unique identifier for the model |
| modelVersionId | string | | Version identifier for the model |

Constraints

| Field | Constraint |
| --- | --- |
| modelType | Minimum length: 1 |

Example

{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Example Model",
    "modelId": "model_123",
    "modelVersionId": "v1"
  }
}

inferenceConfig *

Controls runtime inference behavior.

Properties

| Field | Type | Required | Range | Description |
| --- | --- | --- | --- | --- |
| inferenceTemperature | number | optional | 0–1 | Controls output randomness |
| inferenceMaxNewTokens | integer | optional | (0..9007199254740991] | Maximum tokens generated |
| inferenceSeed | integer | optional | [-9007199254740991..9007199254740991] | Seed for deterministic outputs |
| requestsPerMinute | integer | optional | (0..9007199254740991] | Rate limit for inference requests |

Example

{
  "inferenceConfig": {
    "inferenceTemperature": 0.7,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 42,
    "requestsPerMinute": 100
  }
}
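
The ranges in the table above can be checked before a configuration is submitted. The following is a minimal Python sketch, not part of Oumi: the function name `validate_inference_config` and its helpers are hypothetical, and it assumes that both endpoints of the 0–1 temperature range are allowed, which the table does not state explicitly.

```python
# Hypothetical helper, not an Oumi API. Bounds are copied from the table above.
MAX_SAFE_INTEGER = 9007199254740991  # 2**53 - 1


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_inference_config(cfg):
    """Return a list of error strings; an empty list means cfg passed."""
    errors = []

    # All four fields are optional, so each one is checked only when present.
    if "inferenceTemperature" in cfg:
        temp = cfg["inferenceTemperature"]
        # Assumption: 0 and 1 themselves are accepted.
        if not (_is_number(temp) and 0 <= temp <= 1):
            errors.append("inferenceTemperature must be a number in 0-1")

    # (0..MAX_SAFE_INTEGER]: strictly positive, upper bound inclusive.
    for field in ("inferenceMaxNewTokens", "requestsPerMinute"):
        if field in cfg:
            value = cfg[field]
            if not (_is_int(value) and 0 < value <= MAX_SAFE_INTEGER):
                errors.append(f"{field} must be an integer in (0..{MAX_SAFE_INTEGER}]")

    # [-MAX_SAFE_INTEGER..MAX_SAFE_INTEGER]: both bounds inclusive.
    if "inferenceSeed" in cfg:
        seed = cfg["inferenceSeed"]
        if not (_is_int(seed) and -MAX_SAFE_INTEGER <= seed <= MAX_SAFE_INTEGER):
            errors.append("inferenceSeed must be an integer in the safe integer range")

    return errors
```

Calling `validate_inference_config` on the `inferenceConfig` object from the example above returns an empty list. Returning a list of messages rather than raising on the first failure lets a caller report every problem at once.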

synthesisConfig *

Defines dataset synthesis configuration.

Properties

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| synthesisType | string | | Type of synthesis configuration |
| synthesisConfig | object | | Dataset configuration |

Allowed values

| Field | Value |
| --- | --- |
| synthesisType | "general" |

synthesisConfig Object

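The nested synthesisConfig object, which sits inside the top-level synthesisConfig section, defines which dataset is used and how it is split.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| datasetId | string | | Dataset identifier |
| datasetDistribution | object | | Dataset split configuration |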

datasetDistribution

Defines the allocation of dataset samples.

| Field | Type | Required | Range | Description |
| --- | --- | --- | --- | --- |
| train | number | | 0–1 | Training split fraction |
| validation | number | | 0–1 | Validation split fraction |
| test | number | | 0–1 | Test split fraction |

The sum of train + validation + test should typically equal 1.0.

Example

{
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_abc",
      "datasetDistribution": {
        "train": 0.8,
        "validation": 0.1,
        "test": 0.1
      }
    }
  }
}
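
Because the split fractions are floating-point numbers, an exact `== 1.0` comparison on their sum can fail for values that look correct on paper. The sketch below shows one way to check a `datasetDistribution` object with a small tolerance. The function name `validate_distribution` and the tolerance of `1e-9` are illustrative choices, and the sketch assumes all three keys are supplied, which the schema does not confirm.

```python
import math

SPLIT_KEYS = ("train", "validation", "test")


def validate_distribution(dist, tol=1e-9):
    """Return a list of error strings for a datasetDistribution dict."""
    errors = []

    for key in SPLIT_KEYS:
        # Assumption: every split must be present; a missing key reads as None.
        value = dist.get(key)
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            errors.append(f"{key} must be a number")
        elif not 0 <= value <= 1:
            errors.append(f"{key} must be between 0 and 1")

    # Only check the sum once every individual fraction is valid.
    if not errors:
        total = math.fsum(dist[key] for key in SPLIT_KEYS)
        if not math.isclose(total, 1.0, abs_tol=tol):
            errors.append(f"train + validation + test is {total}, expected 1.0")

    return errors
```

For the example above, `validate_distribution({"train": 0.8, "validation": 0.1, "test": 0.1})` returns an empty list.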

Complete example

{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Example Model",
    "modelId": "model_123",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 0.7,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 42,
    "requestsPerMinute": 100
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_abc",
      "datasetDistribution": {
        "train": 0.8,
        "validation": 0.1,
        "test": 0.1
      }
    }
  }
}

Validation rules

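  • modelType must have a minimum length of 1
  • synthesisType must be "general"
  • inferenceTemperature must be between 0 and 1
  • inferenceMaxNewTokens and requestsPerMinute must be greater than 0
  • Each datasetDistribution value must be between 0 and 1
  • train + validation + test should equal 1.0
  • Integer values must fall within the JavaScript safe integer range (-9007199254740991 to 9007199254740991)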
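
The structural side of these rules (the three required sections, the modelType length constraint, and the allowed synthesisType value) can be sketched as a small pre-flight check. This is an illustration only: `validate_structure` is a hypothetical name, not an Oumi function, and the sketch checks modelType and synthesisType only when they are present, since the schema does not state whether those individual fields are required.

```python
# Hypothetical pre-flight check, not an Oumi API.
REQUIRED_SECTIONS = ("modelIdentifier", "inferenceConfig", "synthesisConfig")
ALLOWED_SYNTHESIS_TYPES = ("general",)


def validate_structure(config):
    """Return a list of error strings for the top-level config dict."""
    errors = []

    # Each of the three top-level sections is marked as required.
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), dict):
            errors.append(f"{section} is required and must be an object")
    if errors:
        return errors

    # Assumption: field-level checks apply only when the field is present.
    model_identifier = config["modelIdentifier"]
    if "modelType" in model_identifier:
        model_type = model_identifier["modelType"]
        if not isinstance(model_type, str) or len(model_type) < 1:
            errors.append("modelType must be a string with minimum length 1")

    synthesis = config["synthesisConfig"]
    if "synthesisType" in synthesis:
        if synthesis["synthesisType"] not in ALLOWED_SYNTHESIS_TYPES:
            errors.append('synthesisType must be "general"')

    return errors
```

Loading the complete example with `json.loads` and passing the result to `validate_structure` returns an empty list.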