Overview
This document describes the schema used to configure an evaluation.
The recipe defines:
- The model being evaluated
- The evaluators (judges) used
- Inference settings
- The dataset used for evaluation
- Optional failure mode analysis
Fields marked with an asterisk (*) in a heading, or with ✓ in the Required column of a table, are required.
Schema structure
{
  "recipe": {
    "recipeConfig": {
      "type": "evaluate",
      "evaluationConfig": {}
    }
  }
}
recipe *
Root object containing the evaluation configuration.
| Field | Type | Required | Description |
|---|---|---|---|
| recipeConfig | object | ✓ | Defines the evaluation recipe configuration |
recipeConfig *
Configuration describing the recipe.
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | ✓ | Type of recipe |
| evaluationConfig | object | ✓ | Evaluation configuration |
Allowed values for type:
| Allowed Values |
|---|
| "evaluate" |
evaluationConfig *
Defines how the model evaluation should run.
| Field | Type | Required | Description |
|---|---|---|---|
| evaluationType | string | ✓ | Evaluation strategy |
| modelIdentifier | object | ✓ | Model to evaluate |
| evaluators | object[] | ✓ | List of evaluators to run |
| inferenceConfig | object | ✓ | Inference configuration |
| dataset | object | ✓ | Dataset used for evaluation |
| generateFailureModes | boolean | ✓ | Whether to analyze evaluation failures |
evaluationType
Defines the evaluation approach.
| Allowed Values |
|---|
| "single_model" |
modelIdentifier *
Specifies the model being evaluated.
| Field | Type | Required | Description |
|---|---|---|---|
| modelType | enum | ✓ | Model provider type |
| modelName | string | ✓ | Model name or identifier |
| modelId | number | | Platform model ID (custom models) |
| modelVersionId | number | | Specific model version (latest if omitted) |
| apiKeys | object | | Optional API keys |
modelType
Supported model providers.
| Value | Description |
|---|---|
| CUSTOM_CLOUD_STORAGE | Custom model stored in platform storage |
| ANTHROPIC_API | Anthropic API model |
| OPENAI_API | OpenAI API model |
| GEMINI_API | Google Gemini API model |
| VERTEX_API | Google Vertex AI model |
| OUMI_API | Oumi API model |
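The provider values above map naturally onto an enumeration. The Python sketch below shows one way a client might build a `modelIdentifier` object and reject an unsupported provider before submitting a recipe. `ModelType` and `make_model_identifier` are hypothetical names chosen for illustration; they are not part of the platform API.

```python
from enum import Enum
from typing import Any, Dict, Optional


class ModelType(str, Enum):
    """Provider values listed in the modelType table."""
    CUSTOM_CLOUD_STORAGE = "CUSTOM_CLOUD_STORAGE"
    ANTHROPIC_API = "ANTHROPIC_API"
    OPENAI_API = "OPENAI_API"
    GEMINI_API = "GEMINI_API"
    VERTEX_API = "VERTEX_API"
    OUMI_API = "OUMI_API"


def make_model_identifier(
    model_type: str,
    model_name: str,
    model_id: Optional[int] = None,
    model_version_id: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a modelIdentifier dict (hypothetical helper)."""
    # ModelType(...) raises ValueError for a value outside the table.
    identifier: Dict[str, Any] = {
        "modelType": ModelType(model_type).value,
        "modelName": model_name,
    }
    # Optional fields are left out entirely when not supplied.
    if model_id is not None:
        identifier["modelId"] = model_id
    if model_version_id is not None:
        identifier["modelVersionId"] = model_version_id
    return identifier
```

Leaving `modelVersionId` out of the result, rather than sending a null, matches the "latest if omitted" wording in the modelIdentifier table.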
apiKeys
Optional API credentials, supplied when you are not using platform credentials.
| Field | Type | Description |
|---|---|---|
| anthropic | string | Anthropic API key |
| openai | string | OpenAI API key |
| googleGemini | string | Google Gemini API key |
| googleVertex | string | Google Vertex AI API key |
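Because `apiKeys` carries secrets, it is usually safer to read them from the environment than to hardcode them in a recipe file. The Python sketch below illustrates that idea. The pairing of `modelType` values with `apiKeys` field names is an assumption inferred from the names in the two tables, and `api_keys_from_env` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Optional

# Assumed pairing of modelType values with apiKeys field names; the
# schema lists both sets but does not state this mapping explicitly.
API_KEY_FIELD = {
    "ANTHROPIC_API": "anthropic",
    "OPENAI_API": "openai",
    "GEMINI_API": "googleGemini",
    "VERTEX_API": "googleVertex",
}


def api_keys_from_env(
    model_type: str,
    env_var: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return an apiKeys dict, or {} so the caller can omit the field."""
    env = os.environ if env is None else env
    field = API_KEY_FIELD.get(model_type)
    key = env.get(env_var)
    # No matching field or no key set: return nothing rather than a blank key.
    if field is None or not key:
        return {}
    return {field: key}
```

A caller might add `apiKeys` to `modelIdentifier` only when the returned dict is non-empty, so that platform credentials apply otherwise.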
evaluators *
List of judge evaluators used during evaluation.
| Field | Type | Required | Description |
|---|---|---|---|
| evaluatorId | number | ✓ | ID of the evaluator |
| evaluatorVersion | number | | Evaluator version (latest if omitted) |
Example
{
  "evaluators": [
    {
      "evaluatorId": 10,
      "evaluatorVersion": 2
    }
  ]
}
inferenceConfig *
Defines inference parameters for the evaluated model.
| Field | Type | Required | Description |
|---|---|---|---|
| inferenceTemperature | number | | Sampling temperature |
| inferenceMaxNewTokens | number | | Maximum tokens generated |
| inferenceSeed | number | | Random seed for reproducibility |
| requestsPerMinute | number | | API rate limit |
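None of the `inferenceConfig` fields is marked as required, so a client can send only the settings it wants to control. The Python sketch below builds the object and drops anything left unset. `make_inference_config` is a hypothetical helper, and what the platform does with an omitted field is not specified in this document.

```python
from typing import Any, Dict, Optional


def make_inference_config(
    temperature: Optional[float] = None,
    max_new_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    requests_per_minute: Optional[int] = None,
) -> Dict[str, Any]:
    """Build an inferenceConfig dict containing only the supplied fields."""
    candidates = {
        "inferenceTemperature": temperature,
        "inferenceMaxNewTokens": max_new_tokens,
        "inferenceSeed": seed,
        "requestsPerMinute": requests_per_minute,
    }
    # Compare against None so that legitimate zero values (seed=0,
    # temperature=0.0) are kept.
    return {name: value for name, value in candidates.items() if value is not None}
```

For example, `make_inference_config(temperature=0.2, max_new_tokens=512)` yields an object with just those two keys.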
dataset *
Defines the dataset used for evaluation.
| Field | Type | Required | Description |
|---|---|---|---|
| datasetId | number | ✓ | Dataset identifier |
| datasetVersion | number | | Dataset version (latest if omitted) |
generateFailureModes *
Controls failure analysis generation.
| Value | Description |
|---|---|
| true | Analyze and categorize model failures |
| false | Skip failure mode analysis |
Failure modes may include:
- hallucination
- incorrect reasoning
- missing information
- formatting errors
Complete example
{
  "recipe": {
    "recipeConfig": {
      "type": "evaluate",
      "evaluationConfig": {
        "evaluationType": "single_model",
        "modelIdentifier": {
          "modelType": "OPENAI_API",
          "modelName": "gpt-4.1",
          "apiKeys": {
            "openai": "sk-xxxxxxxx"
          }
        },
        "evaluators": [
          {
            "evaluatorId": 12
          }
        ],
        "inferenceConfig": {
          "inferenceTemperature": 0.2,
          "inferenceMaxNewTokens": 512
        },
        "dataset": {
          "datasetId": 42
        },
        "generateFailureModes": true
      }
    }
  }
}
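A document with this shape can also be assembled programmatically and serialized with Python's standard `json` module. The sketch below mirrors the nesting of the complete example, without the `apiKeys` entry. `build_recipe` is a hypothetical helper, and treating an empty `inferenceConfig` object as acceptable is an assumption.

```python
import json
from typing import Any, Dict, List, Optional


def build_recipe(
    model_identifier: Dict[str, Any],
    evaluator_ids: List[int],
    dataset_id: int,
    inference_config: Optional[Dict[str, Any]] = None,
    generate_failure_modes: bool = False,
) -> Dict[str, Any]:
    """Assemble a recipe document with the nesting used by the schema."""
    return {
        "recipe": {
            "recipeConfig": {
                "type": "evaluate",
                "evaluationConfig": {
                    "evaluationType": "single_model",
                    "modelIdentifier": model_identifier,
                    # Versions are omitted, which the schema describes as "latest".
                    "evaluators": [{"evaluatorId": i} for i in evaluator_ids],
                    # The object is required; an empty one is assumed to be valid.
                    "inferenceConfig": inference_config or {},
                    "dataset": {"datasetId": dataset_id},
                    "generateFailureModes": generate_failure_modes,
                },
            }
        }
    }


recipe = build_recipe(
    model_identifier={"modelType": "OPENAI_API", "modelName": "gpt-4.1"},
    evaluator_ids=[12],
    dataset_id=42,
    inference_config={"inferenceTemperature": 0.2, "inferenceMaxNewTokens": 512},
    generate_failure_modes=True,
)
print(json.dumps(recipe, indent=2))
```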
Validation rules
- recipeConfig.type must equal "evaluate"
- evaluationType must equal "single_model"
- modelName is required
- modelType must match one of the supported providers
- datasetId must reference a valid dataset
- At least one evaluator must be provided
- generateFailureModes must be a boolean
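Most of these rules can be checked locally before a recipe is submitted. The Python sketch below is one possible client-side check, not the platform's own validator. `validate_recipe` is a hypothetical name, the input is assumed to be a dict parsed from JSON with objects at each nesting level, and the rule that `datasetId` references a valid dataset is skipped because it needs a platform lookup.

```python
from typing import Any, Dict, List

SUPPORTED_MODEL_TYPES = {
    "CUSTOM_CLOUD_STORAGE",
    "ANTHROPIC_API",
    "OPENAI_API",
    "GEMINI_API",
    "VERTEX_API",
    "OUMI_API",
}


def validate_recipe(doc: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    errors: List[str] = []
    recipe_config = doc.get("recipe", {}).get("recipeConfig", {})
    if recipe_config.get("type") != "evaluate":
        errors.append('recipeConfig.type must equal "evaluate"')

    cfg = recipe_config.get("evaluationConfig", {})
    if cfg.get("evaluationType") != "single_model":
        errors.append('evaluationType must equal "single_model"')

    model = cfg.get("modelIdentifier", {})
    if not model.get("modelName"):
        errors.append("modelName is required")
    if model.get("modelType") not in SUPPORTED_MODEL_TYPES:
        errors.append("modelType must match one of the supported providers")

    # A missing key and an empty list both count as "no evaluators".
    if not cfg.get("evaluators"):
        errors.append("At least one evaluator must be provided")

    # Only a real boolean passes; the string "true" or the number 1 does not.
    if not isinstance(cfg.get("generateFailureModes"), bool):
        errors.append("generateFailureModes must be a boolean")
    return errors
```

Returning every violation at once, instead of raising on the first, lets a caller report all problems in a recipe in a single pass.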