Overview

This document describes the schema used to configure an evaluation. The recipe defines:
  • The model being evaluated
  • The evaluators (judges) used
  • Inference settings
  • The dataset used for evaluation
  • Optional failure mode analysis
Fields marked with (*) are required.

Schema structure

{
  "recipe": {
    "recipeConfig": {
      "type": "evaluate",
      "evaluationConfig": {}
    }
  }
}

recipe *

Root object containing the evaluation configuration.
| Field | Type | Required | Description |
|---|---|---|---|
| recipeConfig | object | Yes | Defines the evaluation recipe configuration |

recipeConfig *

Configuration describing the recipe.
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Type of recipe |
| evaluationConfig | object | Yes | Evaluation configuration |

Allowed values

| Field | Value |
|---|---|
| type | "evaluate" |

evaluationConfig *

Defines how the model evaluation should run.
| Field | Type | Required | Description |
|---|---|---|---|
| evaluationType | string | No | Evaluation strategy |
| modelIdentifier | object | Yes | Model to evaluate |
| evaluators | object[] | Yes | List of evaluators to run |
| inferenceConfig | object | Yes | Inference configuration |
| dataset | object | Yes | Dataset used for evaluation |
| generateFailureModes | boolean | Yes | Whether to analyze evaluation failures |
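A recipe payload can also be assembled programmatically. The sketch below uses plain Python (no platform SDK is assumed; the model name, evaluator ID, and dataset ID are placeholders) to build a minimal evaluationConfig inside the full recipe envelope:

```python
import json

def build_recipe(model_name: str, evaluator_ids: list, dataset_id: int) -> dict:
    """Assemble a minimal evaluation recipe payload.

    Assumes an OpenAI API model and default inference settings; adjust
    modelType (and apiKeys) for other providers.
    """
    return {
        "recipe": {
            "recipeConfig": {
                "type": "evaluate",
                "evaluationConfig": {
                    "evaluationType": "single_model",
                    "modelIdentifier": {
                        "modelType": "OPENAI_API",
                        "modelName": model_name,
                    },
                    "evaluators": [{"evaluatorId": eid} for eid in evaluator_ids],
                    "inferenceConfig": {
                        "inferenceTemperature": 0.2,
                        "inferenceMaxNewTokens": 512,
                    },
                    "dataset": {"datasetId": dataset_id},
                    "generateFailureModes": False,
                },
            }
        }
    }

# Placeholder IDs for illustration only.
payload = build_recipe("gpt-4.1", [12], 42)
print(json.dumps(payload, indent=2))
```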

evaluationType

Defines the evaluation approach.
Allowed values

"single_model"

modelIdentifier *

Specifies the model being evaluated.
| Field | Type | Required | Description |
|---|---|---|---|
| modelType | enum | Yes | Model provider type |
| modelName | string | Yes | Model name or identifier |
| modelId | number | No | Platform model ID (custom models only) |
| modelVersionId | number | No | Specific model version (latest if omitted) |
| apiKeys | object | No | Optional API keys |

modelType

Supported model providers.
| Value | Description |
|---|---|
| CUSTOM_CLOUD_STORAGE | Custom model stored in platform storage |
| ANTHROPIC_API | Anthropic API model |
| OPENAI_API | OpenAI API model |
| GEMINI_API | Google Gemini API model |
| VERTEX_API | Google Vertex AI model |
| OUMI_API | Oumi API model |
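For provider-hosted models, modelName alone identifies the model, as in the complete example below. A custom model stored in platform storage is instead referenced by its platform IDs; a sketch of such a modelIdentifier (the name and ID values are placeholders, not real identifiers):

```python
import json

# Hypothetical modelIdentifier for a CUSTOM_CLOUD_STORAGE model.
# modelId and modelVersionId are placeholder values.
model_identifier = {
    "modelType": "CUSTOM_CLOUD_STORAGE",
    "modelName": "my-finetuned-model",  # name or identifier of the model
    "modelId": 1234,                    # platform model ID
    "modelVersionId": 3,                # omit to use the latest version
}
print(json.dumps({"modelIdentifier": model_identifier}, indent=2))
```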

apiKeys

Optional API credentials if not using platform credentials.
| Field | Type | Description |
|---|---|---|
| anthropic | string | Anthropic API key |
| openai | string | OpenAI API key |
| googleGemini | string | Google Gemini API key |
| googleVertex | string | Google Vertex AI API key |

evaluators *

List of judge evaluators used during evaluation.
| Field | Type | Required | Description |
|---|---|---|---|
| evaluatorId | number | Yes | ID of the evaluator |
| evaluatorVersion | number | No | Evaluator version (latest if omitted) |

Example

{
  "evaluators": [
    {
      "evaluatorId": 10,
      "evaluatorVersion": 2
    }
  ]
}

inferenceConfig *

Defines inference parameters for the evaluated model.
| Field | Type | Required | Description |
|---|---|---|---|
| inferenceTemperature | number | — | Sampling temperature |
| inferenceMaxNewTokens | number | — | Maximum number of tokens generated |
| inferenceSeed | number | No | Random seed for reproducibility |
| requestsPerMinute | number | No | API request rate limit |
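For reproducible runs, all four fields can be pinned. A sketch of such an inferenceConfig (the specific values are illustrative, not recommendations):

```python
# Illustrative inferenceConfig pinning a seed and rate limit.
inference_config = {
    "inferenceTemperature": 0.0,    # low temperature for stable outputs
    "inferenceMaxNewTokens": 1024,  # cap on tokens generated per response
    "inferenceSeed": 1234,          # fixes sampling randomness across runs
    "requestsPerMinute": 60,        # throttle to stay under provider limits
}
```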

dataset *

Defines the dataset used for evaluation.
| Field | Type | Required | Description |
|---|---|---|---|
| datasetId | number | Yes | Dataset identifier |
| datasetVersion | number | No | Dataset version (latest if omitted) |

generateFailureModes *

Controls failure analysis generation.
| Value | Description |
|---|---|
| true | Analyze and categorize model failures |
| false | Skip failure mode analysis |
Failure modes may include:
  • hallucination
  • incorrect reasoning
  • missing information
  • formatting errors

Complete example

{
  "recipe": {
    "recipeConfig": {
      "type": "evaluate",
      "evaluationConfig": {
        "evaluationType": "single_model",
        "modelIdentifier": {
          "modelType": "OPENAI_API",
          "modelName": "gpt-4.1",
          "apiKeys": {
            "openai": "sk-xxxxxxxx"
          }
        },
        "evaluators": [
          {
            "evaluatorId": 12
          }
        ],
        "inferenceConfig": {
          "inferenceTemperature": 0.2,
          "inferenceMaxNewTokens": 512
        },
        "dataset": {
          "datasetId": 42
        },
        "generateFailureModes": true
      }
    }
  }
}

Validation rules

  • recipeConfig.type must equal "evaluate"
  • evaluationType must equal "single_model"
  • modelName is required
  • modelType must match one of the supported providers
  • datasetId must reference a valid dataset
  • At least one evaluator must be provided
  • generateFailureModes must be a boolean
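A client-side sketch of these rules (field names come from this schema; this is not an official platform implementation, and checks requiring server state, such as whether a datasetId actually exists, are reduced to presence/type checks):

```python
# Supported provider values from the modelType table above.
ALLOWED_MODEL_TYPES = {
    "CUSTOM_CLOUD_STORAGE", "ANTHROPIC_API", "OPENAI_API",
    "GEMINI_API", "VERTEX_API", "OUMI_API",
}

def validate_recipe(payload: dict) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    cfg = payload.get("recipe", {}).get("recipeConfig", {})
    if cfg.get("type") != "evaluate":
        errors.append('recipeConfig.type must equal "evaluate"')
    ev = cfg.get("evaluationConfig", {})
    if ev.get("evaluationType") != "single_model":
        errors.append('evaluationType must equal "single_model"')
    model = ev.get("modelIdentifier", {})
    if not model.get("modelName"):
        errors.append("modelName is required")
    if model.get("modelType") not in ALLOWED_MODEL_TYPES:
        errors.append("modelType must be a supported provider")
    # Whether the ID references an existing dataset is checked server-side;
    # here we only check presence and type.
    if not isinstance(ev.get("dataset", {}).get("datasetId"), int):
        errors.append("datasetId is required and must be a number")
    if not ev.get("evaluators"):
        errors.append("At least one evaluator must be provided")
    if not isinstance(ev.get("generateFailureModes"), bool):
        errors.append("generateFailureModes must be a boolean")
    return errors
```

Run against the complete example above, this returns an empty list; run against an empty payload, it reports every missing field.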