
Overview

The following describes the configuration schema used to create evaluators in Oumi, including:
  • Metadata about the evaluator
  • Evaluation parameters
  • Model configuration
  • Scoring and explanation settings
Fields marked with (*) are required.

Schema structure

{
  "displayName": "",
  "description": "",
  "params": {},
  "modelIdentifier": {},
  "inferenceConfig": {},
  "responseFilterMode": "",
  "generateScoreExplanation": true
}
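
As a rough illustration of the (*) markers, the Python sketch below reports which required top-level fields are missing from a configuration. The helper name `missing_required_fields` is hypothetical and not part of Oumi. The required set is read off the (*) markers on this page, not taken from an official validator.

```python
import json

# Top-level fields marked (*) on this page. This set is an assumption drawn
# from the document's markers, not from an official Oumi validator.
REQUIRED_TOP_LEVEL = (
    "displayName",
    "params",
    "modelIdentifier",
    "inferenceConfig",
    "generateScoreExplanation",
)


def missing_required_fields(config: dict) -> list:
    """Return the required top-level fields absent from config, in schema order."""
    return [name for name in REQUIRED_TOP_LEVEL if name not in config]


# A deliberately incomplete configuration, parsed from JSON text.
config = json.loads('{"displayName": "Answer Quality Evaluator", "params": {}}')
print(missing_required_fields(config))
```

An empty list means every field marked (*) is present. It does not mean the values themselves are valid.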

displayName *

Human-readable name of the evaluator.

Properties

Field        Type    Required  Description
displayName  string  Yes       Name used to identify the evaluator
description  string  No        Optional description of the evaluator

Constraints

Field        Constraint
displayName  Minimum length: 1

Example

{
  "displayName": "Answer Quality Evaluator",
  "description": "Evaluates model responses using a judge model."
}

params *

Defines the evaluation parameters and scoring behavior.

Properties

Field           Type     Required  Description
evaluatorType   string   No        Type of evaluator
prompt          string   Yes       Prompt used to guide the evaluation model
judgmentScores  object   No        Definition of scoring categories
dataFields      object   No        Defines input fields used by the evaluator
isMultiturn     boolean  No        Indicates whether evaluation is multi-turn

evaluatorType

Defines the evaluator mechanism.
Allowed values: "judge"

prompt *

Prompt provided to the evaluator model to guide scoring behavior.

Constraints

Field   Constraint
prompt  Minimum length: 1

Example

{
  "params": {
    "evaluatorType": "judge",
    "prompt": "Evaluate whether the assistant's response is accurate and helpful."
  }
}
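
The evaluatorType and prompt rules above can be checked before a configuration is submitted. The following is a minimal sketch, assuming the only rules are the ones listed on this page: "judge" is the sole allowed evaluatorType, and prompt must have a length of at least 1. The function `validate_params` is hypothetical and not an Oumi API.

```python
# Allowed values as listed on this page; extend if the schema adds more.
ALLOWED_EVALUATOR_TYPES = {"judge"}


def validate_params(params: dict) -> list:
    """Collect human-readable problems found in a params object."""
    problems = []

    # evaluatorType is only checked when present, since it is not marked (*).
    evaluator_type = params.get("evaluatorType")
    if evaluator_type is not None and evaluator_type not in ALLOWED_EVALUATOR_TYPES:
        problems.append(
            "evaluatorType must be one of " + str(sorted(ALLOWED_EVALUATOR_TYPES))
        )

    # prompt is marked (*) and has a minimum length of 1.
    prompt = params.get("prompt")
    if not isinstance(prompt, str) or len(prompt) < 1:
        problems.append("prompt is required and must have length >= 1")

    return problems


print(validate_params({"evaluatorType": "judge", "prompt": ""}))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass.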

judgmentScores

Defines the scoring structure used by the evaluator. This object typically contains the set of evaluation metrics or labels the evaluator should output. Example structure:
{
  "judgmentScores": {
    "accuracy": {},
    "helpfulness": {},
    "safety": {}
  }
}

dataFields

Defines the dataset fields that are used by the evaluator. Example structure:
{
  "dataFields": {
    "input": "user_question",
    "response": "model_answer",
    "reference": "ground_truth"
  }
}
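
To make the mapping concrete, the sketch below resolves a dataFields object against a single dataset row. It assumes that each key is the name the evaluator uses and each value is a column name in the dataset. This page does not state that interpretation explicitly, and `resolve_data_fields` is a hypothetical helper, not part of Oumi.

```python
def resolve_data_fields(data_fields: dict, row: dict) -> dict:
    """Map evaluator input names to the values pulled from one dataset row.

    Assumes data_fields maps evaluator-side names to dataset column names.
    """
    missing = [column for column in data_fields.values() if column not in row]
    if missing:
        raise KeyError(f"dataset row is missing columns: {missing}")
    return {name: row[column] for name, column in data_fields.items()}


data_fields = {"input": "user_question", "response": "model_answer"}
row = {"user_question": "What is 2 + 2?", "model_answer": "4"}
print(resolve_data_fields(data_fields, row))
```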

isMultiturn

Indicates whether the evaluator processes multi-turn conversations.
Value  Meaning
true   Evaluates conversation history
false  Evaluates single-turn responses

modelIdentifier *

Defines the model used by the evaluator. This object follows the Model Identifier schema. Example:
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Judge Model",
    "modelId": "judge_model",
    "modelVersionId": "v1"
  }
}

inferenceConfig *

Defines runtime inference behavior for the evaluator model. Example:
{
  "inferenceConfig": {
    "inferenceTemperature": 0.0,
    "inferenceMaxNewTokens": 256
  }
}

responseFilterMode

Controls which parts of the model output are used for evaluation.
Value                  Description
THINKING_AND_RESPONSE  Includes both reasoning and final response
RESPONSE_ONLY          Uses only the final response
THINKING_ONLY          Uses only the reasoning output
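
The three modes can be pictured with a small Python sketch. It assumes the model output has already been split into a reasoning part and a final response. How Oumi performs that split, and how it joins the two parts under THINKING_AND_RESPONSE, is not described here, so the newline join below is an assumption. `ResponseFilterMode` and `select_output` are hypothetical names.

```python
from enum import Enum


class ResponseFilterMode(str, Enum):
    """The three values documented for responseFilterMode."""

    THINKING_AND_RESPONSE = "THINKING_AND_RESPONSE"
    RESPONSE_ONLY = "RESPONSE_ONLY"
    THINKING_ONLY = "THINKING_ONLY"


def select_output(mode: str, thinking: str, response: str) -> str:
    """Return the text an evaluator would see under the given mode."""
    parsed = ResponseFilterMode(mode)  # raises ValueError for unknown modes
    if parsed is ResponseFilterMode.RESPONSE_ONLY:
        return response
    if parsed is ResponseFilterMode.THINKING_ONLY:
        return thinking
    # THINKING_AND_RESPONSE: joining with a newline is an assumption.
    return f"{thinking}\n{response}"


print(select_output("RESPONSE_ONLY", "2 + 2 equals 4.", "4"))
```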

generateScoreExplanation *

Determines whether the evaluator should generate an explanation for the score.
Value  Description
true   Returns a textual explanation of the score
false  Returns only the score

Complete example

{
  "displayName": "Answer Quality Evaluator",
  "description": "Evaluates responses using a judge model",
  "params": {
    "evaluatorType": "judge",
    "prompt": "Score the assistant response for accuracy and helpfulness.",
    "isMultiturn": false,
    "judgmentScores": {
      "accuracy": {},
      "helpfulness": {}
    },
    "dataFields": {
      "input": "question",
      "response": "answer"
    }
  },
  "modelIdentifier": {
    "modelType": "