Analysis Configuration#

TypedAnalyzeConfig controls how Oumi analyzes datasets. See Dataset Analysis for usage examples.

Core Settings#

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_name | str | Conditional | None | HuggingFace dataset name |
| dataset_path | str | Conditional | None | Path to local JSONL file |
| split | str | No | "train" | Dataset split to analyze |
| subset | str | No | None | Dataset subset/config name |
| sample_count | int | No | None | Max samples to analyze (None = all) |
| output_path | str | No | "." | Directory for output files |

Provide either dataset_name (HuggingFace Hub) or dataset_path (local JSONL file):

Both sources must already be in Oumi conversation format (one JSON object per row or line: {"messages": [{"role": "...", "content": "..."}]}). Rows that fail to parse are skipped with a warning.
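For example, a single JSONL line in this format (the field values here are illustrative) looks like:

```json
{"messages": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "4"}]}
```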

```yaml
# From the HuggingFace Hub:
dataset_name: <org>/<repo>
split: train
sample_count: 1000
```

```yaml
# Or from a local JSONL file:
dataset_path: /path/to/data.jsonl
```

Analyzers#

Configure analyzers as a list with type, an optional id, display_name, and params:

```yaml
analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: quality
    display_name: Quality
  - type: turn_stats
    display_name: TurnStats
```

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | str | Yes | | Analyzer identifier (e.g., length, quality, turn_stats) |
| id | str | No | Same as display_name | Stable identifier. Canonical key for results, caches, and test metric paths. |
| display_name | str | No | Same as type | Human-readable label shown in reports and logs |
| params | dict | No | {} | Analyzer-specific parameters |

Test metric paths use id and take the form "{id}.{field_name}". When id is omitted it defaults to display_name, so a YAML that only sets display_name can still reference metric paths like Length.total_tokens.
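For instance, giving the length analyzer an explicit id changes the metric path a test must use (a sketch; the id value here is arbitrary):

```yaml
analyzers:
  - type: length
    id: length_main
    display_name: Length

tests:
  - id: max_tokens
    type: threshold
    metric: length_main.total_tokens   # uses the analyzer's id, not its display_name
    operator: ">"
    value: 10000
    max_percentage: 5.0
```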

Note

Each analyzer must have a unique id. display_name may repeat: two analyzers can share a label as long as their ids differ. The legacy key instance_id is accepted as an alias for display_name and emits a deprecation warning.
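As a sketch, two length analyzers can share a label while keeping distinct ids (the id values and the choice of second encoding are illustrative):

```yaml
analyzers:
  - type: length
    id: length_gpt4
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: length
    id: length_p50k
    display_name: Length          # same label is fine; the ids differ
    params:
      tokenizer_name: p50k_base
```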

length Analyzer Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tokenizer_name | str | cl100k_base | Tokenizer name (tiktoken encoding or HuggingFace model ID) |

Tiktoken encodings: cl100k_base (GPT-4), p50k_base, o200k_base. HuggingFace models: any valid model ID (e.g., meta-llama/Llama-3.1-8B-Instruct).
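For example, to measure lengths with a model's own tokenizer instead of a tiktoken encoding:

```yaml
analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: meta-llama/Llama-3.1-8B-Instruct
```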

quality Analyzer Parameters#

The quality analyzer has no configurable parameters. It always checks for non-alternating turns, missing user messages, misplaced system messages, empty messages, and invalid serialized values.

turn_stats Analyzer Parameters#

The turn_stats analyzer has no configurable parameters. It computes turn counts by role, system message presence, and first/last turn roles.
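Its outputs can feed threshold tests once the output field names are known; the field name below (num_turns) is a hypothetical placeholder, not a documented metric, so check your analyzer's actual output before using it:

```yaml
tests:
  - id: min_turns
    type: threshold
    metric: TurnStats.num_turns   # hypothetical field name; verify against real output
    operator: ">="
    value: 2
    min_percentage: 95.0
```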

Tests#

Tests validate analysis results against configurable thresholds. Metrics are referenced as "{id}.{field_name}" (which equals "{display_name}.{field_name}" when id is omitted).

```yaml
tests:
  - id: max_tokens
    type: threshold
    metric: Length.total_tokens
    operator: ">"
    value: 10000
    max_percentage: 5.0
    severity: high
    title: "Token count exceeds limit"
```

Common Fields#

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| id | str | Yes | | Unique test identifier |
| type | str | Yes | | Test type (threshold) |
| metric | str | Yes | | Metric path (e.g., Length.total_tokens) |
| severity | str | No | medium | Failure severity: high, medium, or low |
| title | str | No | "" | Human-readable title shown in results |
| description | str | No | "" | Description of what the test checks |

Threshold Tests#

Compare a metric against a threshold value and bound what fraction of samples may (or must) match the condition.

| Field | Type | Description |
| --- | --- | --- |
| operator | str | Comparison: <, >, <=, >=, ==, != |
| value | number | Value to compare against |
| max_percentage | float | At most this % of samples can match the condition |
| min_percentage | float | At least this % of samples must match the condition |
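min_percentage inverts the bound: instead of capping matches, it requires a minimum share of them. A sketch (the specific threshold values are illustrative):

```yaml
# At least 90% of conversations must have >= 10 tokens
- id: min_tokens
  type: threshold
  metric: Length.total_tokens
  operator: ">="
  value: 10
  min_percentage: 90.0
```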

```yaml
# At most 5% of conversations can have > 10K tokens
- id: max_tokens
  type: threshold
  metric: Length.total_tokens
  operator: ">"
  value: 10000
  max_percentage: 5.0

# At most 5% of conversations can have non-alternating turns
- id: non_alternating
  type: threshold
  metric: Quality.has_non_alternating_turns
  operator: "=="
  value: true
  max_percentage: 5.0
```

Output Settings#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| output_path | str | "." | Directory for output files |
| generate_report | bool | false | Generate HTML report |
| report_title | str | None | Custom title for the report |
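For example, to write results to a custom directory and produce an HTML report (the title string is arbitrary):

```yaml
output_path: ./analysis_output
generate_report: true
report_title: "My Dataset Analysis"
```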

Override output settings from the CLI:

```shell
oumi analyze --config config.yaml --output /custom/path --format parquet
```

Complete Example#

```yaml
dataset_path: /path/to/data.jsonl
sample_count: 1000
output_path: ./analysis_output

analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: quality
    display_name: Quality
  - type: turn_stats
    display_name: TurnStats

tests:
  - id: max_tokens
    type: threshold
    metric: Length.total_tokens
    operator: ">"
    value: 10000
    max_percentage: 5.0
    severity: high
    title: "Token count exceeds 10K"
  - id: empty_turns
    type: threshold
    metric: Quality.has_empty_turns
    operator: "=="
    value: true
    max_percentage: 5.0
    severity: high
    title: "Conversations with empty turns"
```

See Also#