Analysis Configuration#
TypedAnalyzeConfig controls how Oumi analyzes datasets. See Dataset Analysis for usage examples.
Core Settings#
Parameter |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
|
Conditional |
|
HuggingFace dataset name |
|
|
Conditional |
|
Path to local JSONL file |
|
|
No |
|
Dataset split to analyze |
|
|
No |
|
Dataset subset/config name |
|
|
No |
|
Max samples to analyze ( |
|
|
No |
|
Directory for output files |
Provide either dataset_name (HuggingFace Hub) or dataset_path (local JSONL file):
Both sources must already be in Oumi conversation format (each row / line: {"messages": [{"role": "...", "content": "..."}]}). Rows that fail to parse are skipped with a warning.
dataset_name: <org>/<repo>
split: train
sample_count: 1000
dataset_path: /path/to/data.jsonl
Analyzers#
Configure analyzers as a list with type, an optional id, display_name, and params:
analyzers:
- type: length
display_name: Length
params:
tokenizer_name: cl100k_base
- type: quality
display_name: Quality
- type: turn_stats
display_name: TurnStats
Field |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
|
Yes |
— |
Analyzer identifier (e.g., |
|
|
No |
Same as |
Stable identifier. Canonical key for results, caches, and test metric paths. |
|
|
No |
Same as |
Human-readable label shown in reports and logs |
|
|
No |
|
Analyzer-specific parameters |
Test metric paths use id, so they look like "{id}.{field_name}". When id is omitted it defaults to display_name, so YAMLs that only set display_name can still write metric paths like Length.total_tokens.
Note
Each analyzer must have a unique id. display_name may repeat — two analyzers can share a label as long as their ids differ. The legacy key instance_id is accepted as an alias for display_name with a deprecation warning.
length Analyzer Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Tokenizer name (tiktoken encoding or HuggingFace model ID) |
Tiktoken encodings: cl100k_base (GPT-4), p50k_base, o200k_base.
HuggingFace models: any valid model ID (e.g., meta-llama/Llama-3.1-8B-Instruct).
quality Analyzer Parameters#
The quality analyzer has no configurable parameters. It always checks for non-alternating turns, missing user messages, misplaced system messages, empty messages, and invalid serialized values.
turn_stats Analyzer Parameters#
The turn_stats analyzer has no configurable parameters. It computes turn counts by role, system message presence, and first/last turn roles.
Tests#
Tests validate analysis results against configurable thresholds. Metrics are referenced as "{id}.{field_name}" (which equals "{display_name}.{field_name}" when id is omitted).
tests:
- id: max_tokens
type: threshold
metric: Length.total_tokens
operator: ">"
value: 10000
max_percentage: 5.0
severity: high
title: "Token count exceeds limit"
Common Fields#
Field |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
|
Yes |
— |
Unique test identifier |
|
|
Yes |
— |
Test type ( |
|
|
Yes |
— |
Metric path (e.g., |
|
|
No |
|
Failure severity: |
|
|
No |
|
Human-readable title shown in results |
|
|
No |
|
Description of what the test checks |
Threshold Tests#
Check if a metric exceeds a threshold across the dataset.
Field |
Type |
Description |
|---|---|---|
|
|
Comparison: |
|
|
Value to compare against |
|
|
At most this % of samples can match the condition |
|
|
At least this % of samples must match the condition |
# At most 5% of conversations can have > 10K tokens
- id: max_tokens
type: threshold
metric: Length.total_tokens
operator: ">"
value: 10000
max_percentage: 5.0
# At most 5% of conversations can have non-alternating turns
- id: non_alternating
type: threshold
metric: Quality.has_non_alternating_turns
operator: "=="
value: true
max_percentage: 5.0
Output Settings#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Directory for output files |
|
|
|
Generate HTML report |
|
|
|
Custom title for the report |
Override the output directory via CLI:
oumi analyze --config config.yaml --output /custom/path --format parquet
Complete Example#
dataset_path: /path/to/data.jsonl
sample_count: 1000
output_path: ./analysis_output
analyzers:
- type: length
display_name: Length
params:
tokenizer_name: cl100k_base
- type: quality
display_name: Quality
- type: turn_stats
display_name: TurnStats
tests:
- id: max_tokens
type: threshold
metric: Length.total_tokens
operator: ">"
value: 10000
max_percentage: 5.0
severity: high
title: "Token count exceeds 10K"
- id: empty_turns
type: threshold
metric: Quality.has_empty_turns
operator: "=="
value: true
max_percentage: 5.0
severity: high
title: "Conversations with empty turns"
See Also#
Dataset Analysis - Main analysis guide
TypedAnalyzeConfig- Configuration API referenceAnalyzerConfig- Analyzer configuration