Analysis Configuration#

TypedAnalyzeConfig controls how Oumi analyzes datasets. See Dataset Analysis for usage examples.

Core Settings#

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_name | str | Conditional | None | HuggingFace dataset name |
| dataset_path | str | Conditional | None | Path to local JSONL file |
| split | str | No | "train" | Dataset split to analyze |
| subset | str | No | None | Dataset subset/config name |
| sample_count | int | No | None | Max samples to analyze (None = all) |
| output_path | str | No | "." | Directory for output files |

Provide either dataset_name (HuggingFace Hub) or dataset_path (local JSONL file):

Both sources must already be in Oumi conversation format (one JSON object per row or line: {"messages": [{"role": "...", "content": "..."}]}). Rows that fail to parse are skipped with a warning.
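For example, a single JSONL line in this format (the field values here are illustrative) looks like:

```json
{"messages": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "4"}]}
```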

```yaml
# From the HuggingFace Hub:
dataset_name: <org>/<repo>
split: train
sample_count: 1000
```

```yaml
# Or from a local JSONL file:
dataset_path: /path/to/data.jsonl
```

Analyzers#

Configure analyzers as a list with type, an optional id, display_name, and params:

```yaml
analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: quality
    display_name: Quality
  - type: turn_stats
    display_name: TurnStats
```

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | str | Yes | | Analyzer identifier (e.g., length, quality, turn_stats) |
| id | str | No | Same as display_name | Stable identifier. Canonical key for results, caches, and test metric paths. |
| display_name | str | No | Same as type | Human-readable label shown in reports and logs |
| params | dict | No | {} | Analyzer-specific parameters |

Test metric paths use id and take the form "{id}.{field_name}". When id is omitted it defaults to display_name, so a YAML that only sets display_name can still reference metric paths like Length.total_tokens.
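For instance, giving the length analyzer an explicit id changes the metric path a test must use (a sketch; the id value here is arbitrary):

```yaml
analyzers:
  - type: length
    id: length_main
    display_name: Length

tests:
  - id: max_tokens
    type: threshold
    metric: length_main.total_tokens   # uses the analyzer's id, not its display_name
    operator: ">"
    value: 10000
    max_percentage: 5.0
```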

Note

Each analyzer must have a unique id. display_name may repeat: two analyzers can share a label as long as their ids differ. The legacy key instance_id is accepted as an alias for display_name and emits a deprecation warning.
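As a sketch, two length analyzers can share a label while keeping distinct ids (the id values and the choice of second encoding are illustrative):

```yaml
analyzers:
  - type: length
    id: length_gpt4
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: length
    id: length_p50k
    display_name: Length          # same label is fine; the ids differ
    params:
      tokenizer_name: p50k_base
```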

length Analyzer Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tokenizer_name | str | cl100k_base | Tokenizer name (tiktoken encoding or HuggingFace model ID) |

Tiktoken encodings: cl100k_base (GPT-4), p50k_base, o200k_base. HuggingFace models: any valid model ID (e.g., meta-llama/Llama-3.1-8B-Instruct).
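For example, to measure lengths with a model's own tokenizer instead of a tiktoken encoding:

```yaml
analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: meta-llama/Llama-3.1-8B-Instruct
```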

quality Analyzer Parameters#

The quality analyzer has no configurable parameters. It always checks for non-alternating turns, missing user messages, misplaced system messages, empty messages, and invalid serialized values.

turn_stats Analyzer Parameters#

The turn_stats analyzer has no configurable parameters. It computes turn counts by role, system message presence, and first/last turn roles.
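Its outputs can feed threshold tests once the output field names are known; the field name below (num_turns) is a hypothetical placeholder, not a documented metric, so check your analyzer's actual output before using it:

```yaml
tests:
  - id: min_turns
    type: threshold
    metric: TurnStats.num_turns   # hypothetical field name; verify against real output
    operator: ">="
    value: 2
    min_percentage: 95.0
```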

Tests#

Tests validate analysis results against configurable thresholds. Metrics are referenced as "{id}.{field_name}" (which equals "{display_name}.{field_name}" when id is omitted).

```yaml
tests:
  - id: max_tokens
    type: threshold
    metric: Length.total_tokens
    operator: ">"
    value: 10000
    max_percentage: 5.0
    severity: high
    title: "Token count exceeds limit"
```

Common Fields#

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| id | str | Yes | | Unique test identifier |
| type | str | Yes | | Test type (threshold) |
| metric | str | Yes | | Metric path (e.g., Length.total_tokens) |
| severity | str | No | medium | Failure severity: high, medium, or low |
| title | str | No | "" | Human-readable title shown in results |
| description | str | No | "" | Description of what the test checks |

Threshold Tests#

Compare a metric against a threshold value and bound what fraction of samples may (or must) match the condition.

| Field | Type | Description |
| --- | --- | --- |
| operator | str | Comparison: <, >, <=, >=, ==, != |
| value | number | Value to compare against |
| max_percentage | float | At most this % of samples can match the condition |
| min_percentage | float | At least this % of samples must match the condition |
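min_percentage inverts the bound: instead of capping matches, it requires a minimum share of them. A sketch (the specific threshold values are illustrative):

```yaml
# At least 90% of conversations must have >= 10 tokens
- id: min_tokens
  type: threshold
  metric: Length.total_tokens
  operator: ">="
  value: 10
  min_percentage: 90.0
```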

```yaml
# At most 5% of conversations can have > 10K tokens
- id: max_tokens
  type: threshold
  metric: Length.total_tokens
  operator: ">"
  value: 10000
  max_percentage: 5.0

# At most 5% of conversations can have non-alternating turns
- id: non_alternating
  type: threshold
  metric: Quality.has_non_alternating_turns
  operator: "=="
  value: true
  max_percentage: 5.0
```

Output Settings#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| output_path | str | "." | Directory for output files |
| generate_report | bool | false | Generate HTML report |
| report_title | str | None | Custom title for the report |
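For example, to write results to a custom directory and produce an HTML report (the title string is arbitrary):

```yaml
output_path: ./analysis_output
generate_report: true
report_title: "My Dataset Analysis"
```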

Override output settings from the CLI:

```shell
oumi analyze --config config.yaml --output /custom/path --format parquet
```

Complete Example#

```yaml
dataset_path: /path/to/data.jsonl
sample_count: 1000
output_path: ./analysis_output

analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: quality
    display_name: Quality
  - type: turn_stats
    display_name: TurnStats

tests:
  - id: max_tokens
    type: threshold
    metric: Length.total_tokens
    operator: ">"
    value: 10000
    max_percentage: 5.0
    severity: high
    title: "Token count exceeds 10K"
  - id: empty_turns
    type: threshold
    metric: Quality.has_empty_turns
    operator: "=="
    value: true
    max_percentage: 5.0
    severity: high
    title: "Conversations with empty turns"
```

See Also#