Dataset Analysis#

Oumi’s dataset analysis framework helps you understand you datasets. Compute metrics, identify quality issues and validate data with configurable tests.

Key capabilities:

  • Profile datasets: Token counts, length distributions, turn statistics

  • Quality control: Turn alternation, empty messages, invalid values

  • Validate data: Configurable threshold tests with percentage tolerances

  • Export results: CSV, JSON, or Parquet output with statistical summaries

Quick Start#

oumi analyze --config configs/examples/analyze/analyze.yaml
from oumi.analyze import TypedAnalyzeConfig, AnalyzerConfig
from oumi.cli.analyze import run_typed_analysis

config = TypedAnalyzeConfig(
    dataset_path="data/dataset_examples/oumi_format.jsonl",
    analyzers=[
        AnalyzerConfig(type="length", display_name="Length"),
        AnalyzerConfig(type="quality", display_name="Quality"),
    ],
)

results = run_typed_analysis(config)

Results are saved to the output directory (default: current directory) including per-conversation metrics, test results, and statistical summaries. Set it via --output on the CLI or output_path in the YAML config.

Tip

You can use -c as a shorthand for --config in all CLI examples.

Configuration#

A minimal YAML configuration:

dataset_path: data/dataset_examples/oumi_format.jsonl

analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base

For complete configuration options including tests, custom metrics, and tokenizer settings, see Analysis Configuration.

Available Analyzers#

Length Analyzer (length)#

Computes token and message count metrics using a configurable tokenizer.

Metric

Description

total_tokens

Total tokens across all messages

avg_tokens_per_message

Average tokens per message

num_messages

Number of messages in the conversation

user_total_tokens

Total tokens in user messages

assistant_total_tokens

Total tokens in assistant messages

system_total_tokens

Total tokens in system messages

Tip

Configure the tokenizer via params.tokenizer_name. Supports tiktoken encodings (e.g., cl100k_base) and HuggingFace model IDs (e.g., meta-llama/Llama-3.1-8B-Instruct).

Quality Analyzer (quality)#

Fast, non-LLM quality checks for data validation.

Metric

Description

has_non_alternating_turns

Consecutive same-role messages exist (excluding system)

has_no_user_message

Conversation contains no user message

has_system_message_not_at_start

System message appears after position 0

has_empty_turns

Any message has empty or whitespace-only content

empty_turn_count

Number of empty/whitespace-only messages

has_invalid_values

Contains serialized NaN, null, None, undefined

invalid_value_patterns

List of invalid value patterns found

Turn Stats Analyzer (turn_stats)#

Conversation structure and turn count metrics.

Metric

Description

num_turns

Total number of turns (messages)

num_user_turns

Number of user turns

num_assistant_turns

Number of assistant turns

num_tool_turns

Number of tool turns

has_system_message

Whether the conversation has a system message

first_turn_role

Role of the first message

last_turn_role

Role of the last message

Use oumi analyze --list-metrics to see all available metrics and their descriptions.

Working with Results#

Output Files#

File

Description

analysis.{format}

Per-conversation metrics (one row per conversation)

test_results.json

Test pass/fail details (if tests configured)

summary.json

Statistical summary (mean, std, min, max)

Exporting#

# Export to CSV (default)

oumi analyze --config config.yaml

# Export to JSON

oumi analyze --config config.yaml --format json

# Export to Parquet

oumi analyze --config config.yaml --format parquet

# Override output directory

oumi analyze --config config.yaml --output ./my_results

Programmatic Access#

from oumi.analyze import TypedAnalyzeConfig
from oumi.cli.analyze import run_typed_analysis

config = TypedAnalyzeConfig.from_yaml("config.yaml")
output = run_typed_analysis(config)

# Analyzer results keyed by id (defaults to display_name)
for length_result in output["results"]["Length"]:
    print(f"Tokens: {length_result.total_tokens}")

# Pre-built DataFrame (one row per conversation)
df = output["dataframe"]
print(df.describe())

# Test summary (if tests were configured)
if output["test_summary"]:
    summary = output["test_summary"]
    print(f"{summary.passed_tests}/{summary.total_tests} passed")

Analyzing HuggingFace Datasets#

Analyze any HuggingFace Hub dataset directly:

Rows must already be in Oumi conversation format (each row: {"messages": [{"role": "...", "content": "..."}]}). Rows that don’t parse are skipped with a warning. To analyze instruction-style datasets (e.g. prompt/response fields), pre-convert them to Oumi JSONL first and use dataset_path.

# hf_analyze.yaml

dataset_name: <org>/<repo>
split: train
sample_count: 100
output_path: ./analysis_output

analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: quality
    display_name: Quality
from oumi.analyze import TypedAnalyzeConfig, AnalyzerConfig
from oumi.cli.analyze import run_typed_analysis

config = TypedAnalyzeConfig(
    dataset_name="<org>/<repo>",
    split="train",
    sample_count=100,
    analyzers=[
        AnalyzerConfig(type="length", display_name="Length"),
        AnalyzerConfig(type="quality", display_name="Quality"),
    ],
)
results = run_typed_analysis(config)

Data Validation with Tests#

Configure tests to automatically validate your dataset against quality thresholds:

analyzers:
  - type: length
    display_name: Length
  - type: quality
    display_name: Quality

tests:
  - id: max_tokens
    type: threshold
    metric: Length.total_tokens
    operator: ">"
    value: 10000
    max_percentage: 5.0
    severity: high
    title: "Token count exceeds 10K"

  - id: no_empty_turns
    type: threshold
    metric: Quality.has_empty_turns
    operator: "=="
    value: true
    max_percentage: 5.0
    severity: high
    title: "Conversations with empty turns"

Metrics are referenced as "{id}.{field_name}" (e.g., Length.total_tokens, Quality.has_empty_turns). When id is omitted it defaults to display_name.

See Analysis Configuration for full test configuration options.

Writing Custom Analyzers#

Create a custom analyzer by subclassing one of the base classes and registering it:

from pydantic import BaseModel, Field
from oumi.analyze.base import ConversationAnalyzer
from oumi.core.registry import register_sample_analyzer
from oumi.core.types.conversation import Conversation


class QuestionMetrics(BaseModel):
    num_questions: int = Field(description="Count of '?' in all messages")
    density: float = Field(description="Questions per message")


@register_sample_analyzer("questions")
class QuestionAnalyzer(ConversationAnalyzer[QuestionMetrics]):
    def analyze(self, conversation: Conversation) -> QuestionMetrics:
        total = sum(
            m.content.count("?")
            for m in conversation.messages
            if isinstance(m.content, str)
        )
        return QuestionMetrics(
            num_questions=total,
            density=total / max(len(conversation.messages), 1),
        )

Then reference it in YAML the same way as built-ins:

analyzers:
  - type: questions
    display_name: Questions

Base classes for different scopes:

Base Class

Scope

analyze() Input

MessageAnalyzer

Per message

Message

ConversationAnalyzer

Per conversation

Conversation

DatasetAnalyzer

Entire dataset

list[Conversation]

PreferenceAnalyzer

Preference pairs

(Conversation, Conversation)

API Reference#