Dataset Analysis#
Oumi’s dataset analysis framework helps you understand your datasets: compute metrics, identify quality issues, and validate data with configurable tests.
Key capabilities:
- Profile datasets: Token counts, length distributions, turn statistics
- Quality control: Turn alternation, empty messages, invalid values
- Validate data: Configurable threshold tests with percentage tolerances
- Export results: CSV, JSON, or Parquet output with statistical summaries
Quick Start#
oumi analyze --config configs/examples/analyze/analyze.yaml
from oumi.analyze import TypedAnalyzeConfig, AnalyzerConfig
from oumi.cli.analyze import run_typed_analysis

config = TypedAnalyzeConfig(
    dataset_path="data/dataset_examples/oumi_format.jsonl",
    analyzers=[
        AnalyzerConfig(type="length", display_name="Length"),
        AnalyzerConfig(type="quality", display_name="Quality"),
    ],
)
results = run_typed_analysis(config)
Results are saved to the output directory (default: the current directory), including per-conversation metrics, test results, and statistical summaries. Set the directory via --output on the CLI or output_path in the YAML config.
Tip
You can use -c as a shorthand for --config in all CLI examples.
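For example, running the quick-start config with a custom output directory:

oumi analyze -c configs/examples/analyze/analyze.yaml --output ./my_results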
Configuration#
A minimal YAML configuration:
dataset_path: data/dataset_examples/oumi_format.jsonl
analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base
For complete configuration options including tests, custom metrics, and tokenizer settings, see Analysis Configuration.
Available Analyzers#
Length Analyzer (length)#
Computes token and message count metrics using a configurable tokenizer.
| Metric | Description |
|---|---|
| total_tokens | Total tokens across all messages |
|  | Average tokens per message |
|  | Number of messages in the conversation |
|  | Total tokens in user messages |
|  | Total tokens in assistant messages |
|  | Total tokens in system messages |
Tip
Configure the tokenizer via params.tokenizer_name. Supports tiktoken encodings (e.g., cl100k_base) and HuggingFace model IDs (e.g., meta-llama/Llama-3.1-8B-Instruct).
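For example, to count tokens with a Llama 3.1 tokenizer instead of a tiktoken encoding:

analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: meta-llama/Llama-3.1-8B-Instruct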
Quality Analyzer (quality)#
Fast, non-LLM quality checks for data validation.
| Metric | Description |
|---|---|
|  | Consecutive same-role messages exist (excluding system) |
|  | Conversation contains no user message |
|  | System message appears after position 0 |
| has_empty_turns | Any message has empty or whitespace-only content |
|  | Number of empty/whitespace-only messages |
|  | Contains serialized invalid values |
|  | List of invalid value patterns found |
Turn Stats Analyzer (turn_stats)#
Conversation structure and turn count metrics.
| Metric | Description |
|---|---|
|  | Total number of turns (messages) |
|  | Number of user turns |
|  | Number of assistant turns |
|  | Number of tool turns |
|  | Whether the conversation has a system message |
|  | Role of the first message |
|  | Role of the last message |
Use oumi analyze --list-metrics to see all available metrics and their descriptions.
Working with Results#
Output Files#
| File | Description |
|---|---|
|  | Per-conversation metrics (one row per conversation) |
|  | Test pass/fail details (if tests configured) |
|  | Statistical summary (mean, std, min, max) |
Exporting#
# Export to CSV (default)
oumi analyze --config config.yaml
# Export to JSON
oumi analyze --config config.yaml --format json
# Export to Parquet
oumi analyze --config config.yaml --format parquet
# Override output directory
oumi analyze --config config.yaml --output ./my_results
Programmatic Access#
from oumi.analyze import TypedAnalyzeConfig
from oumi.cli.analyze import run_typed_analysis

config = TypedAnalyzeConfig.from_yaml("config.yaml")
output = run_typed_analysis(config)

# Analyzer results keyed by id (defaults to display_name)
for length_result in output["results"]["Length"]:
    print(f"Tokens: {length_result.total_tokens}")

# Pre-built DataFrame (one row per conversation)
df = output["dataframe"]
print(df.describe())

# Test summary (if tests were configured)
if output["test_summary"]:
    summary = output["test_summary"]
    print(f"{summary.passed_tests}/{summary.total_tests} passed")
Analyzing HuggingFace Datasets#
Analyze any HuggingFace Hub dataset directly:
Rows must already be in Oumi conversation format (each row: {"messages": [{"role": "...", "content": "..."}]}); rows that don’t parse are skipped with a warning. To analyze instruction-style datasets (e.g., prompt/response fields), pre-convert them to Oumi JSONL first and use dataset_path; a conversion sketch follows the examples below.
# hf_analyze.yaml
dataset_name: <org>/<repo>
split: train
sample_count: 100
output_path: ./analysis_output
analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base
  - type: quality
    display_name: Quality
from oumi.analyze import TypedAnalyzeConfig, AnalyzerConfig
from oumi.cli.analyze import run_typed_analysis

config = TypedAnalyzeConfig(
    dataset_name="<org>/<repo>",
    split="train",
    sample_count=100,
    analyzers=[
        AnalyzerConfig(type="length", display_name="Length"),
        AnalyzerConfig(type="quality", display_name="Quality"),
    ],
)
results = run_typed_analysis(config)
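For the pre-conversion case mentioned above, a minimal sketch, assuming the source dataset exposes prompt and response columns (hypothetical field names; adjust to your dataset's schema):

import json

from datasets import load_dataset

# Load the instruction-style dataset from the Hub.
ds = load_dataset("<org>/<repo>", split="train")

# Write each row as an Oumi-format conversation (one JSON object per line).
with open("converted.jsonl", "w") as f:
    for row in ds:
        record = {
            "messages": [
                {"role": "user", "content": row["prompt"]},
                {"role": "assistant", "content": row["response"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

Point dataset_path at converted.jsonl and analyze as usual.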
Data Validation with Tests#
Configure tests to automatically validate your dataset against quality thresholds:
analyzers:
  - type: length
    display_name: Length
  - type: quality
    display_name: Quality

tests:
  - id: max_tokens
    type: threshold
    metric: Length.total_tokens
    operator: ">"
    value: 10000
    max_percentage: 5.0
    severity: high
    title: "Token count exceeds 10K"

  - id: no_empty_turns
    type: threshold
    metric: Quality.has_empty_turns
    operator: "=="
    value: true
    max_percentage: 5.0
    severity: high
    title: "Conversations with empty turns"
Metrics are referenced as "{id}.{field_name}" (e.g., Length.total_tokens, Quality.has_empty_turns). When id is omitted it defaults to display_name.
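For example, giving an analyzer an explicit id changes the metric prefix accordingly (a sketch, assuming id is set on the analyzer entry as implied by the defaulting behavior above):

analyzers:
  - type: length
    id: len              # metrics are now referenced as len.total_tokens, etc.
    display_name: Length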
See Analysis Configuration for full test configuration options.
Writing Custom Analyzers#
Create a custom analyzer by subclassing one of the base classes and registering it:
from pydantic import BaseModel, Field

from oumi.analyze.base import ConversationAnalyzer
from oumi.core.registry import register_sample_analyzer
from oumi.core.types.conversation import Conversation


class QuestionMetrics(BaseModel):
    num_questions: int = Field(description="Count of '?' in all messages")
    density: float = Field(description="Questions per message")


@register_sample_analyzer("questions")
class QuestionAnalyzer(ConversationAnalyzer[QuestionMetrics]):
    def analyze(self, conversation: Conversation) -> QuestionMetrics:
        total = sum(
            m.content.count("?")
            for m in conversation.messages
            if isinstance(m.content, str)
        )
        return QuestionMetrics(
            num_questions=total,
            density=total / max(len(conversation.messages), 1),
        )
Then reference it in YAML the same way as built-ins:
analyzers:
  - type: questions
    display_name: Questions
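Custom metrics can also be referenced in tests like any built-in metric (the threshold values here are illustrative):

tests:
  - id: question_density
    type: threshold
    metric: Questions.density
    operator: ">"
    value: 2.0
    max_percentage: 5.0
    severity: low
    title: "Unusually question-dense conversations"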
Base classes for different scopes:
| Base Class | Scope |
|---|---|
| MessageAnalyzer | Per message |
| ConversationAnalyzer | Per conversation |
| DatasetAnalyzer | Entire dataset |
|  | Preference pairs |
API Reference#
- TypedAnalyzeConfig: Configuration class
- AnalyzerConfig: Analyzer configuration
- AnalysisPipeline: Analysis pipeline
- ConversationAnalyzer: Base class for conversation-level analyzers
- MessageAnalyzer: Base class for message-level analyzers
- DatasetAnalyzer: Base class for dataset-level analyzers
- LengthAnalyzer: Length metrics
- DataQualityAnalyzer: Quality checks
- TurnStatsAnalyzer: Turn statistics
- TestEngine: Test engine (in-memory)
- BatchTestEngine: Incremental test engine