Data Synthesis#
The oumi synth command enables you to generate synthetic datasets using large language models. Instead of manually creating training data, you can define rules and templates that automatically generate diverse, high-quality examples.
What You Can Build#
Question-Answer datasets for training chatbots
Instruction-following datasets with varied complexity levels
Domain-specific training data (legal, medical, technical)
Conversation datasets with different personas or styles
Data augmentation to expand existing small datasets
How It Works#
The synthesis process follows three steps:
Define attributes - What varies in your data (topic, difficulty, style, etc.)
Create templates - How the AI should generate content using those attributes
Generate samples - The system creates many examples by combining different attribute values
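Conceptually, the loop is tiny. Here is a minimal Python sketch of the three steps; the attribute names and template are illustrative, and a real run would send each prompt to an LLM:

```python
import json
import random

# Step 1: define attributes (what varies across samples).
attributes = {
    "topic": ["geography", "history"],
    "difficulty": ["easy", "hard"],
}

# Step 2: create a template (how each sample's prompt is built).
template = "Create a {difficulty} {topic} quiz question."

# Step 3: generate samples by combining attribute values.
# A real run would send each prompt to an LLM; here we only build prompts.
samples = []
for _ in range(4):
    combo = {name: random.choice(values) for name, values in attributes.items()}
    samples.append({**combo, "prompt": template.format(**combo)})

for sample in samples:
    print(json.dumps(sample))
```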
Your First Synthesis#
Let’s create a simple question-answer dataset. Save this as my_first_synth.yaml:
# Generate 10 geography questions
strategy: GENERAL
num_samples: 10
output_path: geography_qa.jsonl
strategy_params:
# Give the AI an example to learn from
input_examples:
- examples:
- example_question: "What is the capital of France?"
# Define what should vary across examples
sampled_attributes:
- id: difficulty
name: Difficulty Level
description: How challenging the question should be
possible_values:
- id: easy
name: Easy
description: Basic facts everyone should know
- id: hard
name: Hard
description: Detailed knowledge for experts
# Tell the AI how to generate questions and answers
generated_attributes:
- id: question
instruction_messages:
- role: SYSTEM
content: "You are a geography teacher creating quiz questions. Example: {example_question}"
- role: USER
content: "Create a {difficulty} geography question. Write the question only, not the answer."
- id: answer
instruction_messages:
- role: SYSTEM
content: "You are a helpful AI assistant."
- role: USER
content: "{question}"
# Configure which AI model to use
inference_config:
model:
model_name: claude-3-5-sonnet-20240620
engine: ANTHROPIC
Run it with:
oumi synth -c my_first_synth.yaml
What happens: The system will create 10 geography questions, some easy and some hard, saved to geography_qa.jsonl.
Understanding the Results#
After running synthesis, you’ll see:
A preview table showing the first few generated samples
The total number of samples created
Instructions for using the dataset in training
Each line in the output file contains one example:
{"difficulty": "easy", "question": "What is the largest continent?", "answer": "Asia"}
{"difficulty": "hard", "question": "Which country has the most time zones?", "answer": "France"}
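Because the output is plain JSONL, it can be loaded with nothing but the standard library. A small sketch, using the two example lines above as inline data:

```python
import json

# Two lines in the shape oumi synth writes to geography_qa.jsonl.
raw = """{"difficulty": "easy", "question": "What is the largest continent?", "answer": "Asia"}
{"difficulty": "hard", "question": "Which country has the most time zones?", "answer": "France"}"""

# One JSON object per line; parse each line independently.
rows = [json.loads(line) for line in raw.splitlines()]
easy = [r for r in rows if r["difficulty"] == "easy"]
print(f"{len(rows)} samples, {len(easy)} easy")
```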
Next Steps: Building More Complex Datasets#
Once you’re comfortable with the basics, you can create more sophisticated datasets:
Adding Multiple Attributes#
Mix and match different properties (topic + difficulty + style):
sampled_attributes:
- id: topic
possible_values: [{id: geography}, {id: history}, {id: science}]
- id: difficulty
possible_values: [{id: easy}, {id: medium}, {id: hard}]
- id: style
possible_values: [{id: formal}, {id: casual}, {id: academic}]
Using Your Own Data#
Feed in existing datasets or documents:
input_data:
- path: "my_existing_data.jsonl"
input_documents:
- path: "textbook.pdf"
Supported dataset formats (input_data): JSONL, JSON, CSV, TSV, Parquet, and XLSX. For XLSX files, every sheet is concatenated into a single dataset, so you can keep related tabs in one workbook. Globs are supported:
input_data:
- path: "data/**/*.xlsx"
Supported document formats (input_documents): .pdf, .txt, .md, .html, and .docx. DOCX files are parsed paragraph-by-paragraph.
Note
XLSX / DOCX parsing requires the synthesis extras: pip install oumi[synthesis].
Few-Shot Sampling From Sources#
When you want each synthesized sample to see multiple randomly drawn items from a source (examples, datasets, or documents), use num_shots. This turns the source into a dynamic few-shot pool instead of round-robin enumeration.
input_examples:
- id: few_shot_examples
num_shots: 3 # draw 3 examples per synthesis sample
examples:
- task_type: "summarization"
example_input: "..."
- task_type: "translation"
example_input: "..."
# ...
generated_attributes:
- id: instruction
instruction_messages:
- role: USER
content: |
Example 1: {few_shot_examples[0].example_input}
Example 2: {few_shot_examples[1].example_input}
Example 3: {few_shot_examples[2].example_input}
Now produce a new, different example.
Rules:
num_shots: None or 1 → the source behaves as before (round-robin); reference fields as {id.field}.
num_shots > 1 → bracket notation {id[i].field} is required, and id must be set.
Works uniformly across input_examples, input_data, and input_documents.
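Under the hood, drawing shots is essentially sampling without replacement. A hedged sketch (the pool contents are illustrative, and the exact sampling behavior may differ from oumi's implementation):

```python
import random

# Illustrative few-shot pool; contents stand in for the YAML examples.
pool = [
    {"task_type": "summarization", "example_input": "Summarize this article ..."},
    {"task_type": "translation", "example_input": "Translate this sentence ..."},
    {"task_type": "classification", "example_input": "Label this review ..."},
    {"task_type": "rewriting", "example_input": "Rewrite this paragraph ..."},
]

num_shots = 3
# Draw num_shots distinct examples for one synthesis sample.
shots = random.sample(pool, num_shots)

# Bracket references like {few_shot_examples[0].example_input}
# resolve against the drawn shots, index by index.
prompt = "\n".join(
    f"Example {i + 1}: {shot['example_input']}" for i, shot in enumerate(shots)
)
print(prompt)
```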
A runnable example lives at oumi-ai/oumi.
Creating Conversations#
Build multi-turn dialogues with fixed structure using transformed attributes:
transformed_attributes:
- id: conversation
transformation_strategy:
type: CHAT
chat_transform:
messages:
- role: USER
content: "{question}"
- role: ASSISTANT
content: "{answer}"
Creating Multi-Turn Conversations#
For dynamic, variable-length conversations, use multiturn_attributes. Each turn is generated by the model with full conversation context, producing natural back-and-forth dialogue:
multiturn_attributes:
- id: support_conversation
min_turns: 4
max_turns: 12
role_instruction_messages:
USER: |
You are a customer contacting support.
Your issue: {issue_detail}
ASSISTANT: |
You are a helpful support agent.
Be professional and resolve the customer's issue.
output_system_prompt: |
You are a helpful support agent.
Ready to dive deeper? The sections below cover all available options in detail.
Environment-First Tool Synthesis#
Agentic synthesis now follows an environment-first model. Tools do not declare an output strategy directly. Instead, each tool is bound to an environment, and the environment type defines how tool calls are executed via its step() method.
synthetic environments are backed by an LLM that simulates tool execution. They can be stateless (no persistent state) or stateful (mutable JSON state across turns). Statefulness is controlled by the optional state_params field: when provided, the environment tracks and mutates state across calls; when absent, each call is independent.
deterministic environments behave like lookup tables. Each tool defines a set of input-to-output mappings, and step() resolves tool calls by matching arguments against those mappings. No LLM is involved.
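As a rough illustration of the two execution models (this is not the oumi API; the function names and data shapes are hypothetical), step() dispatch could look like:

```python
# Hypothetical sketch of environment step() dispatch; names and data
# shapes are illustrative and are not the oumi API.
def step(environment, tool_name, arguments, llm_simulate):
    if environment["type"] == "deterministic":
        # Lookup table: match the call's arguments against declared mappings.
        tool = next(t for t in environment["tools"] if t["id"] == tool_name)
        for mapping in tool["deterministic_outputs"]:
            if mapping["input"] == arguments:
                return mapping["output"]
        raise KeyError(f"no deterministic mapping for {arguments}")
    # synthetic: an LLM simulates execution, optionally against mutable state.
    state = environment.get("state")  # present only for stateful environments
    return llm_simulate(tool_name, arguments, state)

env = {
    "type": "deterministic",
    "tools": [{
        "id": "get_refund_policy",
        "deterministic_outputs": [
            {"input": {"policy_type": "standard"},
             "output": {"policy": "Standard 30-day refund policy"}},
        ],
    }],
}
print(step(env, "get_refund_policy", {"policy_type": "standard"}, None))
```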
At the config level:
Environments own their tool definitions.
Reusable environment catalogs live in top-level environment_config or environment_config_path.
Tools do not declare an environment field. The parent environment owns the binding.
deterministic_outputs is only used for tools in deterministic environments.
read_only is only meaningful for tools in stateful synthetic environments.
Multiturn attributes reference environments (not individual tools) to select which tools are available.
Example:
environment_config:
environments:
- id: support_backend
name: Support Backend
description: Simulated support system with tickets and users
type: synthetic
system_prompt: You manage a customer support system with tickets and users.
state_params:
state_schema:
type: object
properties:
tickets: { type: array }
users: { type: array }
initial_state:
tickets: []
users: []
tools:
- id: get_ticket
name: GetTicket
description: Read a ticket from the support backend.
read_only: true
parameters:
type: object
properties:
ticket_id: { type: string }
- id: create_ticket
name: CreateTicket
description: Create a new support ticket.
read_only: false
parameters:
type: object
properties:
subject: { type: string }
priority: { type: string, enum: [low, medium, high] }
- id: faq_lookup
name: FAQ Lookup
description: Cached LLM-backed FAQ answers
type: synthetic
system_prompt: Generate concise FAQ answers grounded in the tool contract.
cache_by_input: true
tools:
- id: answer_faq
name: AnswerFAQ
description: Answer common support questions.
parameters:
type: object
properties:
question: { type: string }
- id: policy_table
name: Policy Table
description: Predefined policy responses
type: deterministic
tools:
- id: get_refund_policy
name: GetRefundPolicy
description: Return the matching refund policy.
parameters:
type: object
properties:
policy_type: { type: string }
deterministic_outputs:
- input:
policy_type: standard
output:
policy: Standard 30-day refund policy
strategy_params:
multiturn_attributes:
- id: support_chat
min_turns: 2
max_turns: 4
role_instruction_messages:
USER: You are a customer contacting support.
ASSISTANT: You are a helpful support agent.
available_environments: [support_backend, faq_lookup, policy_table]
Complete Configuration Reference#
Top-Level Parameters#
strategy: The synthesis strategy to use (currently only GENERAL is supported)
num_samples: Number of synthetic samples to generate
output_path: Path where the generated dataset will be saved (must end with .jsonl)
strategy_params: Parameters specific to the synthesis strategy
inference_config: Configuration for the model used in generation
Strategy Parameters#
The strategy_params section defines the core synthesis logic:
Input Sources#
You can provide data from multiple sources:
input_data: Existing datasets to sample from
input_data:
- path: "hf:dataset_name" # HuggingFace dataset
hf_split: train
- path: "/path/to/local/data.jsonl" # Local file
attribute_map:
old_column_name: new_attribute_name
input_documents: Documents to segment and use in synthesis
input_documents:
- path: "/path/to/document.pdf"
id: my_doc
segmentation_params:
id: doc_segment
segment_length: 2048
segment_overlap: 200
input_examples: Inline examples for few-shot learning
input_examples:
- examples:
- attribute1: "value1"
attribute2: "value2"
- attribute1: "value3"
attribute2: "value4"
Attribute Types#
Sampled Attributes: Randomly selected values from predefined options
sampled_attributes:
- id: difficulty
name: Difficulty Level
description: How challenging the question should be
possible_values:
- id: easy
name: Easy
description: Simple, straightforward questions
sample_rate: 0.4 # 40% of samples
- id: medium
name: Medium
description: Moderately challenging questions
sample_rate: 0.4 # 40% of samples
- id: hard
name: Hard
description: Complex, advanced questions
# No sample_rate specified = 20% (remaining)
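One way to read the sample_rate semantics: explicit rates are used as-is, and values without a rate split the leftover probability mass. A sketch under that reading (the even split across unspecified values is an assumption, not confirmed oumi behavior):

```python
import random

values = [
    {"id": "easy", "sample_rate": 0.4},
    {"id": "medium", "sample_rate": 0.4},
    {"id": "hard"},  # no sample_rate: shares the remaining mass
]

explicit = sum(v["sample_rate"] for v in values if "sample_rate" in v)
unspecified = [v for v in values if "sample_rate" not in v]
# Split the leftover probability (here 0.2) evenly across unspecified values.
leftover = (1.0 - explicit) / len(unspecified)
weights = [v.get("sample_rate", leftover) for v in values]

picked = random.choices([v["id"] for v in values], weights=weights, k=10)
print(weights, picked)
```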
Generated Attributes: Created by LLM using instruction messages
generated_attributes:
- id: summary
instruction_messages:
- role: SYSTEM
content: "You are a helpful summarization assistant."
- role: USER
content: "Summarize this text: {input_text}. Format your result as 'Summary: <summary>'"
postprocessing_params:
id: clean_summary
cut_prefix: "Summary: "
strip_whitespace: true
Multi-Turn Attributes: Dynamic, variable-length conversations generated turn-by-turn
Unlike generated attributes which produce single values, multi-turn attributes generate full conversations where each turn is produced by the model with the complete conversation history as context. The system automatically plans the conversation, then generates each turn sequentially.
multiturn_attributes:
- id: support_conversation
min_turns: 4
max_turns: 12
role_instruction_messages:
USER: |
You are {customer_name}, a {customer_type.name} customer.
Your issue: {issue_detail}
Stay in character and respond naturally.
ASSISTANT: |
You are a helpful support agent.
Be professional and work toward resolving the issue.
output_system_prompt: |
You are a helpful support agent.
conversation_planner: |
Optional custom instructions for the conversation planner.
Key parameters:
id: Unique identifier for the attribute. The generated conversation is stored under this ID, and a conversation plan is automatically stored under {id}_plan.
min_turns / max_turns: Controls the conversation length range. Each turn is one message (user or assistant).
role_instruction_messages: Per-role instruction templates. Must define both USER and ASSISTANT roles. These templates can reference any previously defined attributes using {placeholder} syntax.
output_system_prompt: Optional system prompt prepended to the final output conversation.
conversation_planner: Optional custom instructions for the conversation planner that generates a turn-by-turn plan before the conversation begins.
The output is a conversation object with a list of messages:
{
"support_conversation": {
"messages": [
{"role": "system", "content": "You are a helpful support agent."},
{"role": "user", "content": "I need help with my order."},
{"role": "assistant", "content": "I'd be happy to help. What's your order number?"},
{"role": "user", "content": "It's BT-12345."},
{"role": "assistant", "content": "Let me look that up for you."}
]
},
"support_conversation_plan": "Turn 1: Customer opens with issue..."
}
Transformed Attributes: Rule-based transformations of existing attributes
transformed_attributes:
- id: conversation
transformation_strategy:
type: CHAT
chat_transform:
messages:
- role: USER
content: "{question}"
- role: ASSISTANT
content: "{answer}"
Advanced Features#
Combination Sampling: Control probability of specific attribute combinations
combination_sampling:
- combination:
difficulty: hard
topic: science
sample_rate: 0.1 # 10% of samples will have hard science questions
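A sketch of one plausible interpretation: with probability sample_rate a sample is forced to the pinned combination, and otherwise attributes are drawn independently (so the pinned combination can also occur by chance). This is an illustration, not oumi's exact algorithm:

```python
import random

combination_sampling = [
    {"combination": {"difficulty": "hard", "topic": "science"}, "sample_rate": 0.1}
]

def sample_attributes():
    # With probability sample_rate, emit the pinned combination outright;
    # otherwise fall back to independent per-attribute sampling.
    for rule in combination_sampling:
        if random.random() < rule["sample_rate"]:
            return dict(rule["combination"])
    return {
        "difficulty": random.choice(["easy", "medium", "hard"]),
        "topic": random.choice(["geography", "history", "science"]),
    }

samples = [sample_attributes() for _ in range(1000)]
pinned = sum(s == {"difficulty": "hard", "topic": "science"} for s in samples)
print(f"hard-science share: {pinned / len(samples):.2f}")
```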
Passthrough Attributes: Specify which attributes to include in final output
passthrough_attributes:
- question
- answer
- difficulty
- topic
Attribute Referencing#
In instruction messages and transformations, you can reference attributes using {attribute_id} syntax:
{attribute_id}: The value/name of the attribute
{attribute_id.description}: The description of a sampled attribute value
{attribute_id.parent}: The parent name of a sampled attribute
{attribute_id.parent.description}: The parent description of a sampled attribute
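The dotted references can be pictured as walking a small metadata tree. A hypothetical resolver (the data layout is illustrative, not oumi's internal representation):

```python
import re

# Illustrative metadata for one sampled attribute value; not oumi's
# internal representation.
attributes = {
    "difficulty": {
        "name": "Hard",
        "description": "Detailed knowledge for experts",
        "parent": {
            "name": "Difficulty Level",
            "description": "How challenging the question should be",
        },
    }
}

def resolve(template):
    def lookup(match):
        parts = match.group(1).split(".")
        node = attributes[parts[0]]
        for part in parts[1:]:
            node = node[part]
        # A bare {attribute_id} (or {...parent}) renders the value's name.
        return node["name"] if isinstance(node, dict) else node
    return re.sub(r"\{([\w.]+)\}", lookup, template)

print(resolve("Make a {difficulty} ({difficulty.description}) question."))
```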
Postprocessing#
Generated attributes can be postprocessed to clean up the output:
postprocessing_params:
id: cleaned_attribute
keep_original_text_attribute: true # Keep original alongside cleaned version
cut_prefix: "Answer: " # Remove this prefix and everything before it
cut_suffix: "\n\n" # Remove this suffix and everything after it
regex: "\\*\\*(.+?)\\*\\*" # Extract content between ** **
strip_whitespace: true # Remove leading/trailing whitespace
added_prefix: "Response: " # Add this prefix
added_suffix: "." # Add this suffix
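The options above can be pictured as a pipeline over the raw model output. A sketch assuming a cut/extract/strip/add order (the actual ordering inside oumi may differ):

```python
import re

params = {
    "cut_prefix": "Answer: ",
    "cut_suffix": "\n\n",
    "regex": r"\*\*(.+?)\*\*",
    "strip_whitespace": True,
    "added_prefix": "Response: ",
    "added_suffix": ".",
}

def postprocess(text, p):
    # Cut the prefix and everything before it.
    if p.get("cut_prefix") and p["cut_prefix"] in text:
        text = text.split(p["cut_prefix"], 1)[1]
    # Cut the suffix and everything after it.
    if p.get("cut_suffix") and p["cut_suffix"] in text:
        text = text.split(p["cut_suffix"], 1)[0]
    # Keep only the first regex capture group, if it matches.
    if p.get("regex"):
        match = re.search(p["regex"], text)
        if match:
            text = match.group(1)
    if p.get("strip_whitespace"):
        text = text.strip()
    return p.get("added_prefix", "") + text + p.get("added_suffix", "")

print(postprocess("Preamble Answer: **blue**\n\nfootnote", params))
```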
Transformation Strategies#
For the following examples, let’s assume we have a data sample with the following values.
{
"question": "What color is the sky?",
"answer": "The sky is blue."
}
String Transformation#
transformed_attributes:
- id: example_string_attribute
transformation_strategy:
type: STRING
string_transform: "Question: {question}\nAnswer: {answer}"
Example Result:
{
"example_string_attribute": "Question: What color is the sky?\nAnswer: The sky is blue."
}
List Transformation#
transformed_attributes:
- id: example_list_attribute
transformation_strategy:
type: LIST
list_transform:
- "{question}"
- "{answer}"
Example Result:
{
"example_list_attribute": [
"What color is the sky?",
"The sky is blue."
]
}
Dictionary Transformation#
transformed_attributes:
- id: example_dict_attribute
transformation_strategy:
type: DICT
dict_transform:
question: "{question}"
answer: "{answer}"
Example Result:
{
"example_dict_attribute": {
"question": "What color is the sky?",
"answer": "The sky is blue."
}
}
Chat Transformation#
transformed_attributes:
- id: string_attribute
transformation_strategy:
type: CHAT
chat_transform:
messages:
- role: USER
content: "{question}"
- role: ASSISTANT
content: "{answer}"
Document Segmentation#
When using documents, you can segment them for processing:
input_documents:
- path: "/path/to/document.pdf"
id: research_paper
segmentation_params:
id: paper_segment
segmentation_strategy: TOKENS
tokenizer: "openai-community/gpt2"
segment_length: 1024
segment_overlap: 128
keep_original_text: true
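Overlapping segmentation is a sliding window: each segment starts segment_length - segment_overlap tokens after the previous one, so neighboring segments share context. A sketch using integer stand-ins for tokens:

```python
def segment(tokens, segment_length, segment_overlap):
    # Slide a window of segment_length tokens, stepping by
    # segment_length - segment_overlap so neighbors share context.
    step = segment_length - segment_overlap
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + segment_length])
        if start + segment_length >= len(tokens):
            break
    return segments

tokens = list(range(3000))  # stand-in for real tokenizer output
segs = segment(tokens, segment_length=1024, segment_overlap=128)
print(len(segs), [len(s) for s in segs])
```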
Inference Configuration#
Configure the model and generation parameters:
inference_config:
model:
model_name: "claude-3-5-sonnet-20240620"
engine: ANTHROPIC
generation:
max_new_tokens: 1024
temperature: 0.7
top_p: 0.9
remote_params:
num_workers: 5
politeness_policy: 60 # Delay between requests in seconds
Supported Engines#
ANTHROPIC: Claude models (requires API key)
OPENAI: OpenAI models (requires API key)
VLLM: Local vLLM inference server
NATIVE_TEXT: Local HuggingFace transformers
And many more (see Inference Engines)
Command Line Options#
The oumi synth command supports these options:
--config, -c: Path to synthesis configuration file (required)
--level: Set logging level (DEBUG, INFO, WARNING, ERROR)
You can also use CLI overrides to modify configuration parameters:
oumi synth -c config.yaml \
--num_samples 50 \
--inference_config.generation.temperature 0.5 \
--strategy_params.sampled_attributes[0].possible_values[0].sample_rate 0.8
Output Format#
The synthesized dataset is saved as a JSONL file where each line contains a JSON object with the attributes in the config:
{"difficulty": "easy", "topic": "geography", "question": "What is the capital of France?", "answer": "Paris"}
{"difficulty": "medium", "topic": "history", "question": "When did World War II end?", "answer": "World War II ended in 1945"}
After synthesis completes, you’ll see a preview table and instructions on how to use the generated dataset for training:
Successfully synthesized 100 samples and saved to synthetic_qa_dataset.jsonl
To train a model, run: oumi train -c path/to/your/train/config.yaml
If you included a 'conversation' chat attribute in your config, update the
config to use your new dataset:
data:
train:
datasets:
- dataset_name: "text_sft_jsonl"
dataset_path: "synthetic_qa_dataset.jsonl"
Batch Inference#
When your synthesis provider supports batch inference (OpenAI, Anthropic, Together, Fireworks, Parasail — see Inference Engines), you can submit all prompts for a single attribute as a batch job rather than calling the API online:
from oumi.core.synthesis.attribute_synthesizer import AttributeSynthesizer
synth = AttributeSynthesizer(config)
# Submit a batch job for one generated attribute
batch_id = synth.synthesize_batch(samples, generated_attribute)
# Later, retrieve results
results = synth.get_batch_results(batch_id, samples, generated_attribute)
# Or tolerate per-row failures:
partial = synth.get_batch_results_partial(batch_id, samples, generated_attribute)
Batches are typically 50% cheaper than online inference at the cost of a 24-hour completion window. Attributes are batched one at a time (not across attributes), so chained generated_attributes still run sequentially.
Token Usage Tracking#
AttributeSynthesizer accumulates token usage across every online and batch call:
print(synth.total_input_tokens) # prompt_tokens across all calls
print(synth.total_output_tokens) # completion_tokens
print(synth.total_cached_tokens) # prompt tokens served from provider cache
Use these counters for cost reporting across an entire synthesis run. See also Token Usage Tracking on the inference engine side.
Best Practices#
Start Small: Begin with a small num_samples to test your configuration
Use Examples: Provide good examples in input_examples for better generation quality
Postprocess Outputs: Use postprocessing to clean and format generated text
Monitor Costs: Be aware of API costs when using commercial models
Validate Results: Review generated samples before using for training
Version Control: Keep your synthesis configs in version control
Common Use Cases#
Question-Answer Generation#
Generate QA pairs from documents or contexts for training conversational models.
Example: See oumi-ai/oumi for a complete geography Q&A generation example.
Data Augmentation#
Create variations of existing datasets by sampling different attributes and regenerating content.
Example: See oumi-ai/oumi for an example that augments existing datasets with different styles and complexity levels.
Instruction Following#
Generate instruction-response pairs with varying complexity and domains.
Example: See oumi-ai/oumi for a multi-domain instruction generation example covering writing, coding, analysis, and more.
Conversation Synthesis#
Create multi-turn conversations by chaining generated responses.
Example: See oumi-ai/oumi for a customer support conversation generation example using chained generated attributes.
Multi-Turn Conversation Synthesis#
Generate dynamic, variable-length conversations using multiturn_attributes. The system plans the conversation, then generates each turn with full conversation context, producing natural back-and-forth dialogue.
Example: See oumi-ai/oumi for a customer support conversation example using multi-turn attributes with conversation planning, role-based instructions, and variable-length conversations (4-12 turns).
Domain Adaptation#
Generate domain-specific training data by conditioning on domain attributes.
Example: See oumi-ai/oumi for a medical domain Q&A generation example with specialty-specific content.
Troubleshooting#
Empty results: Check that your instruction messages are well-formed and you have proper API access.
Slow generation: Increase num_workers or lower politeness_policy to improve throughput.
Out of memory: Use a smaller model or reduce max_new_tokens in generation config.
Validation errors: Ensure all attribute IDs are unique and required fields are not empty.
For more help, see the FAQ or report issues at https://github.com/oumi-ai/oumi/issues.