Data Synthesis#
The oumi synth command enables you to generate synthetic datasets using large language models. Instead of manually creating training data, you can define rules and templates that automatically generate diverse, high-quality examples.
What You Can Build#
Question-Answer datasets for training chatbots
Instruction-following datasets with varied complexity levels
Domain-specific training data (legal, medical, technical)
Conversation datasets with different personas or styles
Data augmentation to expand existing small datasets
How It Works#
The synthesis process follows three steps:
Define attributes - What varies in your data (topic, difficulty, style, etc.)
Create templates - How the AI should generate content using those attributes
Generate samples - The system creates many examples by combining different attribute values
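Conceptually, the loop is tiny. Here is a minimal Python sketch of the three steps; the attribute names and template are illustrative, and a real run would send each prompt to an LLM:

```python
import json
import random

# Step 1: define attributes (what varies across samples).
attributes = {
    "topic": ["geography", "history"],
    "difficulty": ["easy", "hard"],
}

# Step 2: create a template (how each sample's prompt is built).
template = "Create a {difficulty} {topic} quiz question."

# Step 3: generate samples by combining attribute values.
# A real run would send each prompt to an LLM; here we only build prompts.
samples = []
for _ in range(4):
    combo = {name: random.choice(values) for name, values in attributes.items()}
    samples.append({**combo, "prompt": template.format(**combo)})

for sample in samples:
    print(json.dumps(sample))
```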
Your First Synthesis#
Let’s create a simple question-answer dataset. Save this as my_first_synth.yaml:
# Generate 10 geography questions
strategy: GENERAL
num_samples: 10
output_path: geography_qa.jsonl
strategy_params:
# Give the AI an example to learn from
input_examples:
- examples:
- example_question: "What is the capital of France?"
# Define what should vary across examples
sampled_attributes:
- id: difficulty
name: Difficulty Level
description: How challenging the question should be
possible_values:
- id: easy
name: Easy
description: Basic facts everyone should know
- id: hard
name: Hard
description: Detailed knowledge for experts
# Tell the AI how to generate questions and answers
generated_attributes:
- id: question
instruction_messages:
- role: SYSTEM
content: "You are a geography teacher creating quiz questions. Example: {example_question}"
- role: USER
content: "Create a {difficulty} geography question. Write the question only, not the answer."
- id: answer
instruction_messages:
- role: SYSTEM
content: "You are a helpful AI assistant."
- role: USER
content: "{question}"
# Configure which AI model to use
inference_config:
model:
model_name: claude-3-5-sonnet-20240620
engine: ANTHROPIC
Run it with:
oumi synth -c my_first_synth.yaml
What happens: The system will create 10 geography questions, some easy and some hard, saved to geography_qa.jsonl.
Understanding the Results#
After running synthesis, you’ll see:
A preview table showing the first few generated samples
The total number of samples created
Instructions for using the dataset in training
Each line in the output file contains one example:
{"difficulty": "easy", "question": "What is the largest continent?", "answer": "Asia"}
{"difficulty": "hard", "question": "Which country has the most time zones?", "answer": "France"}
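Because the output is plain JSONL, it can be loaded with nothing but the standard library. A small sketch, using the two example lines above as inline data:

```python
import json

# Two lines in the shape oumi synth writes to geography_qa.jsonl.
raw = """{"difficulty": "easy", "question": "What is the largest continent?", "answer": "Asia"}
{"difficulty": "hard", "question": "Which country has the most time zones?", "answer": "France"}"""

# One JSON object per line; parse each line independently.
rows = [json.loads(line) for line in raw.splitlines()]
easy = [r for r in rows if r["difficulty"] == "easy"]
print(f"{len(rows)} samples, {len(easy)} easy")
```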
Next Steps: Building More Complex Datasets#
Once you’re comfortable with the basics, you can create more sophisticated datasets:
Adding Multiple Attributes#
Mix and match different properties (topic + difficulty + style):
sampled_attributes:
- id: topic
possible_values: [{id: geography}, {id: history}, {id: science}]
- id: difficulty
possible_values: [{id: easy}, {id: medium}, {id: hard}]
- id: style
possible_values: [{id: formal}, {id: casual}, {id: academic}]
Using Your Own Data#
Feed in existing datasets or documents:
input_data:
- path: "my_existing_data.jsonl"
input_documents:
- path: "textbook.pdf"
Supported dataset formats (input_data): JSONL, JSON, CSV, TSV, Parquet, and XLSX. For XLSX files, every sheet is concatenated into a single dataset, so you can keep related tabs in one workbook. Globs are supported:
input_data:
- path: "data/**/*.xlsx"
Supported document formats (input_documents): .pdf, .txt, .md, .html, and .docx. DOCX files are parsed paragraph-by-paragraph.
Note
XLSX / DOCX parsing requires the synthesis extras: pip install oumi[synthesis].
Few-Shot Sampling From Sources#
When you want each synthesized sample to see multiple randomly drawn items from a source (examples, datasets, or documents), use num_shots. This turns the source into a dynamic few-shot pool instead of round-robin enumeration.
input_examples:
- id: few_shot_examples
num_shots: 3 # draw 3 examples per synthesis sample
examples:
- task_type: "summarization"
example_input: "..."
- task_type: "translation"
example_input: "..."
# ...
generated_attributes:
- id: instruction
instruction_messages:
- role: USER
content: |
Example 1: {few_shot_examples[0].example_input}
Example 2: {few_shot_examples[1].example_input}
Example 3: {few_shot_examples[2].example_input}
Now produce a new, different example.
Rules:
num_shots: None or 1 → the source behaves as before (round-robin); reference fields as {id.field}.
num_shots > 1 → bracket notation {id[i].field} is required, and id must be set.
Works uniformly across input_examples, input_data, and input_documents.
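Under the hood, drawing shots is essentially sampling without replacement. A hedged sketch (the pool contents are illustrative, and the exact sampling behavior may differ from oumi's implementation):

```python
import random

# Illustrative few-shot pool; contents stand in for the YAML examples.
pool = [
    {"task_type": "summarization", "example_input": "Summarize this article ..."},
    {"task_type": "translation", "example_input": "Translate this sentence ..."},
    {"task_type": "classification", "example_input": "Label this review ..."},
    {"task_type": "rewriting", "example_input": "Rewrite this paragraph ..."},
]

num_shots = 3
# Draw num_shots distinct examples for one synthesis sample.
shots = random.sample(pool, num_shots)

# Bracket references like {few_shot_examples[0].example_input}
# resolve against the drawn shots, index by index.
prompt = "\n".join(
    f"Example {i + 1}: {shot['example_input']}" for i, shot in enumerate(shots)
)
print(prompt)
```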
A runnable example lives at oumi-ai/oumi.
Creating Conversations#
Build multi-turn dialogues with fixed structure using transformed attributes:
transformed_attributes:
- id: conversation
transformation_strategy:
type: CHAT
chat_transform:
messages:
- role: USER
content: "{question}"
- role: ASSISTANT
content: "{answer}"
Creating Multi-Turn Conversations#
For dynamic, variable-length conversations, use multiturn_attributes. Each turn is generated by the model with full conversation context, producing natural back-and-forth dialogue:
multiturn_attributes:
- id: support_conversation
min_turns: 4
max_turns: 12
role_instruction_messages:
USER: |
You are a customer contacting support.
Your issue: {issue_detail}
ASSISTANT: |
You are a helpful support agent.
Be professional and resolve the customer's issue.
output_system_prompt: |
You are a helpful support agent.
Ready to dive deeper? The sections below cover all available options in detail.
Environment-First Tool Synthesis#
Agentic synthesis now follows an environment-first model. Tools do not declare an output strategy directly. Instead, each tool is bound to an environment, and the environment type defines how tool calls are executed via its step() method.
synthetic environments are backed by an LLM that simulates tool execution. They can be stateless (no persistent state) or stateful (mutable JSON state across turns). Statefulness is controlled by the optional state_params field: when provided, the environment tracks and mutates state across calls; when absent, each call is independent.
deterministic environments behave like lookup tables. Each tool defines a set of input-to-output mappings, and step() resolves tool calls by matching arguments against those mappings. No LLM is involved.
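As a rough illustration of the two execution models (this is not the oumi API; the function names and data shapes are hypothetical), step() dispatch could look like:

```python
# Hypothetical sketch of environment step() dispatch; names and data
# shapes are illustrative and are not the oumi API.
def step(environment, tool_name, arguments, llm_simulate):
    if environment["type"] == "deterministic":
        # Lookup table: match the call's arguments against declared mappings.
        tool = next(t for t in environment["tools"] if t["id"] == tool_name)
        for mapping in tool["deterministic_outputs"]:
            if mapping["input"] == arguments:
                return mapping["output"]
        raise KeyError(f"no deterministic mapping for {arguments}")
    # synthetic: an LLM simulates execution, optionally against mutable state.
    state = environment.get("state")  # present only for stateful environments
    return llm_simulate(tool_name, arguments, state)

env = {
    "type": "deterministic",
    "tools": [{
        "id": "get_refund_policy",
        "deterministic_outputs": [
            {"input": {"policy_type": "standard"},
             "output": {"policy": "Standard 30-day refund policy"}},
        ],
    }],
}
print(step(env, "get_refund_policy", {"policy_type": "standard"}, None))
```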
At the config level:
Environments own their tool definitions.
Reusable environment catalogs live in top-level environment_config or environment_config_path.
Tools do not declare an environment field. The parent environment owns the binding.
deterministic_outputs is only used for tools in deterministic environments.
read_only is only meaningful for tools in stateful synthetic environments.
Multiturn attributes reference environments (not individual tools) to select which tools are available.
Example:
environment_config:
environments:
- id: support_backend
name: Support Backend
description: Simulated support system with tickets and users
type: synthetic
system_prompt: You manage a customer support system with tickets and users.
state_params:
state_schema:
type: object
properties:
tickets: { type: array }
users: { type: array }
initial_state:
tickets: []
users: []
tools:
- id: get_ticket
name: GetTicket
description: Read a ticket from the support backend.
read_only: true
parameters:
type: object
properties:
ticket_id: { type: string }
- id: create_ticket
name: CreateTicket
description: Create a new support ticket.
read_only: false
parameters:
type: object
properties:
subject: { type: string }
priority: { type: string, enum: [low, medium, high] }
- id: faq_lookup
name: FAQ Lookup
description: Cached LLM-backed FAQ answers
type: synthetic
system_prompt: Generate concise FAQ answers grounded in the tool contract.
cache_by_input: true
tools:
- id: answer_faq
name: AnswerFAQ
description: Answer common support questions.
parameters:
type: object
properties:
question: { type: string }
- id: policy_table
name: Policy Table
description: Predefined policy responses
type: deterministic
tools:
- id: get_refund_policy
name: GetRefundPolicy
description: Return the matching refund policy.
parameters:
type: object
properties:
policy_type: { type: string }
deterministic_outputs:
- input:
policy_type: standard
output:
policy: Standard 30-day refund policy
strategy_params:
multiturn_attributes:
- id: support_chat
min_turns: 2
max_turns: 4
role_instruction_messages:
USER: You are a customer contacting support.
ASSISTANT: You are a helpful support agent.
available_environments: [support_backend, faq_lookup, policy_table]
Complete Configuration Reference#
Top-Level Parameters#
strategy: The synthesis strategy to use (currently only GENERAL is supported)
num_samples: Number of synthetic samples to generate
output_path: Path where the generated dataset will be saved (must end with .jsonl)
strategy_params: Parameters specific to the synthesis strategy
inference_config: Configuration for the model used in generation
Strategy Parameters#
The strategy_params section defines the core synthesis logic:
Input Sources#
You can provide data from multiple sources:
input_data: Existing datasets to sample from
input_data:
- path: "hf:dataset_name" # HuggingFace dataset
hf_split: train
- path: "/path/to/local/data.jsonl" # Local file
attribute_map:
old_column_name: new_attribute_name
input_documents: Documents to segment and use in synthesis
input_documents:
- path: "/path/to/document.pdf"
id: my_doc
segmentation_params:
id: doc_segment
segment_length: 2048
segment_overlap: 200
input_examples: Inline examples for few-shot learning
input_examples:
- examples:
- attribute1: "value1"
attribute2: "value2"
- attribute1: "value3"
attribute2: "value4"
Attribute Types#
Sampled Attributes: Randomly selected values from predefined options
sampled_attributes:
- id: difficulty
name: Difficulty Level
description: How challenging the question should be
possible_values:
- id: easy
name: Easy
description: Simple, straightforward questions
sample_rate: 0.4 # 40% of samples
- id: medium
name: Medium
description: Moderately challenging questions
sample_rate: 0.4 # 40% of samples
- id: hard
name: Hard
description: Complex, advanced questions
# No sample_rate specified = 20% (remaining)
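One way to read the sample_rate semantics: explicit rates are used as-is, and values without a rate split the leftover probability mass. A sketch under that reading (the even split across unspecified values is an assumption, not confirmed oumi behavior):

```python
import random

values = [
    {"id": "easy", "sample_rate": 0.4},
    {"id": "medium", "sample_rate": 0.4},
    {"id": "hard"},  # no sample_rate: shares the remaining mass
]

explicit = sum(v["sample_rate"] for v in values if "sample_rate" in v)
unspecified = [v for v in values if "sample_rate" not in v]
# Split the leftover probability (here 0.2) evenly across unspecified values.
leftover = (1.0 - explicit) / len(unspecified)
weights = [v.get("sample_rate", leftover) for v in values]

picked = random.choices([v["id"] for v in values], weights=weights, k=10)
print(weights, picked)
```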
Generated Attributes: Created by LLM using instruction messages
generated_attributes:
- id: summary
instruction_messages:
- role: SYSTEM
content: "You are a helpful summarization assistant."
- role: USER
content: "Summarize this text: {input_text}. Format your result as 'Summary: <summary>'"
postprocessing_params:
id: clean_summary
cut_prefix: "Summary: "
strip_whitespace: true
Multi-Turn Attributes: Dynamic, variable-length conversations generated turn-by-turn
Unlike generated attributes which produce single values, multi-turn attributes generate full conversations where each turn is produced by the model with the complete conversation history as context. The system automatically plans the conversation, then generates each turn sequentially.
multiturn_attributes:
- id: support_conversation
min_turns: 4
max_turns: 12
role_instruction_messages:
USER: |
You are {customer_name}, a {customer_type.name} customer.
Your issue: {issue_detail}
Stay in character and respond naturally.
ASSISTANT: |
You are a helpful support agent.
Be professional and work toward resolving the issue.
output_system_prompt: |
You are a helpful support agent.
conversation_planner: |
Optional custom instructions for the conversation planner.
Key parameters:
id: Unique identifier for the attribute. The generated conversation is stored under this ID, and a conversation plan is automatically stored under {id}_plan.
min_turns / max_turns: Controls the conversation length range. Each turn is one message (user or assistant).
role_instruction_messages: Per-role instruction templates. Must define both USER and ASSISTANT roles. These templates can reference any previously defined attributes using {placeholder} syntax.
output_system_prompt: Optional system prompt prepended to the final output conversation.
conversation_planner: Optional custom instructions for the conversation planner that generates a turn-by-turn plan before the conversation begins.
The output is a conversation object with a list of messages:
{
"support_conversation": {
"messages": [
{"role": "system", "content": "You are a helpful support agent."},
{"role": "user", "content": "I need help with my order."},
{"role": "assistant", "content": "I'd be happy to help. What's your order number?"},
{"role": "user", "content": "It's BT-12345."},
{"role": "assistant", "content": "Let me look that up for you."}
]
},
"support_conversation_plan": "Turn 1: Customer opens with issue..."
}
Transformed Attributes: Rule-based transformations of existing attributes
transformed_attributes:
- id: conversation
transformation_strategy:
type: CHAT
chat_transform:
messages:
- role: USER
content: "{question}"
- role: ASSISTANT
content: "{answer}"
Advanced Features#
Combination Sampling: Control probability of specific attribute combinations
combination_sampling:
- combination:
difficulty: hard
topic: science
sample_rate: 0.1 # 10% of samples will have hard science questions
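A sketch of one plausible interpretation: with probability sample_rate a sample is forced to the pinned combination, and otherwise attributes are drawn independently (so the pinned combination can also occur by chance). This is an illustration, not oumi's exact algorithm:

```python
import random

combination_sampling = [
    {"combination": {"difficulty": "hard", "topic": "science"}, "sample_rate": 0.1}
]

def sample_attributes():
    # With probability sample_rate, emit the pinned combination outright;
    # otherwise fall back to independent per-attribute sampling.
    for rule in combination_sampling:
        if random.random() < rule["sample_rate"]:
            return dict(rule["combination"])
    return {
        "difficulty": random.choice(["easy", "medium", "hard"]),
        "topic": random.choice(["geography", "history", "science"]),
    }

samples = [sample_attributes() for _ in range(1000)]
pinned = sum(s == {"difficulty": "hard", "topic": "science"} for s in samples)
print(f"hard-science share: {pinned / len(samples):.2f}")
```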
Passthrough Attributes: Specify which attributes to include in final output
passthrough_attributes:
- question
- answer
- difficulty
- topic
Attribute Referencing#
In instruction messages and transformations, you can reference attributes using {attribute_id} syntax:
{attribute_id}: The value/name of the attribute
{attribute_id.description}: The description of a sampled attribute value
{attribute_id.parent}: The parent name of a sampled attribute
{attribute_id.parent.description}: The parent description of a sampled attribute
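The dotted references can be pictured as walking a small metadata tree. A hypothetical resolver (the data layout is illustrative, not oumi's internal representation):

```python
import re

# Illustrative metadata for one sampled attribute value; not oumi's
# internal representation.
attributes = {
    "difficulty": {
        "name": "Hard",
        "description": "Detailed knowledge for experts",
        "parent": {
            "name": "Difficulty Level",
            "description": "How challenging the question should be",
        },
    }
}

def resolve(template):
    def lookup(match):
        parts = match.group(1).split(".")
        node = attributes[parts[0]]
        for part in parts[1:]:
            node = node[part]
        # A bare {attribute_id} (or {...parent}) renders the value's name.
        return node["name"] if isinstance(node, dict) else node
    return re.sub(r"\{([\w.]+)\}", lookup, template)

print(resolve("Make a {difficulty} ({difficulty.description}) question."))
```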
Postprocessing#
Generated attributes can be postprocessed to clean up the output:
postprocessing_params:
id: cleaned_attribute
keep_original_text_attribute: true # Keep original alongside cleaned version
cut_prefix: "Answer: " # Remove this prefix and everything before it
cut_suffix: "\n\n" # Remove this suffix and everything after it
regex: "\\*\\*(.+?)\\*\\*" # Extract content between ** **
strip_whitespace: true # Remove leading/trailing whitespace
added_prefix: "Response: " # Add this prefix
added_suffix: "." # Add this suffix
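The options above can be pictured as a pipeline over the raw model output. A sketch assuming a cut/extract/strip/add order (the actual ordering inside oumi may differ):

```python
import re

params = {
    "cut_prefix": "Answer: ",
    "cut_suffix": "\n\n",
    "regex": r"\*\*(.+?)\*\*",
    "strip_whitespace": True,
    "added_prefix": "Response: ",
    "added_suffix": ".",
}

def postprocess(text, p):
    # Cut the prefix and everything before it.
    if p.get("cut_prefix") and p["cut_prefix"] in text:
        text = text.split(p["cut_prefix"], 1)[1]
    # Cut the suffix and everything after it.
    if p.get("cut_suffix") and p["cut_suffix"] in text:
        text = text.split(p["cut_suffix"], 1)[0]
    # Keep only the first regex capture group, if it matches.
    if p.get("regex"):
        match = re.search(p["regex"], text)
        if match:
            text = match.group(1)
    if p.get("strip_whitespace"):
        text = text.strip()
    return p.get("added_prefix", "") + text + p.get("added_suffix", "")

print(postprocess("Preamble Answer: **blue**\n\nfootnote", params))
```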
Transformation Strategies#
For the following examples, let’s assume we have a data sample with the following values.
{
"question": "What color is the sky?",
"answer": "The sky is blue."
}
String Transformation#
transformed_attributes:
- id: example_string_attribute
transformation_strategy:
type: STRING
string_transform: "Question: {question}\nAnswer: {answer}"
Example Result:
{
"example_string_attribute": "Question: What color is the sky?\nAnswer: The sky is blue."
}
List Transformation#
transformed_attributes:
- id: example_list_attribute
transformation_strategy:
type: LIST
list_transform:
- "{question}"
- "{answer}"
Example Result:
{
"example_list_attribute": [
"What color is the sky?",
"The sky is blue."
]
}
Dictionary Transformation#
transformed_attributes:
- id: example_dict_attribute
transformation_strategy:
type: DICT
dict_transform:
question: "{question}"
answer: "{answer}"
Example Result:
{
"example_dict_attribute": {
"question": "What color is the sky?",
"answer": "The sky is blue."
}
}
Chat Transformation#
transformed_attributes:
- id: string_attribute
transformation_strategy:
type: CHAT
chat_transform:
messages:
- role: USER
content: "{question}"
- role: ASSISTANT
content: "{answer}"
Document Segmentation#
When using documents, you can segment them for processing:
input_documents:
- path: "/path/to/document.pdf"
id: research_paper
segmentation_params:
id: paper_segment
segmentation_strategy: TOKENS
tokenizer: "openai-community/gpt2"
segment_length: 1024
segment_overlap: 128
keep_original_text: true
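Overlapping segmentation is a sliding window: each segment starts segment_length - segment_overlap tokens after the previous one, so neighboring segments share context. A sketch using integer stand-ins for tokens:

```python
def segment(tokens, segment_length, segment_overlap):
    # Slide a window of segment_length tokens, stepping by
    # segment_length - segment_overlap so neighbors share context.
    step = segment_length - segment_overlap
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + segment_length])
        if start + segment_length >= len(tokens):
            break
    return segments

tokens = list(range(3000))  # stand-in for real tokenizer output
segs = segment(tokens, segment_length=1024, segment_overlap=128)
print(len(segs), [len(s) for s in segs])
```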
Inference Configuration#
Configure the model and generation parameters:
inference_config:
model:
model_name: "claude-3-5-sonnet-20240620"
engine: ANTHROPIC
generation:
max_new_tokens: 1024
temperature: 0.7
top_p: 0.9
remote_params:
num_workers: 5
politeness_policy: 60 # Delay between requests in seconds
Supported Engines#
ANTHROPIC: Claude models (requires API key)
OPENAI: OpenAI models (requires API key)
VLLM: Local vLLM inference server
NATIVE_TEXT: Local HuggingFace transformers
And many more (see Inference Engines)
Command Line Options#
The oumi synth command supports these options:
--config, -c: Path to synthesis configuration file (required)
--level: Set logging level (DEBUG, INFO, WARNING, ERROR)
You can also use CLI overrides to modify configuration parameters:
oumi synth -c config.yaml \
--num_samples 50 \
--inference_config.generation.temperature 0.5 \
--strategy_params.sampled_attributes[0].possible_values[0].sample_rate 0.8
Output Format#
The synthesized dataset is saved as a JSONL file where each line contains a JSON object with the attributes in the config:
{"difficulty": "easy", "topic": "geography", "question": "What is the capital of France?", "answer": "Paris"}
{"difficulty": "medium", "topic": "history", "question": "When did World War II end?", "answer": "World War II ended in 1945"}
After synthesis completes, you’ll see a preview table and instructions on how to use the generated dataset for training:
Successfully synthesized 100 samples and saved to synthetic_qa_dataset.jsonl
To train a model, run: oumi train -c path/to/your/train/config.yaml
If you included a 'conversation' chat attribute in your config, update the
config to use your new dataset:
data:
train:
datasets:
- dataset_name: "text_sft_jsonl"
dataset_path: "synthetic_qa_dataset.jsonl"
Batch Inference#
When your synthesis provider supports batch inference (OpenAI, Anthropic, Together, Fireworks, Parasail — see Inference Engines), you can submit all prompts for a single attribute as a batch job rather than calling the API online:
from oumi.core.synthesis.attribute_synthesizer import AttributeSynthesizer
synth = AttributeSynthesizer(config)
# Submit a batch job for one generated attribute
batch_id = synth.synthesize_batch(samples, generated_attribute)
# Later, retrieve results
results = synth.get_batch_results(batch_id, samples, generated_attribute)
# Or tolerate per-row failures:
partial = synth.get_batch_results_partial(batch_id, samples, generated_attribute)
Batches are typically 50% cheaper than online inference at the cost of a 24-hour completion window. Attributes are batched one at a time (not across attributes), so chained generated_attributes still run sequentially.
Token Usage Tracking#
AttributeSynthesizer accumulates token usage across every online and batch call:
print(synth.total_input_tokens) # prompt_tokens across all calls
print(synth.total_output_tokens) # completion_tokens
print(synth.total_cached_tokens) # prompt tokens served from provider cache
Use these counters for cost reporting across an entire synthesis run. See also Token Usage Tracking on the inference engine side.
Best Practices#
Start Small: Begin with a small num_samples to test your configuration
Use Examples: Provide good examples in input_examples for better generation quality
Postprocess Outputs: Use postprocessing to clean and format generated text
Monitor Costs: Be aware of API costs when using commercial models
Validate Results: Review generated samples before using for training
Version Control: Keep your synthesis configs in version control
Common Use Cases#
Question-Answer Generation#
Generate QA pairs from documents or contexts for training conversational models.
Example: See oumi-ai/oumi for a complete geography Q&A generation example.
Data Augmentation#
Create variations of existing datasets by sampling different attributes and regenerating content.
Example: See oumi-ai/oumi for an example that augments existing datasets with different styles and complexity levels.
Instruction Following#
Generate instruction-response pairs with varying complexity and domains.
Example: See oumi-ai/oumi for a multi-domain instruction generation example covering writing, coding, analysis, and more.
Conversation Synthesis#
Create multi-turn conversations by chaining generated responses.
Example: See oumi-ai/oumi for a customer support conversation generation example using chained generated attributes.
Multi-Turn Conversation Synthesis#
Generate dynamic, variable-length conversations using multiturn_attributes. The system plans the conversation, then generates each turn with full conversation context, producing natural back-and-forth dialogue.
Example: See oumi-ai/oumi for a customer support conversation example using multi-turn attributes with conversation planning, role-based instructions, and variable-length conversations (4-12 turns).
Domain Adaptation#
Generate domain-specific training data by conditioning on domain attributes.
Example: See oumi-ai/oumi for a medical domain Q&A generation example with specialty-specific content.
Troubleshooting#
Empty results: Check that your instruction messages are well-formed and you have proper API access.
Slow generation: Increase num_workers or lower politeness_policy to improve throughput.
Out of memory: Use a smaller model or reduce max_new_tokens in generation config.
Validation errors: Ensure all attribute IDs are unique and required fields are not empty.
For more help, see the FAQ or report issues at https://github.com/oumi-ai/oumi/issues.