{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Distillation Overview\n",
    "\n",
    "In this tutorial, we'll fine-tune a small language model (SLM) from the outputs of a large language model (LLM).\n",
    "\n",
    "We'll use the Oumi framework to streamline the process and achieve high-quality results.\n",
    "\n",
    "We'll cover the following topics:\n",
    "1. Prerequisites\n",
    "2. Data Preparation & Sanity Checks\n",
    "3. Training Config Preparation\n",
    "4. Launching Training\n",
    "5. Monitoring Progress\n",
    "6. Evaluation\n",
    "7. Analyzing Results\n",
    "8. Inference\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prerequisites\n",
    "\n",
    "## Hardware\n",
    "The defaults in this tutorial are scaled down for demonstration purposes.\n",
    "\n",
    "The true values are left to code comments within each section.\n",
    "\n",
    "We recommend 8xA100-80GB GPUs to complete in a timely manner with adequate performance.\n",
    "\n",
    "## Oumi Installation\n",
    "\n",
    "First, let's install Oumi and vLLM. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html). Here, we include Oumi's GPU dependencies.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install oumi[gpu]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating our working directory\n",
    "For our experiments, we'll use the following folder to save the model, training artifacts, and our working configs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "tutorial_dir = \"distillation_tutorial\"\n",
    "\n",
    "Path(tutorial_dir).mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup the environment\n",
    "\n",
    "We'll need to set the following environment variables:\n",
    "- [Optional] HF_TOKEN: Your [HuggingFace](https://huggingface.co/docs/hub/en/security-tokens) token, in case you want to access a private model.\n",
    "- [Optional] WANDB_API_KEY: Your [wandb](https://wandb.ai) token, in case you want to log your experiments to wandb."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Download\n",
    "\n",
    "For our purposes it will be much faster if we download our models first.\n",
    "\n",
    "We'll use the `hf_transfer` package to download."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install hf_transfer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!HF_HUB_ENABLE_HF_TRANSFER=1 \\\n",
    "    hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \\\n",
    "    --exclude original/*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!HF_HUB_ENABLE_HF_TRANSFER=1 \\\n",
    "    hf download deepseek-ai/DeepSeek-R1-Distill-Llama-70B \\\n",
    "    --exclude original/*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline Evals\n",
    "\n",
    "Before we can improve our small model, we should measure how well it performs on a benchmark compared to the larger model.\n",
    "\n",
    "The below code will run the MMLU PRO Math task from LM Harness. \n",
    "\n",
    "Note that this will take some time, so we've recorded our results below for your convenience:\n",
    "\n",
    "| Model | MMLU Pro Math Accuracy |\n",
    "|-------|------------------------|\n",
    "| R1 Distill 1.5B | 38.49% +- 1.32% |\n",
    "| R1 Distill 70B | 61.07% +- 1.33% |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run Evals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile $tutorial_dir/eval_small.yaml\n",
    "\n",
    "model:\n",
    "  model_name: \"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\"\n",
    "  torch_dtype_str: \"bfloat16\"\n",
    "  # shard_for_eval: True # Uncomment this line for multi-gpu setups.\n",
    "\n",
    "\n",
    "tasks:\n",
    "  - evaluation_backend: lm_harness\n",
    "    task_name: mmlu_pro_math\n",
    "\n",
    "output_dir: \"distillation_tutorial/output/evaluation\"\n",
    "generation:\n",
    "  batch_size: 1 # LM Harness recommends BS=1 for reproducibility.\n",
    "  # batch_size: 256  # Replace with 256 for 8xA100-80GB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!oumi evaluate -c \"$tutorial_dir/eval_small.yaml\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile $tutorial_dir/eval_large.yaml\n",
    "\n",
    "model:\n",
    "  model_name: \"deepseek-ai/DeepSeek-R1-Distill-Llama-70B\"\n",
    "  torch_dtype_str: \"bfloat16\"\n",
    "  shard_for_eval: True\n",
    "\n",
    "\n",
    "tasks:\n",
    "  - evaluation_backend: lm_harness\n",
    "    task_name: mmlu_pro_math\n",
    "\n",
    "output_dir: \"distillation_tutorial/output/evaluation\"\n",
    "generation:\n",
    "  batch_size: 1 # LM Harness recommends BS=1 for reproducibility.\n",
    "  # batch_size: 64  # Replace with 64 for 8xA100-80GB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!oumi evaluate -c \"$tutorial_dir/eval_large.yaml\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Inference Data\n",
    "\n",
    "Now that we've set our baseline numbers, let's prepare the training data we'll use to improve 1.5B.\n",
    "\n",
    "Given our goal is to improve MMLU Pro Math performance, we should ideally pick data that's similar.\n",
    "\n",
    "`meta-math/MetaMathQA` is a good choice as it avoids test set contamination while being similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import datasets\n",
    "import torch\n",
    "\n",
    "from oumi.core.configs import InferenceConfig\n",
    "from oumi.core.types import Conversation, Message, Role\n",
    "from oumi.inference import VLLMInferenceEngine\n",
    "\n",
    "# This is needed for vLLM to use multiple GPUs in a notebook.\n",
    "# If you're not running in a notebook, you can ignore this.\n",
    "os.environ[\"VLLM_WORKER_MULTIPROC_METHOD\"] = \"spawn\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = datasets.load_dataset(\n",
    "    \"meta-math/MetaMathQA\",\n",
    "    revision=\"aa4f34d\",\n",
    "    split=\"train[:10000]\",  # We'll focus only on the first 10k samples.\n",
    ")\n",
    "\n",
    "data = [sample[\"query\"] for sample in dataset]\n",
    "print(data[0])\n",
    "print(\"num samples: \", len(data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations = [\n",
    "    Conversation(\n",
    "        messages=[\n",
    "            Message(role=Role.USER, content=prompt),\n",
    "        ]\n",
    "    )\n",
    "    for prompt in data\n",
    "]\n",
    "print(conversations[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run Inference\n",
    "\n",
    "Now that our data is in the right format for collecting responses, let's go ahead and run inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile $tutorial_dir/infer_large.yaml\n",
    "\n",
    "model:\n",
    "  model_name: \"deepseek-ai/DeepSeek-R1-Distill-Llama-70B\"\n",
    "  torch_dtype_str: \"bfloat16\"\n",
    "  model_max_length: 8192\n",
    "\n",
    "generation:\n",
    "  max_new_tokens: 8192"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# Download, and load the model in memory\n",
    "# This may take a while, depending on your internet speed.\n",
    "# The inference engine only needs to be loaded once and can be\n",
    "# reused for multiple conversations.\n",
    "config_path = f\"{tutorial_dir}/infer_large.yaml\"\n",
    "config = InferenceConfig.from_yaml(config_path)\n",
    "\n",
    "inference_engine = VLLMInferenceEngine(\n",
    "    config.model,\n",
    "    tensor_parallel_size=torch.cuda.device_count(),  # use all available GPUs\n",
    "    # Enable prefix caching for vLLM.\n",
    "    # This is key for performance when running prompts with a long prefix,\n",
    "    # such as judging or conversations with large system prompts\n",
    "    # or few-shot examples.\n",
    "    enable_prefix_caching=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "print(f\"Running inference for {len(conversations)} conversations\")\n",
    "\n",
    "generations = inference_engine.infer(\n",
    "    input=conversations,\n",
    "    inference_config=config,\n",
    ")\n",
    "print(generations[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Training Data\n",
    "\n",
    "Now that we've finished collecting responses, let's go ahead and prepare the data for training and save it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversation_dicts = [c.to_dict() for c in generations]\n",
    "print(conversation_dicts[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "dataframe = pd.DataFrame(conversation_dicts)\n",
    "print(dataframe)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataframe.to_json(f\"{tutorial_dir}/math_train_10k.jsonl\", orient=\"records\", lines=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run Distillation\n",
    "\n",
    "Now that the data is ready, we can begin distilling the model. For this form of distillation, we will be fully fine-tuning the model with supervised fine-tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile $tutorial_dir/train.yaml\n",
    "\n",
    "model:\n",
    "  model_name: \"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\"\n",
    "  trust_remote_code: true\n",
    "  torch_dtype_str: \"bfloat16\"\n",
    "  device_map: \"auto\"\n",
    "\n",
    "data:\n",
    "  train:\n",
    "    datasets:\n",
    "      - dataset_name: \"text_sft_jsonl\"\n",
    "        dataset_path: \"./distillation_tutorial/math_train_10k.jsonl\"\n",
    "        split: \"train\"\n",
    "        shuffle: True\n",
    "        seed: 42\n",
    "    seed: 42\n",
    "\n",
    "training:\n",
    "  output_dir: \"distillation_tutorial/output/finetune\"\n",
    "\n",
    "  # For a single GPU, the following gives us a batch size of 16\n",
    "  # If training with multiple GPUs, feel free to reduce gradient_accumulation_steps\n",
    "  per_device_train_batch_size: 2\n",
    "  gradient_accumulation_steps: 8  # Reduce this to 1 for 8xA100-80GB GPUs\n",
    "  \n",
    "  # ***NOTE***\n",
    "  # We set it to 10 steps to first verify that it works\n",
    "  # Comment out the line below to have it train for 1 full epoch (all the data) instead.\n",
    "  # Note: 1 full epoch will take about 13 minutes on 8xA100-80GB.\n",
    "  max_steps: 10\n",
    "\n",
    "  num_train_epochs: 1\n",
    "  learning_rate: 1e-4\n",
    "  warmup_ratio: 0.1\n",
    "  logging_steps: 10\n",
    "  save_steps: 0\n",
    "  max_grad_norm: 10\n",
    "  weight_decay: 0.01\n",
    "\n",
    "  \n",
    "  trainer_type: \"TRL_SFT\"\n",
    "  optimizer: \"adamw_torch_fused\"\n",
    "  enable_gradient_checkpointing: True\n",
    "  gradient_checkpointing_kwargs:\n",
    "    use_reentrant: False\n",
    "  ddp_find_unused_parameters: False\n",
    "  dataloader_num_workers: \"auto\"\n",
    "  dataloader_prefetch_factor: 32\n",
    "  empty_device_cache_steps: 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Single GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!oumi train -c \"$tutorial_dir/train.yaml\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multi-GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!oumi distributed torchrun -m oumi train -c \"$tutorial_dir/train.yaml\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate\n",
    "\n",
    "Now that we have a new distilled model, let's evaluate it on the same benchmark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile $tutorial_dir/eval_small_fft.yaml\n",
    "\n",
    "model:\n",
    "  model_name: \"./distillation_tutorial/output/finetune/\"\n",
    "  torch_dtype_str: \"bfloat16\"\n",
    "  shard_for_eval: True\n",
    "\n",
    "\n",
    "tasks:\n",
    "  - evaluation_backend: lm_harness\n",
    "    task_name: mmlu_pro_math\n",
    "\n",
    "output_dir: \"distillation_tutorial/output/evaluation\"\n",
    "generation:\n",
    "  batch_size: 1 # LM Harness recommends BS=1 for reproducibility.\n",
    "  # batch_size: 256  # Replace with 256 for 8xA100-80GB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!oumi evaluate -c \"$tutorial_dir/eval_small_fft.yaml\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results\n",
    "\n",
    "After we finetuned the model following the steps above, we achieved the following results:\n",
    "\n",
    "| Model           | Accuracy        |\n",
    "|-----------------|-----------------|\n",
    "| R1 Distill 1.5B | 38.49% +- 1.32% |\n",
    "| Oumi R1 Distill 1.5B | 42.41% +- 1.34% |\n",
    "| R1 Distill 70B  | 61.07% +- 1.33% |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🧭 What's Next?\n",
    "\n",
    "Congrats on finishing this notebook! Feel free to check out our other [notebooks](https://github.com/oumi-ai/oumi/tree/main/notebooks) in the [Oumi GitHub](https://github.com/oumi-ai/oumi), and give us a star! You can also join the Oumi community over on [Discord](https://discord.gg/oumi).\n",
    "\n",
    "📰 Want to keep up with news from Oumi? Subscribe to our [Substack](https://blog.oumi.ai/) and [Youtube](https://www.youtube.com/@Oumi_AI)!\n",
    "\n",
    "⚡ Interested in building custom AI in hours, not months? Apply to get [early access](https://oumi.ai/contact?utm_source=oumi_oss_tutorial_distill_large_model) to the Oumi Platform, or [chat with us](https://oumi.ai/book?utm_source=oumi_oss_tutorial_distill_large_model) to learn more!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "oumi",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}