{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Evaluation with AlpacaEval 2.0.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation with AlpacaEval 2.0\n",
    "\n",
    "This notebook demonstrates how to run end-to-end evaluations for your trained model with [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval). AlpacaEval is an LLM-based automatic evaluation suite that is fast, cheap, replicable, and validated against 20K human annotations.\n",
    "\n",
    "Evaluating with AlpacaEval is a 2-step process:\n",
    "1. **Inference**: Generate model responses for 805 AlpacaEval prompts using Oumi's inference engine\n",
    "2. **Judgement**: Use GPT-4 Turbo as a judge to compare your model's responses against reference responses and calculate win rates\n",
    "\n",
    "**Resources:**\n",
    "- [AlpacaEval V2.0 Paper](https://arxiv.org/abs/2404.04475)\n",
    "- [AlpacaEval Dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_eval)\n",
    "- [Leaderboard](https://tatsu-lab.github.io/alpaca_eval/)\n",
    "- [Official Repository](https://github.com/tatsu-lab/alpaca_eval)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites and Configuration\n",
    "\n",
    "First, install the required packages. The `alpaca_eval` package requires Python >= 3.10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install -q oumi alpaca_eval pandas"
   ]
  },
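  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, verify that the running kernel satisfies the Python >= 3.10 requirement before importing `alpaca_eval`. This is a minimal standard-library check:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "# Fail fast if this kernel is older than the Python 3.10 minimum\n",
    "# required by the alpaca_eval package.\n",
    "assert sys.version_info >= (3, 10), \"alpaca_eval requires Python >= 3.10\""
   ]
  },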
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AlpacaEval uses GPT-4 Turbo as the default judge. To access GPT-4 models, an OpenAI API key is required. Details on creating an OpenAI account and generating a key can be found at [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"\"  # NOTE: Set your OpenAI API key here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**⚠️ Cost considerations**: AlpacaEval 2.0 uses GPT-4 Turbo to judge 805 examples. Please visit [OpenAI's pricing](https://openai.com/api/pricing/) page for current costs. Since this notebook is sample code, we will only evaluate a small subset of examples to reduce costs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM_EXAMPLES = 5  # Set to None to evaluate all 805 examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Configure your model. You can use a HuggingFace model ID or a path to a local model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_PATH = \"HuggingFaceTB/SmolLM2-135M-Instruct\"\n",
    "MODEL_DISPLAY_NAME = \"my_model\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Load the AlpacaEval Dataset\n",
    "\n",
    "Load the AlpacaEval dataset using Oumi's `AlpacaEvalDataset` class. This dataset contains 805 open-ended prompts for evaluating instruction-following capabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from oumi.datasets import AlpacaEvalDataset\n",
    "\n",
    "# Load the dataset\n",
    "dataset = AlpacaEvalDataset()\n",
    "\n",
    "print(f\"Dataset size: {len(dataset)} examples\")\n",
    "\n",
    "# Preview a sample prompt using the conversation() method\n",
    "sample_conv = dataset.conversation(0)\n",
    "print(\"\\nSample prompt:\")\n",
    "print(sample_conv.messages[0].content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Extract Prompts for Inference\n",
    "\n",
    "Extract the prompts from the dataset conversations to prepare for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Determine number of examples to evaluate\n",
    "num_to_evaluate = NUM_EXAMPLES if NUM_EXAMPLES else len(dataset)\n",
    "\n",
    "# Extract prompts from conversations\n",
    "prompts = [dataset.conversation(i).messages[0].content for i in range(num_to_evaluate)]\n",
    "\n",
    "print(f\"Extracted {len(prompts)} prompts for evaluation\")\n",
    "print(\"\\nFirst few prompts:\")\n",
    "for i, prompt in enumerate(prompts[:3]):\n",
    "    print(f\"  {i + 1}. {prompt[:80]}...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Run Inference with Oumi\n",
    "\n",
    "Generate model responses for the AlpacaEval prompts using Oumi's inference capabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from oumi import infer\n",
    "from oumi.core.configs import (\n",
    "    GenerationParams,\n",
    "    InferenceConfig,\n",
    "    ModelParams,\n",
    ")\n",
    "\n",
    "# Configure inference\n",
    "config = InferenceConfig(\n",
    "    model=ModelParams(\n",
    "        model_name=MODEL_PATH,\n",
    "        trust_remote_code=True,\n",
    "    ),\n",
    "    generation=GenerationParams(\n",
    "        max_new_tokens=2048,\n",
    "        temperature=0.7,\n",
    "        top_p=0.9,\n",
    "    ),\n",
    ")\n",
    "\n",
    "print(f\"Running inference on {len(prompts)} prompts...\")\n",
    "print(f\"Model: {MODEL_PATH}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run inference - this returns Conversation objects with both prompt and response\n",
    "responses = infer(config, prompts)\n",
    "\n",
    "print(f\"\\nGenerated {len(responses)} responses\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect a sample response\n",
    "print(\"Sample response:\")\n",
    "print(f\"Prompt: {responses[0].messages[0].content[:200]}...\")\n",
    "print(f\"\\nResponse: {responses[0].messages[-1].content[:500]}...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Format Responses for AlpacaEval\n",
    "\n",
    "Convert the Oumi conversation format to the AlpacaEval format required by the evaluation framework."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from oumi.datasets.evaluation.utils import conversations_to_alpaca_format\n",
    "\n",
    "# Convert to AlpacaEval format\n",
    "alpaca_format_responses = conversations_to_alpaca_format(responses)\n",
    "\n",
    "# Add generator name to each response (required by AlpacaEval)\n",
    "for response in alpaca_format_responses:\n",
    "    response[\"generator\"] = MODEL_DISPLAY_NAME\n",
    "\n",
    "# Convert to DataFrame for AlpacaEval\n",
    "responses_df = pd.DataFrame(alpaca_format_responses)\n",
    "\n",
    "print(f\"Formatted {len(responses_df)} responses for AlpacaEval\")\n",
    "print(f\"\\nColumns: {list(responses_df.columns)}\")\n",
    "responses_df.head()"
   ]
  },
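  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before invoking the judge, it is worth a quick schema check. The sketch below assumes the standard AlpacaEval output format, where each record provides `instruction`, `output`, and `generator` fields:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity-check that every field AlpacaEval expects is present\n",
    "# (assumes the standard schema: instruction, output, generator).\n",
    "required_fields = {\"instruction\", \"output\", \"generator\"}\n",
    "missing_fields = required_fields - set(responses_df.columns)\n",
    "assert not missing_fields, f\"Missing required AlpacaEval fields: {missing_fields}\"\n",
    "print(\"All required AlpacaEval fields are present.\")"
   ]
  },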
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Run AlpacaEval Judgment\n",
    "\n",
    "Use the AlpacaEval framework to judge your model's responses against reference responses. The judge (GPT-4 Turbo by default) compares each response and calculates win rates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import alpaca_eval\n",
    "\n",
    "# Set AlpacaEval 2.0 configuration\n",
    "os.environ[\"IS_ALPACA_EVAL_2\"] = \"True\"\n",
    "\n",
    "# Run evaluation\n",
    "# Note: This will make API calls to the judge model (GPT-4 Turbo)\n",
    "print(\"Running AlpacaEval judgment...\")\n",
    "print(\"This may take a few minutes depending on the number of examples.\\n\")\n",
    "\n",
    "result = alpaca_eval.evaluate(\n",
    "    model_outputs=responses_df,\n",
    "    annotators_config=\"weighted_alpaca_eval_gpt4_turbo\",\n",
    "    is_return_instead_of_print=True,\n",
    "    max_instances=num_to_evaluate,\n",
    ")\n",
    "\n",
    "print(\"\\nEvaluation complete!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: View Results\n",
    "\n",
    "Examine the evaluation results including win rates and other metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display results\n",
    "if result is not None:\n",
    "    print(\"=\" * 50)\n",
    "    print(\"AlpacaEval 2.0 Results\")\n",
    "    print(\"=\" * 50)\n",
    "    print(result)\n",
    "else:\n",
    "    print(\"Results were printed above.\")"
   ]
  },
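  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To pull out a single headline number, you can index the leaderboard by generator name. This is a minimal sketch that assumes AlpacaEval's standard `win_rate` column and generator-indexed rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract this model's row from the leaderboard.\n",
    "# Assumes AlpacaEval's standard columns (e.g., \"win_rate\") and that\n",
    "# rows are indexed by generator name.\n",
    "if MODEL_DISPLAY_NAME in df_leaderboard.index:\n",
    "    model_row = df_leaderboard.loc[MODEL_DISPLAY_NAME]\n",
    "    print(f\"Win rate for {MODEL_DISPLAY_NAME}: {model_row['win_rate']:.2f}%\")\n",
    "else:\n",
    "    print(f\"{MODEL_DISPLAY_NAME} not found in the leaderboard index.\")"
   ]
  },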
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Optional] Save Results for Reproducibility\n",
    "\n",
    "Save the configuration and results for future reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import json\n",
    "\n",
    "# Save configuration and results\n",
    "evaluation_config_dict = {\n",
    "    \"model\": {\n",
    "        \"model_path\": MODEL_PATH,\n",
    "        \"model_display_name\": MODEL_DISPLAY_NAME,\n",
    "    },\n",
    "    \"alpaca_eval\": {\n",
    "        \"version\": \"2.0\",\n",
    "        \"annotator\": \"weighted_alpaca_eval_gpt4_turbo\",\n",
    "        \"num_examples\": num_to_evaluate,\n",
    "    },\n",
    "    \"timestamp\": str(datetime.datetime.now()),\n",
    "}\n",
    "\n",
    "# Save to file\n",
    "output_path = \"alpaca_eval_config.json\"\n",
    "with open(output_path, \"w\") as f:\n",
    "    json.dump(evaluation_config_dict, f, indent=2)\n",
    "\n",
    "print(f\"Configuration saved to {output_path}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save model responses\n",
    "responses_path = f\"{MODEL_DISPLAY_NAME}_alpaca_eval_responses.json\"\n",
    "responses_df.to_json(responses_path, orient=\"records\", indent=2)\n",
    "print(f\"Responses saved to {responses_path}\")"
   ]
  },
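  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also persist the per-example judge annotations returned by `alpaca_eval.evaluate` in Step 5, so that individual judgments can be revisited later. A minimal sketch (the annotations filename is just a convention):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the per-example judge annotations alongside the responses.\n",
    "# pd.DataFrame() accepts either a DataFrame or a list of dicts here.\n",
    "annotations_path = f\"{MODEL_DISPLAY_NAME}_alpaca_eval_annotations.json\"\n",
    "pd.DataFrame(annotations).to_json(annotations_path, orient=\"records\", indent=2)\n",
    "print(f\"Annotations saved to {annotations_path}\")"
   ]
  },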
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🧭 What's Next?\n",
    "\n",
    "Congrats on finishing this notebook! Feel free to check out our other [notebooks](https://github.com/oumi-ai/oumi/tree/main/notebooks) in the [Oumi GitHub](https://github.com/oumi-ai/oumi), and give us a star! You can also join the Oumi community over on [Discord](https://discord.gg/oumi).\n",
    "\n",
    "📰 Want to keep up with news from Oumi? Subscribe to our [Substack](https://blog.oumi.ai/) and [Youtube](https://www.youtube.com/@Oumi_AI)!\n",
    "\n",
    "⚡ Interested in building custom AI in hours, not months? Apply to get [early access](https://oumi.ai/contact?utm_source=oumi_oss_tutorial_eval_alpaca) to the Oumi Platform, or [chat with us](https://oumi.ai/book?utm_source=oumi_oss_tutorial_eval_alpaca) to learn more!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "oumi",
   "language": "python",
   "name": "oumi"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
