{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Oumi Judge.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simple Judge\n",
    "\n",
    "To enable LLM judgments, Oumi offers [Simple Judge](https://oumi.ai/docs/en/latest/user_guides/judge/judge.html#quick-start), a powerful framework that allows users to set their own evaluation criteria, judgment prompts, output format, and set the underlying model to any open- or closed-source hosted model.\n",
    "\n",
    "## Why Use LLM Judges?\n",
    "\n",
    "As LLMs continue to evolve, traditional evaluation benchmarks, which focus primarily on task-specific metrics, are increasingly inadequate for capturing the full scope of a model's generative potential. In real-world applications, LLM capabilities such as creativity, coherence, and the ability to effectively handle nuanced and open-ended queries are critical and cannot be fully assessed through standardized metrics alone. While human raters are often employed to evaluate these aspects, the process is costly and time-consuming. As a result, the use of LLM-based evaluation systems, or \"LLM judges\", has gained traction as a more scalable and efficient alternative."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "### Oumi Installation\n",
    "\n",
    "First, let's install Oumi. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install oumi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tutorial Directory Setup\n",
    "\n",
    "Next, we will create a directory for the tutorial, to store the evaluation configuration and the experimental results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "tutorial_dir = \"judge_tutorial\"\n",
    "\n",
    "Path(tutorial_dir).mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OpenAI Access\n",
    "\n",
    "In this notebook, we use GPT 4o as the underlying judge model. To access the GPT-4 models, an OpenAI API key is necessary. You can find instructions for creating an OpenAI account and generating an API key on [OpenAI's quickstart webpage](https://platform.openai.com/docs/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"<MY_OPENAI_TOKEN>\"  # Specify your OpenAI API key here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Judge Config Definition\n",
    "\n",
    "The `JudgeConfig` fully defines the judge's behavior. In the example below, we are creating a judge config for a simple truthfulness judge and saving it locally as `./judge_tutorial/judge_config.yaml`\n",
    "\n",
    "To do so, we set the following:\n",
    "- We set a `prompt_template` that clearly states that the model is acting as a truthfulness judge. The judge is boolean, so the prompt explains that responding \"Yes\" indicates a truthful response and \"No\" indicates a dishonest response. It also contains placeholders (`{request}` and `{response}`) for all the inputs expected in the dataset. These will be replaced at runtime with each example's actual inputs.\n",
    "- We set the `response_format` to `JSON`. `JSON` and `XML` are the most reliable output formats for judges. \n",
    "- We set the `judgment_type` to `BOOL`, since this is a boolean (Yes/No) judge\n",
    "- We set `include_explanation` to `True`, so that the judge model also provides a rationale for its judgment.\n",
    "- We set the `model_name` to `gpt-4o` and the `engine` to `OPENAI` to use GPT 4o as our underlying model. The judge framework supports most open- and closed- source models.\n",
    "- We set the `generation` params for our model (`max_new_tokens` to 8192 and `temperature` to 1.0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "yaml_content = \"\"\"\n",
    "judge_params:\n",
    "  prompt_template: |\n",
    "    You are a truthfulness judge.\n",
    "    You must assess whether the response below is factually accurate.\n",
    "    If the response is truthful, respond with 'Yes'. Otherwise, respond with 'No'.\n",
    "\n",
    "    ***\n",
    "    [request]:\n",
    "    {request}\n",
    "    ***\n",
    "    [response]:\n",
    "    {response}\n",
    "    ***\n",
    "\n",
    "  response_format: JSON\n",
    "  judgment_type: BOOL\n",
    "  include_explanation: True\n",
    "\n",
    "inference_config:\n",
    "  model:\n",
    "    model_name: \"gpt-4o\"\n",
    "\n",
    "  engine: OPENAI\n",
    "\n",
    "  generation:\n",
    "    max_new_tokens: 8192\n",
    "    temperature: 1.0\n",
    "\"\"\"\n",
    "\n",
    "with open(f\"{tutorial_dir}/judge_config.yaml\", \"w\") as f:\n",
    "    f.write(yaml_content)"
   ]
  },
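  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the placeholder substitution concrete, the cell below renders a shortened template for one example using Python's built-in `str.format`. This is only an approximation for illustration: the names `example_template` and `example` are hypothetical, and Oumi's internal rendering may differ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration only: approximate the runtime placeholder substitution with\n",
    "# str.format. This is an assumption, not Oumi's actual implementation.\n",
    "example_template = (\n",
    "    \"You are a truthfulness judge.\\n\"\n",
    "    \"[request]:\\n{request}\\n\"\n",
    "    \"[response]:\\n{response}\\n\"\n",
    ")\n",
    "\n",
    "# A single dataset example: its keys match the placeholder names.\n",
    "example = {\n",
    "    \"request\": \"What's the capital of France?\",\n",
    "    \"response\": \"The capital of France is Paris.\",\n",
    "}\n",
    "\n",
    "# Each {name} in the template is replaced by the value stored under that key.\n",
    "print(example_template.format(**example))"
   ]
  },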
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset Definition\n",
    "\n",
    "Our dataset must include a `{request}` and a `{response}` for every example, as indicated in the `prompt_template` of our judge config.\n",
    "Here, we include one truthful and one dishonest example. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = [\n",
    "    {\n",
    "        \"request\": \"What's the capital of France?\",\n",
    "        \"response\": \"The capital of France is Paris.\",  # Truthful answer\n",
    "    },\n",
    "    {\n",
    "        \"request\": \"What is the sum of 1 and 1 in binary?\",\n",
    "        \"response\": \"The sum is 11 in binary.\",  # Dishonest answer\n",
    "    },\n",
    "]"
   ]
  },
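  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since every placeholder needs a matching field, it can help to check the dataset before spending API calls. The cell below is a minimal sketch of such a check using the standard library's `string.Formatter`. The helper `find_missing_fields` and the demo data are hypothetical and not part of Oumi; the second demo example omits `response` on purpose to show what a failure looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from string import Formatter\n",
    "\n",
    "\n",
    "def find_missing_fields(template, examples):\n",
    "    \"\"\"Return (index, missing_field_names) for each incomplete example.\"\"\"\n",
    "    # Formatter().parse yields (literal, field_name, spec, conversion) tuples.\n",
    "    placeholders = {name for _, name, _, _ in Formatter().parse(template) if name}\n",
    "    problems = []\n",
    "    for index, example in enumerate(examples):\n",
    "        missing = sorted(placeholders - set(example))\n",
    "        if missing:\n",
    "            problems.append((index, missing))\n",
    "    return problems\n",
    "\n",
    "\n",
    "# Hypothetical demo data (not the tutorial dataset).\n",
    "demo_template = \"[request]: {request} [response]: {response}\"\n",
    "demo_examples = [\n",
    "    {\n",
    "        \"request\": \"What's the capital of France?\",\n",
    "        \"response\": \"The capital of France is Paris.\",\n",
    "    },\n",
    "    {\"request\": \"What is the sum of 1 and 1 in binary?\"},  # No \"response\"\n",
    "]\n",
    "\n",
    "print(find_missing_fields(demo_template, demo_examples))  # [(1, ['response'])]"
   ]
  },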
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Judgement\n",
    "\n",
    "After defining the judge config and the dataset, we are ready to instantiate `SimpleJudge` and perform the judgement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:torchao.kernel.intmm:Warning: Detected no triton, on systems without Triton certain kernels will not work\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 07-22 12:04:03 [__init__.py:256] Automatically detected platform cpu.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 2/2 [01:07<00:00, 33.91s/it]\n"
     ]
    }
   ],
   "source": [
    "from oumi.judges.simple_judge import SimpleJudge\n",
    "\n",
    "truthfulness_judge = SimpleJudge(f\"{tutorial_dir}/judge_config.yaml\")\n",
    "outputs = truthfulness_judge.judge(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect Results\n",
    "\n",
    "Finally, we can inspect the judgments and their corresponding explanations as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Request: What's the capital of France?\n",
      "Response: The capital of France is Paris.\n",
      "Judgment: True\n",
      "Explanation: The response accurately states that the capital of France is Paris. Paris has been the capital of France for centuries and is widely recognized as such internationally.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Request: What is the sum of 1 and 1 in binary?\n",
      "Response: The sum is 11 in binary.\n",
      "Judgment: False\n",
      "Explanation: In binary addition, 1 plus 1 is equal to 10, not 11. Therefore, the response provided is not factually accurate.\n",
      "----------------------------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "for input, output in zip(dataset, outputs):\n",
    "    judgment = output.field_values[\"judgment\"]\n",
    "    explanation = output.field_values[\"explanation\"]\n",
    "    request = input[\"request\"]\n",
    "    response = input[\"response\"]\n",
    "\n",
    "    print(f\"Request: {request}\")\n",
    "    print(f\"Response: {response}\")\n",
    "    print(f\"Judgment: {judgment}\")\n",
    "    print(f\"Explanation: {explanation}\")\n",
    "    print(\"-\" * 100)"
   ]
  },
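  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tutorial directory can also hold the results. The cell below is a minimal sketch of saving judgments as JSON Lines and computing the share of responses judged truthful. The helper `save_judgments` and the file name `judge_results.jsonl` are hypothetical, not part of Oumi, and the records are stand-ins mirroring the output printed above. In the notebook you would build them from `dataset` and `outputs` instead, assuming the values in `field_values` are JSON-serializable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def save_judgments(records, output_path):\n",
    "    \"\"\"Write one JSON object per line so results are easy to reload.\"\"\"\n",
    "    output_path = Path(output_path)\n",
    "    output_path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    with output_path.open(\"w\", encoding=\"utf-8\") as f:\n",
    "        for record in records:\n",
    "            f.write(json.dumps(record, ensure_ascii=False) + \"\\n\")\n",
    "    return output_path\n",
    "\n",
    "\n",
    "# Stand-in records mirroring the judgments printed above (hypothetical layout).\n",
    "records = [\n",
    "    {\"request\": \"What's the capital of France?\", \"judgment\": True},\n",
    "    {\"request\": \"What is the sum of 1 and 1 in binary?\", \"judgment\": False},\n",
    "]\n",
    "\n",
    "saved_path = save_judgments(records, \"judge_tutorial/judge_results.jsonl\")\n",
    "\n",
    "# Booleans sum as 0/1, so this is the fraction of responses judged truthful.\n",
    "truthful_rate = sum(record[\"judgment\"] for record in records) / len(records)\n",
    "print(f\"Saved {len(records)} judgments to {saved_path}\")\n",
    "print(f\"Truthful rate: {truthful_rate:.0%}\")"
   ]
  },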
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🧭 What's Next?\n",
    "\n",
    "Congrats on finishing this notebook! Feel free to check out our other [notebooks](https://github.com/oumi-ai/oumi/tree/main/notebooks) in the [Oumi GitHub](https://github.com/oumi-ai/oumi), and give us a star! You can also join the Oumi community over on [Discord](https://discord.gg/oumi).\n",
    "\n",
    "📰 Want to keep up with news from Oumi? Subscribe to our [Substack](https://blog.oumi.ai/) and [Youtube](https://www.youtube.com/@Oumi_AI)!\n",
    "\n",
    "⚡ Interested in building custom AI in hours, not months? Apply to get [early access](https://oumi.ai/contact?utm_source=oumi_oss_tutorial_simple_judge) to the Oumi Platform, or [chat with us](https://oumi.ai/book?utm_source=oumi_oss_tutorial_simple_judge) to learn more!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "oumi",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}