ChatGPT Models Experiment


I’ve been playing with ChatGPT lately, and recently started experimenting with the OpenAI API using Python. Since I’m relatively new to AI concepts, I’ve been reading up a bit to try to get up to speed.

As I did that, I realized that OpenAI exposes four different models through the API: ada, babbage, curie, and davinci. While it was easy enough to understand that the models offer different levels of capability, it wasn’t at all clear to me what that really means in a practical sense.

I wanted to better understand the differences between the models and the effect that model selection has on the output. This can be done using the GPTools Comparison Tool, but I wanted something I could control more fully.

It turns out that doing this is a pretty straightforward process in Python: you can run the exact same prompt with the same settings and change only the model to see how different the results are. So, I set up the following code for this task: it takes a prompt, runs it through each model with a temperature of 0 (so the output is as deterministic as possible), and prints each response along with the name of the model that generated it.

#!/usr/bin/env python3
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

# The four completion models, roughly from least to most capable.
models = [
    "text-ada-001",
    "text-babbage-001",
    "text-curie-001",
    "text-davinci-003",
]

prompt = "complete this sentence: \"This is...\""
temperature = 0        # 0 = always take the most likely token, so output is deterministic
max_tokens = 150       # cap on the length of each completion
top_p = 1              # no nucleus truncation; consider the full token distribution
frequency_penalty = 1  # discourage tokens in proportion to how often they've already appeared
presence_penalty = 1   # discourage tokens that have already appeared at all

# Run the identical prompt and settings through each model, printing the
# response labeled with the model's short name (e.g. "text-ada-001" -> "ada").
for model in models:
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty,
    )
    print("%s:%s\n\n" % (model.split("-")[1], response.choices[0].text))

Running the code above with the sample prompt it contains doesn’t really help much with understanding the differences between the models (although it does highlight how hilariously snarky davinci can appear to be at times):

ada:
This is a house.

babbage:
This is a book.

curie:
This is a pen.

davinci:
This is a sentence.

One reason it’s tough to see a difference is that the task itself is so simple that the additional training behind the larger models doesn’t have much room to make a difference here.

Give the AI a more complex task, though, like asking each model to brainstorm slogans for an airport, and the difference becomes significantly clearer.
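
The only change needed in the script above is the prompt:

prompt = "brainstorm 5 slogans for an airport."

Running it again with that prompt produces: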

ada:
1. airport is the place to be for a getaway car
2. airport is the place to be when your just looking for some peace and quiet
3. airport is the place to be when you need it the most
4. airport is the place to be for a break from the everyday routine
5. airport is the place to be for a chance against time

babbage:
1. "Welcome to the airport!"
2. "The best airport in the world!"
3. "A place where you can be yourself!"
4. "The perfect place to fly!"
5. "The best airport in the world for travel."

curie:
1. Welcome to the Airport of the Future!
2. We're Making Flying Easy Again!
3. Fly with Confidence, We've Got You Covered!
4. The Best Airport in the World is Right Here at Our Airports!
5. Let's Go Places - With Our Airports by Your Side

davinci:
1. "Fly with Us and Soar to New Heights!"
2. "Your Gateway to the World"
3. "Where Adventure Takes Off"
4. "The Sky's the Limit at Our Airport"
5. "Take Flight with Us!"

In this case, the difference in creativity between the models is much clearer.

A couple of things I noticed going through this exercise: 

  1. For some reason, the curie model (and, in this run, ada as well) doesn’t wrap its output in quotes the way babbage and davinci do.
  2. Running the same prompt repeatedly resulted in the exact same responses from the models. That makes sense given a temperature of 0: the model always picks its most likely next token, so the output is deterministic. I’d half expected the results to vary on each run, based on my interactions with ChatGPT through the website, but that variation presumably comes from sampling with a nonzero temperature (a quick way to see this is sketched after this list).
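
Here’s a minimal sketch of that, as a variation of the script above; the model, the temperature value, and the number of runs are just my own picks for illustration. With any temperature above 0 the model samples from its token probabilities rather than always taking the most likely one, so repeated runs of the same prompt generally come back different:

#!/usr/bin/env python3
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

prompt = "brainstorm 5 slogans for an airport."

# With a nonzero temperature the model samples instead of always taking the
# most likely token, so repeated runs of the same prompt usually differ.
for run in range(3):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.9,  # arbitrary nonzero value; 0 would make this deterministic again
        max_tokens=150,
    )
    print("run %d:%s\n" % (run + 1, response.choices[0].text))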

Overall, this exercise helped me understand how differently the models can behave in some circumstances, and also where the choice of model doesn’t make much difference. Since there is a pretty significant cost difference between the models, that latter point is useful to know: for the kinds of tasks where the responses barely differ, I can use the least expensive model.