OpenAI Streaming

In the last notebook, you probably noticed a long wait before the reply appeared. The model generates its answer one token at a time, but without streaming nothing is shown until the last token has been generated.

In this notebook, we set stream=True in the API call. This lets us capture and print each piece of the reply (one or more tokens) as it is generated, instead of waiting for the complete response.
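
To get a feel for the difference without calling any API, here is a minimal sketch that simulates a model. The generator `fake_model`, the helper `time_to_first_output`, and the 0.01 second delay are all made up for illustration; nothing here talks to OpenRouter or reflects real model speeds.

```python
import time

def fake_model(tokens, delay=0.01):
    """Hypothetical stand-in for a model: yields one token at a time."""
    for token in tokens:
        time.sleep(delay)  # pretend the model needs time for each token
        yield token

def time_to_first_output(tokens, stream):
    """Return the seconds that pass before the user sees any text."""
    start = time.perf_counter()
    if stream:
        # Streaming: the first token can be shown as soon as it is produced
        next(fake_model(tokens))
    else:
        # No streaming: every token must be produced before anything is shown
        "".join(fake_model(tokens))
    return time.perf_counter() - start

tokens = ["Hello", ", ", "this ", "is ", "a ", "test", "."]
streamed_wait = time_to_first_output(tokens, stream=True)
full_wait = time_to_first_output(tokens, stream=False)
print(f"wait with streaming: {streamed_wait:.3f}s, without: {full_wait:.3f}s")
```

Note that streaming does not make the whole answer finish sooner. It only shortens the wait before the first text appears.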

import openai
import os

client = openai.OpenAI(
  base_url='https://openrouter.ai/api/v1',
  api_key=os.environ["OPENROUTER_API_KEY"]
)

MODEL = "openai/gpt-oss-120b:free" #@param ["openai/gpt-oss-120b:free"]

response = client.chat.completions.create(
  model=MODEL,
  messages=[
    {"role": "system", "content": "You are an expert in dogs, based in Mongolia. Your role is to provide helpful information to students who want to learn more about keeping dogs at home. Answer in a concise but friendly manner."},
    {"role": "user", "content": "How many times a day should I feed my Mongolian Bankhar, and what food is best?"},
  ], 
  stream=True
)

for chunk in response:
  # Some chunks (for example, a final usage chunk) may have no choices; skip them
  if not chunk.choices:
    continue
  # Each chunk contains a delta with the new content
  token = chunk.choices[0].delta.content
  if token is not None:
    print(token, end='', flush=True)
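
The loop above prints each piece of text but does not keep it. If you also want the full reply afterwards (for example, to save it), one possible approach is to collect the pieces in a list while printing. The sketch below is an assumption-laden illustration: `collect_stream` and `fake_chunk` are hypothetical helpers, and the demo uses simulated chunks built with `SimpleNamespace` so it runs without an API key. With a real call, you would pass the `response` object instead.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Print each piece of text as it arrives and return the full reply."""
    pieces = []
    for chunk in chunks:
        # Some chunks may carry no choices at all; skip them
        if not chunk.choices:
            continue
        token = chunk.choices[0].delta.content
        if token:  # ignore None and empty strings
            print(token, end="", flush=True)
            pieces.append(token)
    print()  # end the printed line
    return "".join(pieces)

def fake_chunk(text):
    """Hypothetical helper: builds an object shaped like a streaming chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

# Simulated stream, including a chunk with no choices and one with no text
demo_chunks = [
    fake_chunk("Hello"),
    SimpleNamespace(choices=[]),
    fake_chunk(None),
    fake_chunk(" world"),
]
full_reply = collect_stream(demo_chunks)
```

Joining a list once at the end is a common Python habit for building a long string from many small parts.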

Now that you’ve seen streaming in action, reflect on the experience.

How did the streaming response feel compared to waiting for the full answer at once? Can you think of a real application — like a chatbot or a writing tool — where streaming would make a big difference to the user experience?

{ "question_type": "freeform", "question": "What parameter do you set to True in the API call to enable streaming?", "answer": "stream", "submitted_answer": "" }

{ "question_type": "true_false", "question": "Without streaming, the full response must be completely generated before any text is displayed.", "answer": "True", "submitted_answer": "" }

{ "question_type": "multiple_choice", "question": "What does chunk.choices[0].delta.content contain during streaming?", "options": [ { "key": "a", "text": "The entire completed response" }, { "key": "b", "text": "A small piece of the response text (one or more tokens)" }, { "key": "c", "text": "The total number of tokens generated so far" }, { "key": "d", "text": "The model’s confidence score" } ], "answer": "b", "submitted_answer": "" }