Using llama.cpp Binding for Python


Note: Before running this notebook, follow README.md to download the GGUF model first.

Install the llama-cpp-python binding

!uv pip install llama-cpp-python
Using Python 3.13.1 environment at: /Users/simon/Dev/CS-394/.venv

Resolved 6 packages in 110ms                                         

Installed 2 packages in 4ms

 + diskcache==5.6.3

 + llama-cpp-python==0.3.16

Load the local GGUF model

from llama_cpp import Llama

GGUF_MODEL = "../code/gguf/gemma-3-1b-it-Q4_K_M.gguf"

llm = Llama(
      model_path=GGUF_MODEL,
      chat_format="gemma"
)
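If the model file is missing (e.g. the README download step was skipped), the Llama constructor fails with a low-level llama.cpp error, so a quick existence check up front gives a clearer message. A minimal sketch; the helper name model_available is ours, not part of llama-cpp-python:

```python
from pathlib import Path

def model_available(path: str) -> bool:
    """Return True if the path points at an existing .gguf file."""
    p = Path(path)
    return p.is_file() and p.suffix == ".gguf"

# Example check before constructing Llama(...):
if not model_available("../code/gguf/gemma-3-1b-it-Q4_K_M.gguf"):
    print("Model missing -- run the README download step first.")
```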

Chat with the model using the chat completion API

llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are a helpful assistant."},
          {
              "role": "user",
              "content": "Hello"
          }
      ]
)
llama_perf_context_print:        load time =     205.62 ms
llama_perf_context_print: prompt eval time =     205.11 ms /    10 tokens (   20.51 ms per token,    48.76 tokens per second)
llama_perf_context_print:        eval time =     394.64 ms /    32 runs   (   12.33 ms per token,    81.09 tokens per second)
llama_perf_context_print:       total time =     611.16 ms /    42 tokens
llama_perf_context_print:    graphs reused =         30
{'id': 'chatcmpl-2887bb2b-b1a5-428d-a42b-46c83a97fb6b',
 'object': 'chat.completion',
 'created': 1770147671,
 'model': '../code/gguf/gemma-3-1b-it-Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "Hello there! How's your day going so far? 😊 \n\nIs there anything you'd like to chat about, or need any help with?"},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 10, 'completion_tokens': 32, 'total_tokens': 42}}
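create_chat_completion returns an OpenAI-style dict, so the reply text and token counts can be pulled out with plain indexing. A sketch working from the structure shown above (the response literal is abridged from that output):

```python
# Shape of the dict returned above (content abridged for the example):
response = {
    "choices": [{
        "index": 0,
        "message": {"role": "assistant",
                    "content": "Hello there! How's your day going so far?"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 10, "completion_tokens": 32, "total_tokens": 42},
}

# The assistant's text and the total token count:
reply = response["choices"][0]["message"]["content"]
tokens_used = response["usage"]["total_tokens"]
print(reply)
print(tokens_used)
```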