!uv pip install llama-cpp-python
Using Python 3.13.1 environment at: /Users/simon/Dev/CS-394/.venv
Resolved 6 packages in 110ms
Installed 2 packages in 4ms
 + diskcache==5.6.3
 + llama-cpp-python==0.3.16
Note: Before running this notebook, you should follow README.md to first download the GGUF model.
from llama_cpp import Llama
GGUF_MODEL = "../code/gguf/gemma-3-1b-it-Q4_K_M.gguf"
llm = Llama(
    model_path=GGUF_MODEL,
    chat_format="gemma",
)

llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ]
)
llama_perf_context_print: load time = 205.62 ms
llama_perf_context_print: prompt eval time = 205.11 ms / 10 tokens ( 20.51 ms per token, 48.76 tokens per second)
llama_perf_context_print: eval time = 394.64 ms / 32 runs ( 12.33 ms per token, 81.09 tokens per second)
llama_perf_context_print: total time = 611.16 ms / 42 tokens
llama_perf_context_print: graphs reused = 30
{'id': 'chatcmpl-2887bb2b-b1a5-428d-a42b-46c83a97fb6b',
'object': 'chat.completion',
'created': 1770147671,
'model': '../code/gguf/gemma-3-1b-it-Q4_K_M.gguf',
'choices': [{'index': 0,
'message': {'role': 'assistant',
'content': "Hello there! How's your day going so far? 😊 \n\nIs there anything you'd like to chat about, or need any help with?"},
'logprobs': None,
'finish_reason': 'stop'}],
'usage': {'prompt_tokens': 10, 'completion_tokens': 32, 'total_tokens': 42}}
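The call returns an OpenAI-style completion dict, as shown above. A minimal sketch of pulling the assistant's reply and the token counts out of such a response; the `response` dict here is a hypothetical sample shaped like the output above, standing in for a live `llm.create_chat_completion(...)` call:

```python
# Hypothetical sample mirroring the response structure printed above;
# in the notebook you would instead use the dict returned by
# llm.create_chat_completion(messages=...).
response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello there!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 32, "total_tokens": 42},
}

# The generated text lives at choices[0]["message"]["content"].
reply = response["choices"][0]["message"]["content"]
usage = response["usage"]

print(reply)                  # the assistant's message text
print(usage["total_tokens"])  # prompt tokens + completion tokens
```

Because the dict follows the OpenAI chat-completion schema, the same extraction code works unchanged if you later swap the local model for a hosted API.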