Resources

llama.cpp

  • llama.cpp on GitHub - C/C++ library for local LLM inference with a CLI, a Web UI, and an OpenAI-compatible server
  • llama.cpp Bindings - List of community bindings for various languages and platforms
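
Because llama.cpp's server speaks the OpenAI chat completion protocol, it can be queried with nothing but the Python standard library. The sketch below builds such a request; the base URL and port are assumptions (whatever `llama-server -m model.gguf --port 8080` was started with), and actually sending it requires that server to be running.

```python
import json
import urllib.request

def build_chat_request(base_url: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a llama.cpp server.

    `base_url` is an assumption about where llama-server is listening;
    llama.cpp itself fixes only the /v1/chat/completions path.
    """
    payload = {"messages": messages, "temperature": 0.7}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080",
    [{"role": "user", "content": "Say hello in one word."}],
)
# Sending it needs a running server, e.g.:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```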

llama.cpp Wrappers

  • Ollama - Simple CLI wrapper around llama.cpp with a curated model library
  • LM Studio - Desktop GUI for browsing, downloading, and running quantized models from Hugging Face

llama.cpp Bindings

  • llama-cpp-python - Python binding with OpenAI-like API, supporting chat completions, tool calling, and multimodal models
  • LLamaSharp - C# binding for llama.cpp, installable via NuGet with CPU, CUDA, and Vulkan backends
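
As a concrete sense of the llama-cpp-python binding's OpenAI-like API, here is a minimal sketch. It assumes `pip install llama-cpp-python` and a GGUF model file on disk (the path in the example is hypothetical); the import is deferred so the sketch itself loads without the package installed.

```python
def chat_locally(model_path: str, prompt: str) -> str:
    """Run one chat turn against a local GGUF model via llama-cpp-python.

    Assumptions: llama-cpp-python is installed and `model_path` points at
    a real GGUF file; neither is provided by this sketch.
    """
    from llama_cpp import Llama  # deferred so the file imports without the package

    llm = Llama(model_path=model_path, n_ctx=2048)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
    )
    # The response mirrors the OpenAI chat completion schema.
    return result["choices"][0]["message"]["content"]

# Usage (requires a real model file; the path below is illustrative only):
#   print(chat_locally("models/some-model-q4_k_m.gguf", "Hello!"))
```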

Quantization

ML Frameworks

Browser-based Inference

  • Wllama - Run GGUF models in the browser using WebAssembly (CPU with SIMD)
  • WebLLM - Run LLMs in the browser using WebGPU for GPU-accelerated inference

Mixture of Experts (MoE)
