Resources

llama.cpp

  • llama.cpp on GitHub - C/C++ library for local LLM inference with a CLI, a Web UI, and an OpenAI-compatible server
  • llama.cpp Bindings - List of community bindings for various languages and platforms
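
Because llama.cpp's server speaks the OpenAI chat completion protocol, it can be queried with nothing but the Python standard library. The sketch below builds such a request; the base URL and port are assumptions (whatever `llama-server -m model.gguf --port 8080` was started with), and actually sending it requires that server to be running.

```python
import json
import urllib.request

def build_chat_request(base_url: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a llama.cpp server.

    `base_url` is an assumption about where llama-server is listening;
    llama.cpp itself fixes only the /v1/chat/completions path.
    """
    payload = {"messages": messages, "temperature": 0.7}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080",
    [{"role": "user", "content": "Say hello in one word."}],
)
# Sending it needs a running server, e.g.:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```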

llama.cpp Wrappers

  • Ollama - Simple CLI wrapper around llama.cpp with a curated model library
  • LM Studio - Desktop GUI for browsing, downloading, and running quantized models from Hugging Face

llama.cpp Bindings

  • llama-cpp-python - Python binding with OpenAI-like API, supporting chat completions, tool calling, and multimodal models
  • LLamaSharp - C# binding for llama.cpp, installable via NuGet with CPU, CUDA, and Vulkan backends
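
As a concrete sense of the llama-cpp-python binding's OpenAI-like API, here is a minimal sketch. It assumes `pip install llama-cpp-python` and a GGUF model file on disk (the path in the example is hypothetical); the import is deferred so the sketch itself loads without the package installed.

```python
def chat_locally(model_path: str, prompt: str) -> str:
    """Run one chat turn against a local GGUF model via llama-cpp-python.

    Assumptions: llama-cpp-python is installed and `model_path` points at
    a real GGUF file; neither is provided by this sketch.
    """
    from llama_cpp import Llama  # deferred so the file imports without the package

    llm = Llama(model_path=model_path, n_ctx=2048)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
    )
    # The response mirrors the OpenAI chat completion schema.
    return result["choices"][0]["message"]["content"]

# Usage (requires a real model file; the path below is illustrative only):
#   print(chat_locally("models/some-model-q4_k_m.gguf", "Hello!"))
```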

Quantization

ML Frameworks

Browser-based Inference

  • Wllama - Run GGUF models in the browser using WebAssembly (CPU with SIMD)
  • WebLLM - Run LLMs in the browser using WebGPU for GPU-accelerated inference

Mixture of Experts (MoE)
