Reasoning/Thinking Models
Retrieval-Augmented Generation (RAG)
- RAG Paper - “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020)
Model Evaluation Benchmarks
- MMLU-Pro on Hugging Face - 12K multiple-choice questions across 14 subjects testing factual knowledge and reasoning
- GPQA on Hugging Face - Graduate/PhD-level scientific reasoning in biology, physics, and chemistry
- SWE-bench - Tests whether AI systems can resolve real-world software engineering tasks drawn from GitHub issues
- HLE (Humanity's Last Exam) on Hugging Face - 2,500 expert-level questions requiring multimodal, multi-step reasoning