OpenDraft AI
Example Draft: Quantization of Large Language Models for Integer-Only Hardware
This is a thesis-level research draft generated by OpenDraft.
Generated in 51.1 minutes with 45+ verified academic citations from CrossRef, Semantic Scholar, and IEEE. Zero hallucinated references.
Thesis Topic
Quantization of Large Language Models for Integer-Only Hardware
This master's thesis provides a comprehensive analysis of quantization techniques for deploying Large Language Models (LLMs) on integer-only hardware accelerators. The research examines INT8, INT4, and mixed-precision approaches, evaluating accuracy-efficiency tradeoffs across different model architectures.
Abstract
The deployment of Large Language Models on edge devices and specialized hardware accelerators presents significant challenges due to the computational and memory requirements of full-precision inference. This thesis investigates quantization as a model compression technique that enables efficient LLM deployment on integer-only hardware without floating-point units.
We systematically evaluate post-training quantization (PTQ) and quantization-aware training (QAT) approaches, comparing their effectiveness across transformer architectures of varying scales. Our analysis reveals that while INT8 quantization keeps accuracy degradation within 1%, INT4 quantization requires careful calibration and often benefits from mixed-precision strategies for attention mechanisms.
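The core PTQ operation the abstract refers to can be sketched as symmetric per-tensor quantization: pick one scale from the tensor's largest magnitude, round to the integer grid, and dequantize for comparison. This is a minimal illustration, not the thesis's exact calibration method; the function names and the random stand-in weights are assumptions for the example.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor PTQ: one scale maps the largest magnitude to +/-127."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 grid."""
    return q.astype(np.float32) * scale

# Illustrative check on random Gaussian weights (not real LLM weights)
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
mean_err = float(np.abs(dequantize(q, s) - w).mean())
```

Because each element is rounded to the nearest grid point, the per-element reconstruction error is bounded by half the scale, which is why INT8 (a fine grid) loses so little accuracy.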
Key Research Questions
- RQ1: What is the minimum bit-width that preserves acceptable accuracy for different LLM tasks?
- RQ2: How do PTQ and QAT compare in terms of accuracy recovery for integer-only deployment?
- RQ3: What are the optimal mixed-precision configurations for attention vs. feed-forward layers?
- RQ4: How does quantization affect different model sizes (7B, 13B, 70B parameters)?
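RQ1's bit-width sweep can be illustrated with a "fake quantization" round trip: simulate a b-bit symmetric grid and measure how reconstruction error grows as bits shrink. The helper name, per-tensor scaling, and synthetic weights below are illustrative assumptions, not the evaluation protocol used in the thesis.

```python
import numpy as np

def fake_quantize(x, bits):
    """Round-trip x through a symmetric b-bit integer grid (per-tensor scale)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4, 1 for INT2
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(512, 512)).astype(np.float32)

# Mean absolute reconstruction error at each bit-width
errors = {b: float(np.abs(fake_quantize(w, b) - w).mean()) for b in (8, 4, 2)}
```

The sweep makes the accuracy-efficiency tradeoff concrete: halving the bit-width coarsens the grid, so the error at 4 bits is markedly larger than at 8, matching the abstract's finding that INT4 needs extra care.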
Key Contributions
- Comprehensive benchmark of quantization methods across multiple LLM architectures
- Analysis of hardware-aware quantization for ARM, NVIDIA, and custom NPU targets
- Practical guidelines for INT4/INT8 deployment with accuracy guarantees
- Novel mixed-precision strategy optimizing accuracy-latency Pareto frontier
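A mixed-precision configuration of the kind the last bullet describes could be expressed as a per-layer bit-width policy, for example keeping attention at INT8 while pushing feed-forward blocks to INT4. The layer-naming scheme and the specific bit assignments here are hypothetical, chosen only to mirror the abstract's observation that attention benefits from higher precision.

```python
def assign_bitwidths(layer_names, attn_bits=8, ffn_bits=4):
    """Hypothetical policy: attention layers at attn_bits, all others at ffn_bits."""
    return {name: (attn_bits if "attn" in name else ffn_bits)
            for name in layer_names}

# Illustrative layer names for one transformer block (not a real model's names)
layers = ["layer0.attn.qkv", "layer0.attn.out", "layer0.ffn.up", "layer0.ffn.down"]
config = assign_bitwidths(layers)
```

A real strategy would search this configuration space against a latency model to trace the accuracy-latency Pareto frontier, but the policy dictionary is the artifact such a search produces.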
Want to generate your own thesis?
OpenDraft can generate research-quality academic drafts with verified citations in under 30 minutes.
Get started free