
OpenDraft AI

AI-Generated Draft

50 min read
Example

Example Draft: Quantization of Large Language Models for Integer-Only Hardware

This is a thesis-level research draft generated by OpenDraft.

Generated in 51.1 minutes with 45+ verified academic citations from CrossRef, Semantic Scholar, and IEEE. Zero hallucinated references.

Generation Statistics

  • 51.1 minutes generation time
  • 45+ citations
  • ~24K words
  • 286 KB PDF

Download Draft

Thesis Topic

Quantization of Large Language Models for Integer-Only Hardware

This master's thesis provides a comprehensive analysis of quantization techniques for deploying Large Language Models (LLMs) on integer-only hardware accelerators. The research examines INT8, INT4, and mixed-precision approaches, evaluating accuracy-efficiency tradeoffs across different model architectures.

Abstract

The deployment of Large Language Models on edge devices and specialized hardware accelerators presents significant challenges due to the computational and memory requirements of full-precision inference. This thesis investigates quantization as a model compression technique that enables efficient LLM deployment on integer-only hardware without floating-point units.

We systematically evaluate post-training quantization (PTQ) and quantization-aware training (QAT) approaches, comparing their effectiveness across transformer architectures of varying scales. Our analysis reveals that while INT8 quantization preserves model accuracy with less than 1% degradation, INT4 quantization requires careful calibration and often benefits from mixed-precision strategies for attention mechanisms.
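
To make the PTQ setting above concrete, here is a minimal sketch of symmetric per-channel INT8 weight quantization in NumPy. The function names, tensor shape, and round-trip error check are illustrative assumptions, not the implementation used in the draft.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-channel INT8 post-training quantization.

    Each output channel (row) gets its own scale so that the
    largest-magnitude weight in that channel maps to 127.
    """
    # Per-channel maximum absolute value; floor avoids division by zero.
    max_abs = np.maximum(np.abs(weights).max(axis=1, keepdims=True), 1e-8)
    scale = max_abs / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float tensor for accuracy evaluation."""
    return q.astype(np.float32) * scale

# Example: measure the round-trip error on a random weight matrix.
w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean absolute quantization error: {err:.6f}")
```

In an integer-only deployment the dequantize step is folded into the integer matmul's output rescaling rather than materializing float weights; the round trip here only serves to estimate the quantization error.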

Key Research Questions

  • RQ1: What is the minimum bit-width that preserves acceptable accuracy for different LLM tasks?
  • RQ2: How do PTQ and QAT compare in terms of accuracy recovery for integer-only deployment?
  • RQ3: What are the optimal mixed-precision configurations for attention vs. feed-forward layers?
  • RQ4: How does quantization affect different model sizes (7B, 13B, 70B parameters)?

Key Contributions

  • Comprehensive benchmark of quantization methods across multiple LLM architectures
  • Analysis of hardware-aware quantization for ARM, NVIDIA, and custom NPU targets
  • Practical guidelines for INT4/INT8 deployment with accuracy guarantees
  • Novel mixed-precision strategy optimizing the accuracy-latency Pareto frontier (see the configuration sketch after this list)
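
A mixed-precision strategy of this kind can be pictured as a per-layer bit-width plan. The sketch below is a hypothetical assignment that keeps attention projections at INT8 and pushes feed-forward weights to INT4; the layer-name suffixes follow common transformer naming conventions and the plan itself is an assumption for illustration, not the configuration reported in the draft.

```python
# Hypothetical per-layer bit-width plan: attention projections at INT8,
# feed-forward (MLP) weights at INT4. Layer-name patterns are illustrative,
# not tied to a specific framework.
MIXED_PRECISION_PLAN = {
    "self_attn.q_proj": 8,
    "self_attn.k_proj": 8,
    "self_attn.v_proj": 8,
    "self_attn.o_proj": 8,
    "mlp.gate_proj": 4,
    "mlp.up_proj": 4,
    "mlp.down_proj": 4,
}

def bits_for(layer_name: str, default: int = 8) -> int:
    """Return the planned bit-width for a layer by suffix match."""
    for pattern, bits in MIXED_PRECISION_PLAN.items():
        if layer_name.endswith(pattern):
            return bits
    return default

# Example: decide the precision for one attention and one MLP weight.
print(bits_for("model.layers.0.self_attn.q_proj"))  # 8
print(bits_for("model.layers.0.mlp.down_proj"))     # 4
```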

Want to generate your own thesis?

OpenDraft can generate research-quality academic drafts with verified citations in under 30 minutes.

Get started free
