# Benchmarking of Generative AI Tools in Software Engineering Education: Formative Insights for Curriculum Integration

**Authors:** Roy, Horielko, Omojokun
**Year:** 2025
**DOI:** 10.1145/3702653.3744328
**URL:** https://doi.org/10.1145/3702653.3744328

## Abstract

Generative Artificial Intelligence (Gen-AI) has revolutionized software engineering (SE) by automating tasks across design, coding, and testing [1] [2]. Tools like ChatGPT and GitHub Copilot streamline code generation, architectural modeling, debugging, and test-case creation [3] [4]. Despite their rapid adoption in industry, the pedagogical implications of these tools in computing education have not been systematically examined. This study addresses that gap by benchmarking Gen-AI tools across four core SE phases (design documentation, feature implementation, debugging support, and testing) to answer two research questions. RQ1: What strengths and limitations do Gen-AI tools exhibit in each phase? RQ2: How can insights from benchmarking inform effective integration of Gen-AI into SE curricula? To answer these questions, we evaluated a diverse set of Gen-AI tools: design-focused assistants such as Lucidchart, Mermaid.js and UIzard; implementation-oriented systems including GitHub Copilot, TabNine, Codeium and Supermaven; debugging supports like GPT-4 and Claude 3.5 Sonnet; and testing frameworks such as Testim, Mabl and Applitools. We also surveyed emerging platforms (as of summer 2024) like Replit, Postman, Visily, Gemini, Eraser.io and others. For each tool and development phase, we applied phase-specific metrics: in design documentation, we assessed diagram accuracy, completeness, user effort, and IDE integration; in feature implementation, we measured pattern-based code generation quality, code-completion effectiveness, refactoring robustness, and UI/UX scaffolding; in debugging, we evaluated error-detection accuracy, hallucination rates, and clarity of explanatory feedback; and in testing, we examined test-case relevance and defect-detection coverage. Across all phases, we tracked prompt-engineering complexity as a key mediating factor influencing tool performance.

Our evaluation reveals speed-fidelity trade-offs: code-completion assistants accelerate boilerplate generation but demand manual oversight to ensure cross-file consistency and manage higher-order abstractions; diagramming tools can produce precise UML models with minimal effort, but complex cases require iterative prompt refinement; LLM debuggers deliver context-sensitive fixes yet suffer from nontrivial hallucination rates; and testing generators exhibit wide variance in edge-case coverage. On average, tools needed 2.4 prompt iterations for usable diagrams and 1.5 prompts for bug fixes, underscoring the human effort in guiding AI. We recommend a scaffolded framework for integrating Gen-AI into SE education: embedding AI tools into hands-on assignments so that students explore tasks in a controlled context; structuring small team projects in which one subgroup uses AI assistants while the other completes the same tasks manually (covering design, implementation, debugging and testing) to surface contrasts in workflow, tool strengths, and human reasoning; requiring students to maintain a reflective journal documenting their AI usage and prompt-engineering strategies, fostering metacognitive insight into how tool inputs shape outputs; and equipping learners with decision-making criteria for evaluating AI assistants according to task fit, preparing them to use AI responsibly across SE phases as the landscape evolves.
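
The abstract reports mean prompt iterations per phase but does not describe how they were logged or aggregated. As a minimal sketch, assuming each trial is recorded as a (phase, tool, prompt count) tuple, the per-phase average could be computed as follows. The record layout, the tool names, and every number below are hypothetical and are not the study's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial log: (phase, tool, prompts needed until the output was usable).
# These values are invented for illustration; they are NOT the study's data.
trials = [
    ("design", "tool_a", 2),
    ("design", "tool_b", 3),
    ("debugging", "tool_c", 1),
    ("debugging", "tool_d", 3),
]


def mean_prompts_by_phase(records):
    """Return the mean number of prompt iterations for each phase."""
    by_phase = defaultdict(list)
    for phase, _tool, prompts in records:
        by_phase[phase].append(prompts)
    return {phase: mean(counts) for phase, counts in by_phase.items()}


averages = mean_prompts_by_phase(trials)
for phase, avg in sorted(averages.items()):
    print(f"{phase}: {avg:.1f} prompts on average")
```

Grouping by phase rather than by tool mirrors how the abstract states its averages; grouping by tool instead would only require changing the dictionary key.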
