# 2.1 Literature Review

The integration of Generative Artificial Intelligence (GenAI) into software engineering represents a paradigm shift comparable to the introduction of high-level programming languages or integrated development environments (IDEs). This literature review synthesizes current research regarding the adoption, impact, and challenges of GenAI within professional software development workflows. The review is organized into five thematic sections: theoretical frameworks governing human-AI interaction in engineering, the evolution from code completion to autonomous agents, impacts on developer productivity and collaboration, quality assurance mechanisms, and the emerging critical landscape of security and governance. Two further sections identify the research gaps this thesis addresses and summarize the quantitative metrics used to evaluate GenAI tools.

## 2.1.1 Theoretical Frameworks in AI-Augmented Engineering

To understand the impact of GenAI on software development, it is necessary to ground the analysis in established theoretical frameworks that describe the interaction between human cognition and computational tools. The transition from manual coding to AI-assisted development necessitates a re-evaluation of Human-Centered Software Engineering (HCSE).

### 2.1.1.1 Human-Centered Software Engineering (HCSE)
Historically, software engineering models focused primarily on process optimization and architectural integrity. However, Seffah et al. {cite_019} established the foundational importance of HCSE, arguing that software architectures must account for the cognitive patterns and limitations of the humans interacting with them. In the context of GenAI, this framework is resurgent. The cognitive load of a developer is shifting from "synthesizing logic" (writing code) to "evaluating logic" (reviewing AI output).

This shift aligns with recent investigations into the "synthetic pair programmer" phenomenon. As noted in recent empirical studies, the introduction of AI tools alters the collaborative dynamics of teams, effectively placing the AI in the role of a junior developer or peer {cite_004}. The theoretical implication is that the "user" in HCSE is no longer just the end-user of the software product, but the developer themselves, whose user experience (UX) with the AI tool directly dictates software quality.

### 2.1.1.2 Trust and Adoption Models
The successful integration of AI into high-stakes engineering environments depends heavily on trust. Barón {cite_015} proposes an adoption framework specifically designed to foster trust in AI-assisted software engineering (AIASE). This framework suggests that trust is not binary but multidimensional, contingent upon:
1.  **Explainability:** Can the developer understand why the AI suggested a specific pattern?
2.  **Reliability:** Does the tool perform consistently across different contexts?
3.  **Transparency:** Is the provenance of the generated code clear?

Without these theoretical pillars, adoption remains superficial. Developers may use tools for trivial tasks while rejecting them for critical architectural decisions due to a "trust deficit." This aligns with findings by Esposito et al. {cite_014}, who argue for an Evidence-Based Software Engineering (EBSE) approach to GenAI, where adoption is driven not by hype but by empirical validation of the tool's efficacy and safety.

## 2.1.2 The Evolution of Coding Assistants

The technology driving AI-augmented software engineering has evolved rapidly, moving from simple statistical text prediction to complex, context-aware reasoning engines.

### 2.1.2.1 From Autocomplete to Conversational Context
Early iterations of coding assistants relied on N-gram models and simple heuristics. The advent of Large Language Models (LLMs) fundamentally changed this landscape. Reddy Vootukuri {cite_006} highlights the capabilities of tools like GitHub Copilot Chat, which integrate directly into the developer's workflow. Unlike previous tools that required context switching (e.g., searching Stack Overflow), modern assistants maintain the context of the IDE, allowing for "in-flow" information retrieval and code generation.

Arora {cite_008} describes this as a transformation in developer productivity, moving beyond simple syntax completion to semantic understanding. The AI can infer intent from comments, variable names, and project structure, thereby reducing the cognitive friction associated with translating abstract requirements into concrete syntax.

### 2.1.2.2 Agentic Architectures and Autonomy
A significant divergence in the literature exists between "assistants" (which wait for user input) and "agents" (which autonomously pursue goals). Xia et al. {cite_022} present a critical analysis of LLM-based software engineering agents in their work on "Agentless." They distinguish between complex, multi-step agentic frameworks and simpler, more direct LLM interactions. Their findings suggest that while autonomous agents promise to handle complex tasks like "fix this bug" without human intervention, the complexity of managing the agent's state often yields diminishing returns compared to simpler, well-prompted LLM calls.

Conversely, Zhu and Kang {cite_020} introduce "UTBoost," a rigorous evaluation of coding agents on benchmarks like SWE-Bench. Their work demonstrates that for agents to be effective, they require robust execution environments where they can run code, analyze errors, and iterate, a process that mimics the human "trial and error" loop. This defines the current frontier of the field: the transition from AI that *writes* code to AI that *engineers* solutions through iterative testing.
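The run, analyze, and iterate loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any cited work: `propose_candidates` is a hypothetical stand-in for an LLM call and simply returns canned candidates, and `run_tests` is an assumed helper with a two-assertion test suite.

```python
from typing import Iterator, List, Optional, Tuple


def propose_candidates() -> Iterator[str]:
    """Hypothetical stand-in for an LLM; yields candidate sources in order.

    A real agent would condition each new candidate on the previous error.
    """
    yield "def add(a, b):\n    return a - b\n"  # first attempt: wrong operator
    yield "def add(a, b):\n    return a + b\n"  # second attempt: corrected


def run_tests(source: str) -> Optional[str]:
    """Execute a candidate and return an error message, or None on success.

    exec() is used for brevity; a real system would run candidates in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        add = namespace["add"]
        assert add(2, 3) == 5, f"add(2, 3) returned {add(2, 3)}, expected 5"
        assert add(-1, 1) == 0, f"add(-1, 1) returned {add(-1, 1)}, expected 0"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(max_iterations: int = 5) -> Tuple[Optional[str], List[str]]:
    """Try candidates until one passes or the iteration budget is spent."""
    feedback_log: List[str] = []
    for attempt, source in enumerate(propose_candidates(), start=1):
        if attempt > max_iterations:
            break
        error = run_tests(source)
        if error is None:
            return source, feedback_log  # accepted candidate
        feedback_log.append(error)       # feedback a real agent would reuse
    return None, feedback_log


accepted, log = repair_loop()
print(accepted)
print(log)
```

The design point is that acceptance is decided by executing tests rather than by the generator's own confidence; the error log is the signal an actual agent would feed into its next attempt.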

## 2.1.3 Impact on Productivity and Workflow

The primary driver for industry adoption of GenAI is the promise of increased productivity. However, defining and measuring this productivity remains a complex research challenge.

### 2.1.3.1 Quantitative and Objective Measures
Traditional metrics such as Lines of Code (LOC) or commit frequency are insufficient for measuring AI-augmented productivity, as AI can generate high volumes of low-quality code. Brandebusemeyer {cite_017} introduces a novel methodological approach using wearables to measure developer experience and productivity objectively. By tracking physiological signals (e.g., heart rate variability, electrodermal activity), researchers can infer cognitive load and flow states. This represents a significant methodological advance, moving assessment away from self-reported surveys toward biometric data.

Table 1 summarizes different approaches to measuring productivity in the analyzed literature.

| Measurement Approach | Key Metrics | Advantages | Limitations | Source |
|----------------------|-------------|------------|-------------|--------|
| **Biometric/Physiological** | Heart rate variability (HRV), electrodermal activity (EDA), stress levels | Objective, real-time cognitive load data | Privacy concerns, hardware requirements | {cite_017} |
| **Empirical/Output-Based** | Task completion time, Pass rates | Direct correlation to business value | Ignores code maintainability/quality | {cite_020} |
| **Socio-Technical** | Collaboration patterns, mentorship needs | Captures team dynamics | Hard to quantify, subjective | {cite_004} |
| **Perceptual/Survey** | Developer satisfaction, perceived velocity | Easy to collect, captures "happiness" | Subject to bias and placebo effects | {cite_007} |

*Table 1: Comparative Analysis of Productivity Measurement Methodologies in AI-Assisted Engineering.*

Smit et al. {cite_007} analyze GitHub Copilot's impact through the lens of the Software Engineering Body of Knowledge (SWEBOK). Their findings suggest that productivity gains are non-uniform; they are highest in "construction" and "testing" phases but potentially negative in "requirements" and "maintenance" if the AI generates subtle bugs that are difficult to detect.

### 2.1.3.2 Qualitative Shifts in Collaborative Dynamics
The introduction of AI tools fundamentally alters how teams interact. Ulfsnes et al. {cite_004} provide empirical insights showing that GenAI tools act as a "synthetic pair programmer." This has dual implications:
1.  **Reduction in Mentorship:** Senior developers spend less time answering syntax questions for juniors, as the AI handles these queries.
2.  **Isolation Risk:** There is a potential risk of "siloing," where developers interact more with the AI than with their peers, potentially eroding the shared mental model of the system architecture.

Lakshmi et al. {cite_013} argue that this redefinition of software development requires new management strategies. The role of the developer is evolving from a "writer" of code to an "orchestrator" of AI services, necessitating a shift in skills from syntax memorization to system design and prompt engineering.

## 2.1.4 Quality Assurance and Code Review

As the volume of generated code increases, the bottleneck in the software lifecycle shifts to Quality Assurance (QA) and Code Review.

### 2.1.4.1 Automated Pull Request Analysis
One of the most immediate applications of LLMs is in the administrative aspects of code review. Zuo et al. {cite_001} conducted an empirical study on the potential of LLMs to automatically generate Pull Request (PR) titles. Their research indicates that LLMs can summarize code changes with high accuracy, reducing the administrative burden on developers. This is not merely a convenience; accurate PR descriptions are critical for repository maintainability and historical tracking.

Furthermore, Balachandran and Fawzer {cite_040} explore "context-aware code review," where GenAI integrates into the CI/CD pipeline to analyze PRs not just for syntax errors, but for logic flaws and adherence to coding standards. This automated "first pass" allows human reviewers to focus on architectural implications rather than stylistic nits.

### 2.1.4.2 Reliability and Hallucination Risks
Despite the promise of automation, reliability remains a primary concern. Cihan et al. {cite_041} discuss automated code review in practice, highlighting that while tools like Qodo and GitHub Copilot can suggest improvements, they suffer from "hallucinations," in which the model confidently states incorrect information.

The risk is amplified when the AI is used to generate test cases. If an AI generates both the code and the test case, it may introduce a "tautological error" where the test passes because it asserts the incorrect logic implemented in the code. Ali and Yue {cite_031}, in their formalization of ISO/IEC/IEEE 29119, emphasize that testing standards must be rigorous. The introduction of AI-generated tests requires a higher standard of validation, effectively "testing the tester."
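A small, constructed example makes the "tautological error" concrete. The function and both tests below are hypothetical and written only for illustration: the implementation contains a deliberate leap-year bug, the first test takes its expected value from the implementation's own behaviour, and the second takes it from an independent oracle (Python's standard `calendar` module).

```python
import calendar


def days_in_february(year: int) -> int:
    # Deliberately buggy: ignores the century rule (1900 was not a leap year).
    return 29 if year % 4 == 0 else 28


def tautological_test() -> bool:
    # The expected value mirrors the implementation's own output,
    # so the test encodes the same mistake and passes.
    return days_in_february(1900) == 29


def independent_test() -> bool:
    # The expected value comes from an independent oracle.
    expected = calendar.monthrange(1900, 2)[1]
    return days_in_february(1900) == expected


print("tautological test passes:", tautological_test())
print("independent test passes:", independent_test())
```

The first test passes and the second fails, which is the sense in which validating AI-generated tests requires an oracle that does not share the generated code's assumptions.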

## 2.1.5 Security, Governance, and Supply Chain Implications

The widespread use of GenAI introduces novel attack vectors and compliance challenges, necessitating a robust governance framework.

### 2.1.5.1 Vulnerabilities in AI-Generated Code
A critical emerging threat is the contamination of the knowledge base used by developers. Swaraj et al. {cite_009} investigate "adversarial prompted AI-generated code" on platforms like Stack Overflow. Their benchmark dataset reveals that malicious actors can manipulate AI models (or the prompts fed to them) to generate code that looks functional but contains hidden vulnerabilities. This "poisoning" of the developer ecosystem is a significant risk, as developers often trust highly-rated solutions implicitly.

### 2.1.5.2 Regulatory Standards and Compliance
To mitigate these risks, the industry is turning to formal standards. The ISO/IEC 42001:2023 standard has emerged as a central framework for AI management systems. Seet {cite_032} and Biroğul et al. {cite_033} explore the legal and organizational impacts of this standard. ISO 42001 mandates:
*   **Risk Assessment:** Continuous evaluation of AI models for bias and safety.
*   **Accountability:** Clear lines of responsibility for AI-generated decisions.
*   **Transparency:** Documentation of model training data and limitations.

In the context of the software supply chain, Shukla {cite_034} discusses AI-driven Software Bill of Materials (SBOM) management. As software becomes a composite of human-written, open-source, and AI-generated code, tracking the provenance of every component becomes nearly impossible without automated tools. However, AI can also be the solution; Shukla proposes using AI to automatically generate and maintain SBOMs, ensuring compliance with security standards.

Table 2 outlines the security challenges and corresponding mitigation strategies identified in the literature.

| Security Domain | Identified Threat | Mitigation Strategy | Source |
|-----------------|-------------------|---------------------|--------------------|
| **Code Integrity** | Adversarial prompting, vulnerable code generation | Enhanced detection benchmarks, human-in-the-loop review | {cite_009} |
| **Supply Chain** | Opaque dependencies, lack of provenance | AI-driven SBOM generation, Blockchain reproducibility | {cite_034}, {cite_037} |
| **Compliance** | Lack of accountability, legal liability | ISO 42001 implementation, AI Management Systems | {cite_032}, {cite_033} |
| **Data Privacy** | Leaking proprietary code to public models | Localized model deployment, Privacy-preserving architectures | {cite_025} |

*Table 2: Security Threats and Governance Frameworks in AI-Augmented Software Engineering.*

Syed {cite_036} and Aideyan et al. {cite_037} further extend this to critical systems, such as automotive software. Aideyan et al. propose a blockchain-reproducible build approach to secure the supply chain, which is particularly relevant when AI tools generate code that is deployed via Over-The-Air (OTA) updates to vehicles.

## 2.1.6 Research Gaps

While the literature is expanding rapidly, significant gaps remain that this thesis aims to address.

**1. Longitudinal Impact on Skill Acquisition:**
Most studies, such as those by Zuo et al. {cite_001} and Brandebusemeyer {cite_017}, focus on immediate productivity or task completion. There is a paucity of longitudinal research on how reliance on GenAI affects the skill acquisition of junior developers over time. If the AI handles the "struggle" of learning, does deep expertise develop?

**2. Socio-Technical Nuance in Enterprise Environments:**
While Ulfsnes et al. {cite_004} touch on collaboration, there is limited deep ethnographic work on how GenAI changes the *culture* of large enterprise software teams. Specifically, how does it affect the psychological safety of code reviews?

**3. Integration of Design and Engineering:**
Wang {cite_028} discusses generative AI in the context of CMF (Color, Material, Finish) design for smart cabins. However, the intersection of *software* design (architecture) and GenAI is under-explored. Most literature focuses on the implementation phase (coding) rather than the architectural design phase.

**4. The "Agentic" Gap:**
As noted by Xia et al. {cite_022}, there is a disconnect between the promise of autonomous agents and their practical reliability. Research is needed to bridge the gap between "demo-ware" agents and production-ready engineering bots that can be trusted with write-access to repositories.

By synthesizing these diverse streams of research, from biometric productivity tracking to formal ISO standards, this review establishes the complexity of the current landscape. The integration of GenAI is not merely a tool upgrade; it is a systemic transformation of the engineering discipline, requiring new theories, new metrics, and new governance models.

## 2.1.7 Mathematical and Methodological Considerations in Evaluation

To rigorously evaluate the performance of GenAI in software engineering, researchers have moved beyond qualitative assessments to incorporate specific mathematical metrics. This is particularly evident in studies benchmarking code generation and detection.

### 2.1.7.1 Evaluation Metrics for Code Generation
In evaluating the efficacy of coding agents, Zhu and Kang {cite_020} utilize the SWE-Bench framework. A critical metric in this domain is the **Pass@k** metric, which estimates the probability that at least one of $k$ samples drawn from the code generated for a problem passes the unit tests.

For a single problem, Pass@k is estimated as:

$$\text{Pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Where:
*   $n$ is the total number of samples generated for the problem, with $n \geq k$.
*   $c$ is the number of correct samples (those that pass all tests).
*   $k$ is the number of samples drawn for evaluation.

The fraction is the probability that a random draw of $k$ of the $n$ samples contains no correct sample, so its complement is the probability of at least one success.

This metric is crucial because LLMs are probabilistic; a single generation may be flawed, but generating multiple variations often yields a correct solution. Understanding this probability distribution is essential for integrating AI into automated pipelines where human verification of every sample is not feasible.
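The formula above translates directly into code. The following is a minimal sketch for a single problem; the function name `pass_at_k` and the input validation are choices made here for illustration, not taken from the cited work.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: total samples generated, c: samples that pass all tests,
    k: samples drawn for evaluation (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Worked example: 10 samples, 3 of them correct.
print(pass_at_k(10, 3, 1))  # 1 - 7/10
print(pass_at_k(10, 3, 5))  # 1 - 21/252
```

With 3 correct samples out of 10, the estimate rises from 0.3 at $k = 1$ to roughly 0.917 at $k = 5$, which illustrates why sampling several candidates can matter more than the quality of any single generation.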

### 2.1.7.2 Metrics for Detecting AI-Generated Code
In the domain of security and academic integrity, distinguishing between human-written and AI-generated code is paramount. Swaraj et al. {cite_009} employ standard classification metrics to evaluate their detection approaches. The **F1-Score**, the harmonic mean of precision and recall, is a common choice for these imbalanced datasets because it does not depend on the number of true negatives and is therefore not inflated by a dominant negative class:

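$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Where:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$

Here $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.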
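As a worked illustration of these definitions, the sketch below computes all three quantities from confusion-matrix counts. The function name and the example counts are invented for illustration and are not results from the cited study; returning 0.0 for an undefined ratio is a convention assumed here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

For these counts, precision is 0.800, recall is 0.667, and F1 is 0.727; because the harmonic mean is pulled toward the lower of the two values, a detector cannot reach a high F1 by maximizing precision or recall alone.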

Swaraj et al. demonstrate that as AI models improve, the distribution of features in generated code converges with human code, causing the F1-scores of traditional detectors to degrade. This necessitates the development of more sophisticated, feature-rich detection algorithms that analyze not just syntax, but the semantic structure and "perplexity" of the code.

The inclusion of these mathematical frameworks in the literature underscores the field's maturation from exploratory qualitative studies to rigorous quantitative science. It highlights that "productivity" and "quality" in the AI era are not vague sentiments but quantifiable variables that must be measured against probabilistic baselines.

This review of the literature confirms that while the capabilities of GenAI in software engineering are immense, they are matched by significant challenges in verification, security, and human factors. The subsequent sections of this thesis will build upon these findings, specifically investigating the identified gap in longitudinal skill acquisition and enterprise workflow integration.