## 2.2 Methodology

This section details the methodological approach used to investigate the socio-technical impact of Generative AI (GenAI) on professional software development workflows. Because practice in this domain often outpaces academic publication cycles, this thesis adopts a **narrative review** framework. This approach supports a synthesis of diverse evidence sources, ranging from empirical studies and technical benchmarks to industry white papers and emerging standards, to build an integrated picture of the current state of the art.

The following subsections outline the research design, data collection strategies, and analytical frameworks used to evaluate the selected literature. This section also analyzes the methodological diversity within the primary sources themselves, categorizing how the field currently measures productivity, quality, and human factors in AI-augmented software engineering.

### 2.2.1 Research Design and Review Strategy

The primary objective of this research is to move beyond simple performance metrics of Large Language Models (LLMs) and investigate their integration into complex human workflows. To achieve this, a narrative review design was selected over a systematic review (e.g., one reported under the PRISMA guidelines) for two reasons: the available literature is heterogeneous, and the review must include non-traditional sources, such as industry standards (ISO/IEC) and technical reports, which are pivotal in this domain.

#### 2.2.1.1 Search Strategy and Data Collection
Academic sources were identified through targeted searches of major digital libraries, including IEEE Xplore, ACM Digital Library, SpringerLink, and arXiv. The search strategy prioritized recent publications (2023–2025) to capture the impact of modern LLMs and LLM-based tools (e.g., GPT-4, GitHub Copilot), though seminal works on human-centered software engineering were included to provide theoretical grounding.

The search process combined keywords across four dimensions: technology (Generative AI, LLMs), domain (Software Engineering, DevOps), outcome (Productivity, Workflow, Security), and governance (ISO 42001, SBOM, Compliance).

| Dimension | Key Search Terms | Rationale |
|-----------|------------------|-----------|
| Technology | Generative AI, LLM, Copilot, Agents | Captures specific tools and general models |
| Domain | Software Engineering, Code Review, CI/CD | Focuses on professional workflows |
| Outcome | Productivity, Developer Experience, Trust | Addresses socio-technical impacts |
| Governance | ISO 42001, SBOM, Compliance | Addresses regulatory frameworks |

*Table 1: Search Strategy Dimensions and Keywords. The selection focused on the intersection of these dimensions to ensure relevance.*
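The way the terms in Table 1 combine into a search string can be sketched as follows. This is a minimal illustration, assuming terms are joined with OR inside a dimension and AND across dimensions. The helper name `build_query` is hypothetical, and each digital library has its own query syntax, so the output would need adapting per database. Treating governance as a separate query rather than a fourth AND group is also an assumption.

```python
# Terms copied from Table 1; the grouping logic below is an assumption.
SEARCH_DIMENSIONS = {
    "technology": ["Generative AI", "LLM", "Copilot", "Agents"],
    "domain": ["Software Engineering", "Code Review", "CI/CD"],
    "outcome": ["Productivity", "Developer Experience", "Trust"],
    "governance": ["ISO 42001", "SBOM", "Compliance"],
}


def build_query(dimension_names: list[str]) -> str:
    """OR the terms within each dimension, then AND the dimensions together."""
    groups = []
    for name in dimension_names:
        terms = " OR ".join(f'"{term}"' for term in SEARCH_DIMENSIONS[name])
        groups.append(f"({terms})")
    return " AND ".join(groups)


# Core query over the technology, domain, and outcome dimensions.
core_query = build_query(["technology", "domain", "outcome"])
print(core_query)

# Governance terms run as a separate, hypothetical follow-up query.
governance_query = build_query(["governance"])
print(governance_query)
```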

#### 2.2.1.2 Inclusion and Exclusion Criteria
Sources were selected based on their contribution to understanding the *application* of AI in professional settings rather than the *architecture* of the models themselves.

**Inclusion Criteria:**
*   Peer-reviewed conference papers and journal articles focusing on AI in software engineering (AI4SE).
*   Empirical studies involving human developers or real-world repositories.
*   Technical reports on emerging standards (e.g., ISO/IEC 42001).
*   Studies addressing the "Reviewer Bottleneck" or code quality verification.

**Exclusion Criteria:**
*   Papers solely focused on model architecture improvements without workflow context.
*   Studies predating the transformer era (pre-2017) unless used for historical comparison.
*   Purely theoretical papers lacking empirical or case-study grounding.
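The exclusion criteria can be read as a simple screening predicate. The sketch below is illustrative only: the `Source` fields are hypothetical flags that a reviewer would set by hand after reading each paper, and the code does not replace that judgment.

```python
from dataclasses import dataclass


@dataclass
class Source:
    title: str
    year: int
    has_workflow_context: bool      # discusses use in professional workflows
    has_empirical_grounding: bool   # empirical data or a case study
    historical_comparison: bool = False  # retained only for historical contrast


def passes_screening(src: Source) -> bool:
    """Apply the three exclusion criteria in order; True means the source is kept."""
    if src.year < 2017 and not src.historical_comparison:
        return False  # predates the transformer era
    if not src.has_workflow_context:
        return False  # architecture-only paper
    if not src.has_empirical_grounding:
        return False  # purely theoretical
    return True


# Hypothetical candidates, not actual entries from the reviewed corpus.
candidates = [
    Source("Workplace study of an AI assistant", 2024, True, True),
    Source("Attention variant without workflow context", 2023, False, True),
    Source("Early code completion study", 2014, True, True, historical_comparison=True),
]
kept = [s.title for s in candidates if passes_screening(s)]
print(kept)
```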

### 2.2.2 Methodological Frameworks in Analyzed Literature

To understand the validity of the findings presented in the subsequent Analysis chapter, it is essential to critique the methodologies employed by the primary sources. The literature on AI-augmented software engineering currently utilizes three distinct methodological frameworks: quantitative repository mining, qualitative human-centric studies, and experimental benchmarking.

#### 2.2.2.1 Quantitative Repository Mining
A significant portion of the analyzed literature employs repository mining techniques to assess the impact of AI tools on codebases. Researchers utilizing this method extract data from platforms like GitHub or GitLab to measure objective changes in development velocity and code characteristics.

For instance, studies such as those by Zuo et al. {cite_001} utilize historical data from pull requests (PRs) to evaluate the efficacy of AI in automating administrative tasks like PR title generation. The methodological strength of this approach lies in its ecological validity—it analyzes actual artifacts produced during professional development. Key metrics typically extracted in these studies include:

*   **Cycle Time:** The duration from the first commit to PR merge.
*   **Code Churn:** The volume of code added, modified, or deleted.
*   **Acceptance Rate:** The percentage of AI-generated suggestions accepted by human developers.
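The three metrics above can be computed from pull request records along the following lines. This is a sketch under stated assumptions: the `PullRequest` fields are hypothetical, the values are invented for illustration, and real mining studies would obtain them from a platform API or from tool telemetry. Suggestion counts in particular are usually not stored in the repository itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PullRequest:
    first_commit_at: datetime
    merged_at: datetime
    lines_added: int
    lines_modified: int
    lines_deleted: int
    suggestions_shown: int      # AI suggestions offered (assumed telemetry field)
    suggestions_accepted: int   # AI suggestions the developer kept


def cycle_time_hours(pr: PullRequest) -> float:
    """Duration from first commit to merge, in hours."""
    return (pr.merged_at - pr.first_commit_at).total_seconds() / 3600


def code_churn(pr: PullRequest) -> int:
    """Lines added + modified + deleted."""
    return pr.lines_added + pr.lines_modified + pr.lines_deleted


def acceptance_rate(prs: list[PullRequest]) -> Optional[float]:
    """Share of shown suggestions that were accepted; None if none were shown."""
    shown = sum(pr.suggestions_shown for pr in prs)
    if shown == 0:
        return None
    return sum(pr.suggestions_accepted for pr in prs) / shown


# Illustrative record, not data from any cited study.
pr = PullRequest(
    first_commit_at=datetime(2024, 3, 1, 9, 0),
    merged_at=datetime(2024, 3, 2, 15, 0),
    lines_added=120, lines_modified=30, lines_deleted=50,
    suggestions_shown=40, suggestions_accepted=12,
)
print(cycle_time_hours(pr), code_churn(pr), acceptance_rate([pr]))
```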

However, a limitation identified in these methodologies is the difficulty in distinguishing between AI-generated and human-written code without explicit metadata. As noted by Swaraj et al. {cite_009}, as models improve, the statistical distribution of AI-generated code features converges with that of human code, making detection—and thus attribution of "productivity"—increasingly difficult.

#### 2.2.2.2 Qualitative and Human-Centric Approaches
To address the "socio" aspect of socio-technical systems, researchers employ qualitative methods including surveys, interviews, and observational studies. This approach is critical for capturing the "developer experience" (DevEx) and cognitive load, which quantitative metrics often miss.

Ulfsnes et al. {cite_004} and Smit et al. {cite_007} utilize these methods to explore how developers perceive the utility of tools like GitHub Copilot. Their methodologies often involve:
1.  **Semi-structured Interviews:** Allowing developers to articulate trust issues and workflow friction.
2.  **Likert-Scale Surveys:** Quantifying perceived productivity versus actual output.
3.  **Thematic Analysis:** Coding interview transcripts to identify recurring friction points, such as the "Reviewer Bottleneck."
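As a small illustration of the survey step, Likert responses are ordinal, so they are commonly summarized by their median and distribution rather than by a mean. The sketch below assumes a 1 to 5 scale and uses invented responses; the function name `summarize_likert` is hypothetical.

```python
from collections import Counter
from statistics import median


def summarize_likert(responses: list[int], low: int = 1, high: int = 5) -> dict:
    """Summarize ordinal ratings by median, distribution, and share of agreement."""
    counts = Counter(responses)
    distribution = {level: counts.get(level, 0) for level in range(low, high + 1)}
    agree = sum(1 for r in responses if r >= high - 1)  # top two levels
    return {
        "n": len(responses),
        "median": median(responses),
        "distribution": distribution,
        "share_agree": agree / len(responses),
    }


# Hypothetical answers to "The AI assistant makes me more productive".
perceived = [4, 5, 3, 4, 2, 5, 4]
summary = summarize_likert(perceived)
print(summary)
```

A summary of this kind could then be set against an output measure such as cycle time to examine the gap between perceived and actual productivity.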

Brandebusemeyer {cite_017} advances this methodology by proposing the use of physiological sensors (wearables) to measure developer stress and focus objectively. This represents a methodological shift from self-reported surveys to biometric data, offering a potential solution to the subjectivity bias inherent in traditional qualitative research.

#### 2.2.2.3 Experimental Benchmarking and Agent Evaluation
The third dominant methodology involves controlled experiments where AI agents are tasked with solving specific software engineering problems. This is distinct from general LLM benchmarking as it focuses on domain-specific tasks.

Zhu and Kang {cite_020} and Xia et al. {cite_022} exemplify this approach through the use of benchmarks like SWE-Bench. Their methodology involves:
*   **Task Definition:** Selecting real-world GitHub issues (bug reports or feature requests).
*   **System Deployment:** Running an automated repair system to generate patches. This may be an autonomous agent or a fixed, non-agentic pipeline such as Agentless.
*   **Validation:** Executing test suites to verify if the patch resolves the issue without regression.
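The validation step can be sketched as a check over test outcomes recorded after the patch is applied. The example loosely follows the fail-to-pass and pass-to-pass split used by SWE-bench-style harnesses. The function names and test identifiers are hypothetical, and the test results are supplied as a plain dictionary instead of being produced by actually running a test suite.

```python
def is_resolved(
    results_after_patch: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """A task counts as resolved only if the issue is fixed without regressions."""
    # Tests that failed before the patch must now pass (the issue is fixed).
    fixes_issue = all(results_after_patch.get(t, False) for t in fail_to_pass)
    # Tests that passed before the patch must still pass (no regression).
    no_regression = all(results_after_patch.get(t, False) for t in pass_to_pass)
    return fixes_issue and no_regression


def resolve_rate(outcomes: list[bool]) -> float:
    """Fraction of benchmark tasks resolved."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Illustrative outcomes for two hypothetical tasks.
task_a = is_resolved(
    {"test_bug": True, "test_existing": True},
    fail_to_pass=["test_bug"], pass_to_pass=["test_existing"],
)
task_b = is_resolved(
    {"test_bug": True, "test_existing": False},  # patch broke an existing test
    fail_to_pass=["test_bug"], pass_to_pass=["test_existing"],
)
print(task_a, task_b, resolve_rate([task_a, task_b]))
```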

This experimental framework allows for rigorous reproducibility but often lacks the complexity of enterprise environments where requirements are ambiguous and human stakeholders are involved.

### 2.2.3 Analytical Metrics and Mathematical Models

A critical component of the methodology is defining the metrics used to evaluate AI performance and impact. The literature has moved beyond simple "accuracy" toward more nuanced probabilistic and productivity-based metrics.

#### 2.2.3.1 Performance and Detection Metrics
In the domain of security and academic integrity, distinguishing between human and AI code is a primary methodological challenge. Swaraj et al. {cite_009} employ standard classification metrics to evaluate detection approaches. Given the class imbalance often present in these datasets (where AI code might be a minority or majority depending on the context), the **F1-Score** is preferred over simple accuracy.

The F1-Score is defined as the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Where Precision and Recall are calculated based on True Positives (TP), False Positives (FP), and False Negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$
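A short worked example shows why accuracy can mislead on imbalanced data. The confusion-matrix counts below are invented for illustration and do not come from any cited study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical detector on 1,000 snippets, 50 of which are AI-generated.
tp, fp, fn, tn = 30, 10, 20, 940
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.97, which looks strong, while the F1-Score of roughly 0.67 reflects that 20 of the 50 AI-generated snippets were missed.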

Methodologically, the degradation of these metrics over time serves as an indicator of increasing model sophistication. As generative models improve, the "perplexity" gap between human and machine text narrows, necessitating more complex detection methodologies.

#### 2.2.3.2 Pass@k and Probabilistic Correctness
For evaluating code generation capabilities, the literature frequently employs the **Pass@k** metric. Unlike traditional software testing where a function either passes or fails, generative AI involves probabilistic outputs.

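The Pass@k metric estimates the probability that at least one of $k$ generated samples is correct. In practice, $n \geq k$ samples are generated per problem, the $c$ correct ones are counted, and the per-problem value is estimated as:

$$\text{Pass@}k := 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Where:
*   $n$ is the total number of samples generated for a problem.
*   $c$ is the number of correct samples among $n$ (typically those passing the problem's tests).
*   $k$ is the number of samples considered, with $k \leq n$. If $n - c < k$, every draw of $k$ samples contains a correct one and the value is 1.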
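The formula translates directly into code. The sketch below uses `math.comb` from the Python standard library; the function name `pass_at_k` and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every subset of size k contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)   # 1 - 7/10
p5 = pass_at_k(n=10, c=3, k=5)   # 1 - 21/252
print(round(p1, 3), round(p5, 3))

# A benchmark-level score is the mean over problems (illustrative counts).
per_problem = [pass_at_k(10, c, 1) for c in (3, 0, 10)]
print(sum(per_problem) / len(per_problem))
```

With 3 correct samples out of 10, the chance of success rises from 0.3 at $k = 1$ to about 0.917 at $k = 5$, at the cost of five candidates to inspect.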

This metric is methodologically significant for this thesis because it quantifies the "human-in-the-loop" requirement. If $k$ must be large to ensure a correct solution, the cognitive load on the human reviewer increases, directly contributing to the workflow bottlenecks identified in the literature review.

| Metric Category | Specific Metric | Application in Literature | Formula/Definition |
|-----------------|-----------------|---------------------------|--------------------|
| **Correctness** | Pass@k | Benchmarking code generation | $1 - \binom{n-c}{k}/\binom{n}{k}$ |
| **Detection** | F1-Score | Identifying AI-generated code | Harmonic mean of Precision/Recall |
| **Productivity**| Cycle Time | Workflow analysis | $T_{merge} - T_{first\_commit}$ |
| **Quality** | Code Churn | Maintenance studies | Lines added + modified + deleted |
| **Reliability** | Hallucination Rate| Safety evaluation | % of outputs with factual errors |

*Table 2: Summary of Analytical Metrics. This table categorizes the mathematical and operational definitions used across the reviewed studies {cite_001}{cite_009}{cite_020}.*

### 2.2.4 Evaluation of Governance and Compliance Frameworks

A unique aspect of this methodology is the inclusion of regulatory and governance frameworks as objects of analysis. As AI tools integrate into the software supply chain, compliance with standards becomes a methodological constraint for development workflows.

This review analyzes the implementation of **ISO/IEC 42001**, the international standard for AI Management Systems. As discussed by Seet {cite_032} and Biroğul et al. {cite_033}, evaluating adherence to this standard involves assessing:
1.  **Risk Management:** Methodologies for identifying AI-specific risks (e.g., bias, hallucination).
2.  **Data Governance:** Protocols for training data provenance and leakage prevention.
3.  **Traceability:** The ability to link AI-generated code back to its prompt and model version.
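The traceability requirement can be made concrete with a minimal provenance record. This is a sketch, not a structure prescribed by ISO/IEC 42001: the class `GenerationRecord`, its fields, and all example values are hypothetical. Storing a hash of the prompt, not its raw text, is a design assumption intended to avoid retaining sensitive prompt content.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenerationRecord:
    commit_sha: str       # commit that introduced the generated code
    file_path: str
    model_name: str
    model_version: str
    prompt_sha256: str    # fingerprint of the prompt, not the prompt itself


def make_record(commit_sha: str, file_path: str, model_name: str,
                model_version: str, prompt: str) -> GenerationRecord:
    """Link a piece of generated code to its prompt and model version."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return GenerationRecord(commit_sha, file_path, model_name,
                            model_version, digest)


# Placeholder values for illustration only.
record = make_record(
    commit_sha="0123abc",
    file_path="src/parser.py",
    model_name="example-model",
    model_version="2025-01",
    prompt="Write a function that parses ISO 8601 dates.",
)
print(asdict(record))
```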

Furthermore, the methodology examines the role of **Software Bill of Materials (SBOM)** in the AI era. Shukla {cite_034} and Syed {cite_036} highlight that traditional SBOM methodologies must evolve to include "AI-BOMs" that account for model weights and training datasets. This thesis evaluates how these emerging standards are reshaping the definition of "quality" in software engineering from purely functional correctness to legal and operational compliance.
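What an "AI-BOM" entry might add to a conventional SBOM component can be sketched as a completeness check. The field names below are hypothetical and do not follow any published SBOM schema. They only reflect the two additions named above, model weights and training datasets, plus basic model identification.

```python
# Hypothetical required fields for an AI component entry.
REQUIRED_AI_FIELDS = (
    "model_name",
    "model_version",
    "weights_sha256",      # fingerprint of the model weights
    "training_datasets",   # list of dataset identifiers
)


def missing_ai_bom_fields(component: dict) -> list[str]:
    """Return required fields that are absent or empty in the component entry."""
    return [field for field in REQUIRED_AI_FIELDS if not component.get(field)]


# Illustrative entry with placeholder values; training data is undocumented.
component = {
    "model_name": "example-code-model",
    "model_version": "1.2.0",
    "weights_sha256": "placeholder-digest",
    "training_datasets": [],
}
gaps = missing_ai_bom_fields(component)
print(gaps)
```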

### 2.2.5 Synthesis of Workflow Integration Models

To address the central research question regarding workflow integration, this thesis employs a comparative analysis of workflow models described in the literature. This involves mapping the "As-Is" workflow (traditional SE) against the "To-Be" workflow (AI-augmented SE).

The analysis draws upon the "Human-Centered Software Engineering" framework described by Seffah et al. {cite_019} and the trust adoption frameworks proposed by Barón {cite_015}. The methodological step here is to identify friction points where the introduction of AI tools disrupts established patterns.

Key dimensions of this synthesis include:
*   **The Shift Left:** How AI pushes testing and security concerns earlier in the lifecycle {cite_025}.
*   **The Reviewer Role:** How the developer's role transitions from "writer" to "verifier" {cite_040}{cite_041}.
*   **Knowledge Transfer:** How AI impacts the mentorship and onboarding of junior developers, a gap highlighted in the literature review.

### 2.2.6 Limitations of the Methodology

While the narrative review approach allows for a broad synthesis, it carries inherent limitations that must be acknowledged to contextualize the findings.

**Selection Bias:** Unlike a systematic review, which follows a predefined and documented selection protocol, the narrative approach relies on the researcher's selection of "representative" texts. This may inadvertently favor high-profile studies or those from major tech companies (e.g., Microsoft/GitHub studies on Copilot) over independent, critical research.

**Rapid Obsolescence:** The field of Generative AI is moving at a velocity that renders specific benchmark results obsolete within months. For example, performance metrics for GPT-3.5 cited in 2023 papers may not reflect the capabilities of GPT-4 or Claude 3.5 in 2025. To mitigate this, the methodology focuses on *patterns of interaction* and *fundamental workflow shifts* rather than static performance numbers.

**Lack of Standardized Reporting:** As noted in the discussion of repository mining, there is no standardized method for tagging AI-generated code in version control systems. This forces reliance on proxy metrics or self-reported data, introducing noise into quantitative analyses of productivity.

**Ecological Validity of Benchmarks:** As highlighted by Zhu and Kang {cite_020}, benchmarks like SWE-Bench, while rigorous, may suffer from data leakage (where the solution is in the training set) or lack the complexity of enterprise legacy systems. This limitation means that "solved" benchmarks do not necessarily translate to "solved" industrial problems.
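One common heuristic for reasoning about data leakage is to separate benchmark tasks by whether they were created before or after a model's training cutoff. The sketch below assumes each task carries an issue creation date. The class, function, task identifiers, and cutoff date are hypothetical. A date filter of this kind only narrows the risk; it cannot prove that a task was never seen during training.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BenchmarkTask:
    task_id: str
    issue_created: date


def split_by_training_cutoff(
    tasks: list[BenchmarkTask], training_cutoff: date
) -> tuple[list[BenchmarkTask], list[BenchmarkTask]]:
    """Separate tasks that may be in the training data from later ones."""
    possibly_seen = [t for t in tasks if t.issue_created <= training_cutoff]
    likely_unseen = [t for t in tasks if t.issue_created > training_cutoff]
    return possibly_seen, likely_unseen


# Illustrative tasks and cutoff, not taken from any real benchmark.
tasks = [
    BenchmarkTask("task-001", date(2022, 6, 1)),
    BenchmarkTask("task-002", date(2024, 9, 15)),
]
seen, unseen = split_by_training_cutoff(tasks, training_cutoff=date(2023, 12, 31))
print([t.task_id for t in seen], [t.task_id for t in unseen])
```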

### 2.2.7 Ethical Considerations in the Review Process

Although this thesis does not involve direct human subject experimentation, ethical considerations remain paramount in the analysis of the literature. The review critically examines how primary studies handle:
*   **Data Privacy:** Particularly in studies mining public repositories where developer identity might be exposed.
*   **Consent:** Whether developers using AI tools in workplace studies were fully aware of the telemetry being collected.
*   **Bias:** How studies account for the Western-centric bias inherent in most LLM training data and its impact on global software engineering practices.

By adhering to this multi-faceted methodological framework—combining narrative synthesis, metric analysis, and critical evaluation of governance standards—this thesis aims to provide a robust answer to how Generative AI is reshaping the professional lives of software engineers.
