## Paper 14: Comparative Analysis of AI Models for Python Code Generation: A HumanEval Benchmark Study
**Authors:** Bayram, Menekse Dalveren, Derawi
**Year:** 2025
**Venue:** Applied Sciences (MDPI)
**DOI:** [10.3390/app15189907](https://doi.org/10.3390/app15189907)

### Research Question
Which contemporary LLM family (Claude vs. GPT) performs better in Python code generation regarding correctness, complexity, and maintainability?

### Methodology
- **Design:** Comparative Benchmarking.
- **Approach:** Evaluated 6 models (GPT-3.5, GPT-4o, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4, Claude Opus 4) on 164 Python problems.
- **Data:** HumanEval Benchmark.
- **Metrics:** Pass@1, Cyclomatic Complexity, Maintainability Index, LOC.
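The static metrics listed above can be approximated with Python's standard library. The sketch below is illustrative only: the summary does not say which tool the authors used, so the counting rules, the function names `cyclomatic_complexity` and `logical_loc`, and the `SAMPLE` snippet are all assumptions. It computes a simplified McCabe complexity (one plus the number of decision points) and a count of non-blank, non-comment lines. The Maintainability Index is omitted because it also needs Halstead volume.

```python
import ast

# Node types treated as decision points in this simplified approximation.
# Dedicated tools may count a slightly different set (assumption).
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.AsyncFor,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 + the number of decision points found in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths (one per operator).
            complexity += len(node.values) - 1
    return complexity


def logical_loc(source: str) -> int:
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


# Hypothetical solution in the style of a benchmark task (not from the paper).
SAMPLE = '''
def has_close_elements(numbers, threshold):
    # Compare every pair once.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
'''

if __name__ == "__main__":
    print("complexity:", cyclomatic_complexity(SAMPLE))
    print("LOC:", logical_loc(SAMPLE))
```

Running the same two functions over each model's generated solutions would give per-model distributions that could be compared alongside Pass@1.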

### Key Findings
1. **Performance Leader:** **Claude Sonnet 4** achieved the highest success rate (95.1%), followed by Claude Opus 4 (94.5%).
2. **Model Family Gap:** Anthropic Claude models outperformed OpenAI GPT models by margins exceeding **20%** across metrics.
3. **Code Quality:** Claude models generated more sophisticated and maintainable solutions; GPT models favored simpler but less reliable strategies.
4. **Significance:** The performance difference between the model families was statistically significant (p < 0.001).
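To put the reported success rates in context, the sketch below converts them back to problem counts. It assumes the rates are Pass@1 over all 164 problems with one sample per problem; that is an inference from this summary, not a figure quoted from the paper. `pass_at_k` is the standard unbiased estimator from the original HumanEval work, and `solved_count` and `benchmark_pass_at_1` are hypothetical helper names.

```python
from math import comb

TOTAL_PROBLEMS = 164  # size of HumanEval, as stated in the summary


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed: list[bool]) -> float:
    """Average pass@1 over problems, with one sample per problem (assumption)."""
    return sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)


def solved_count(pass_rate_percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Nearest whole number of solved problems implied by a reported rate."""
    return round(pass_rate_percent / 100 * total)


if __name__ == "__main__":
    for model, rate in [("Claude Sonnet 4", 95.1), ("Claude Opus 4", 94.5)]:
        solved = solved_count(rate)
        print(f"{model}: ~{solved}/{TOTAL_PROBLEMS} "
              f"({solved / TOTAL_PROBLEMS:.1%})")
```

Under these assumptions the two leading rates correspond to 156 and 155 solved problems, so the gap between the top two models is a single problem.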

### Implications
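The results suggest a shift in state-of-the-art leadership on coding tasks from OpenAI to Anthropic. The paper also shows that *accuracy* need not be the only evaluation criterion: *maintainability* can be quantified alongside it through metrics such as the Maintainability Index and Cyclomatic Complexity.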

### Limitations
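- **Model Versions:** The paper evaluates "Claude 3.7 Sonnet", "Claude Sonnet 4", and "Claude Opus 4". None of these names existed as of late 2024, but all three are publicly released models: Anthropic released Claude 3.7 Sonnet in February 2025 and Claude Sonnet 4 and Claude Opus 4 in May 2025, which is consistent with the paper's 2025 publication date. The exact model snapshots and API settings used should still be verified against the paper, since results can vary between versions.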
- **Benchmark Saturation:** HumanEval is widely considered "saturated" (too easy) for 2025-era models.

### Relevance to Your Research
**Score:** ⭐⭐⭐⭐⭐ (5/5)
**Why:** Provides hard numbers and direct model comparisons.

---