Federico De Ponte
Founder, OpenDraft
Citation Hallucination Benchmark: OpenDraft vs GPT-5.2, GPT-4o, GPT-3.5
We ran the same research prompts through OpenDraft, GPT-5.2 (OpenAI's newest model), GPT-4o, and GPT-3.5, then verified every citation against real academic databases. The results speak for themselves.
The Results
- OpenDraft: 295 citations, 0 fabricated (100% verified)
- GPT-5.2: 47 citations, 4 fabricated (8.5%)
- GPT-4o: 49 citations, 5 fabricated (10.2%)
- GPT-3.5: 48 citations, 31 unverifiable (64.6% without DOIs)
The Test
We created 10 research prompts across academic disciplines, each requesting a literature review with citations:
- Computer Science - Transformer architectures in NLP
- Medicine - CRISPR gene therapy advances
- Psychology - Social media and adolescent mental health
- Economics - Universal basic income research
- Environmental Science - Microplastics in marine ecosystems
- Education - Online vs. classroom learning
- Physics - Quantum computing for optimization
- Sociology - Income inequality and social mobility
- Neuroscience - Neurobiological mechanisms of addiction
- Business - Remote work productivity
Each prompt asked for a 500-word literature review with 5+ citations including DOIs.
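For a sense of the shape, the Computer Science prompt looked roughly like this (the exact wording of all 10 prompts is included in the repo's benchmark directory; this paraphrase is illustrative, not verbatim):

```text
Write a 500-word literature review on transformer architectures in NLP.
Cite at least 5 peer-reviewed sources and include a DOI for each citation.
```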
Verification Method
For each citation, we checked:
- CrossRef API - 130M+ indexed publications
- arXiv API - 2M+ preprints
- Semantic Scholar - 200M+ papers
- doi.org resolver - Fallback verification
A citation whose DOI doesn't exist in any of these databases is marked as fabricated. A citation that provides no DOI at all can't be checked this way and is marked as unverifiable.
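A minimal sketch of this check in Python, assuming the `requests` library. The function name and control flow are illustrative, not the repo's verify_citations.py verbatim, and the separate arXiv lookup is omitted for brevity:

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI is found in at least one source."""
    sources = [
        f"https://api.crossref.org/works/{doi}",                      # CrossRef (130M+ works)
        f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}",  # Semantic Scholar
        f"https://doi.org/{doi}",                                     # doi.org resolver (fallback)
    ]
    for url in sources:
        try:
            resp = requests.get(url, allow_redirects=True, timeout=10)
            if resp.status_code == 200:
                return True
        except requests.RequestException:
            continue  # network error: not confirmed here, try the next source
    return False

# A real DOI should verify; a made-up one should not.
print(doi_exists("10.18653/v1/N19-1423"))   # BERT (NAACL 2019) -> True
print(doi_exists("10.9999/fake.2024.123"))  # fabricated -> False
```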
Results Breakdown
Summary Comparison
| Model | Citations | Verified | Fabricated | Unverifiable |
|---|---|---|---|---|
| OpenDraft | 295 | 295 (100%) | 0 (0%) | 0 (0%) |
| GPT-5.2 | 47 | 39 (83%) | 4 (8.5%) | 4 (8.5%) |
| GPT-4o | 49 | 44 (89.8%) | 5 (10.2%) | 0 (0%) |
| GPT-3.5 Turbo | 48 | 16 (33.3%) | 1 (2.1%) | 31 (64.6%) |
Note: GPT-3.5's fabrication rate looks low, but most of its citations lack DOIs entirely, which makes them impossible to verify.
OpenDraft - 8 Prompts Completed (2 of 10 timed out)
| Discipline | Citations | Verified | Fabricated |
|---|---|---|---|
| Computer Science | 44 | 44 | 0 |
| Medicine | 34 | 34 | 0 |
| Psychology | 50 | 50 | 0 |
| Economics | 33 | 33 | 0 |
| Environmental Science | 36 | 36 | 0 |
| Education | 37 | 37 | 0 |
| Physics | 33 | 33 | 0 |
| Sociology | 28 | 28 | 0 |
| Total | 295 | 295 (100%) | 0 (0%) |
A Note on Volume
GPT models produced ~5 citations per prompt (what was requested). OpenDraft produced ~37 citations per prompt. This isn't a flaw in the comparison — it's an architectural difference:
- GPT models generate the minimum citations needed
- OpenDraft queries academic databases and returns all relevant papers
The fabrication rate is what matters: even OpenAI's newest GPT-5.2 fabricated 8.5% of its citations, and GPT-4o fabricated 10.2%. OpenDraft fabricated 0%.
Why the Difference?
The difference isn't prompt engineering. It's architecture.
GPT Models' Approach
- Generate citations from training data
- Have no real-time database access
- Cannot verify whether DOIs exist
- Produce plausible-looking but fake citations
OpenDraft Approach
- Queries CrossRef & Semantic Scholar in real-time
- Only includes papers that exist in databases
- Validation phase removes any unverifiable citations
- Every DOI is checked before output
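The retrieval side of that loop can be sketched against CrossRef's public search endpoint. This is a minimal retrieve-then-cite example, assuming Python with `requests`; the names are illustrative and this is not OpenDraft's internal code:

```python
import requests

def find_real_papers(topic: str, rows: int = 5) -> list[dict]:
    """Search CrossRef and return only papers that carry a DOI."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query": topic, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Every result comes from the index, so each citation is
    # verifiable by construction rather than generated from memory.
    return [
        {"title": it.get("title", [""])[0], "doi": it["DOI"]}
        for it in items
        if "DOI" in it
    ]

for paper in find_real_papers("microplastics in marine ecosystems"):
    print(paper["doi"], "-", paper["title"])
```

Because every candidate citation originates from an index lookup rather than model memory, a fabricated DOI can't enter the pipeline in the first place; the validation pass then removes anything the databases fail to confirm.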
Reproduce It Yourself
The entire benchmark is open source. Run it yourself:
```bash
git clone https://github.com/federicodeponte/opendraft
cd opendraft/benchmark

# Verify citations
python3 verify_citations.py responses/chatgpt/prompt_1.txt -o results/chatgpt_1.json
python3 verify_citations.py responses/opendraft/prompt_1.txt -o results/opendraft_1.json
```
All prompts, responses, and verification code are included. Check our work.
Related Articles
Why Hallucination is a Design Failure
The architectural reason ChatGPT can't avoid fake citations.
How 19 AI Agents Work Together
Technical deep-dive into OpenDraft's architecture.
Methodology: Benchmark conducted December 2024. GPT-5.2 (gpt-5.2), GPT-4o (gpt-4o), GPT-3.5 Turbo (gpt-3.5-turbo) via OpenAI API. OpenDraft v1.0 with Gemini + CrossRef + Semantic Scholar.
Full data: Download raw JSON results