
Federico De Ponte

Founder, OpenDraft

6 min read
Case Study

Citation Hallucination Benchmark: OpenDraft vs GPT-5.2, GPT-4o, GPT-3.5

We ran the same research prompts through OpenDraft, GPT-5.2 (OpenAI's newest model), GPT-4o, and GPT-3.5, then verified every citation against real academic databases. The results speak for themselves.

The Results

  • OpenDraft: 100% verified (295 citations, 0 fabricated)
  • GPT-5.2: 8.5% fabricated (47 citations, 4 fabricated)
  • GPT-4o: 10.2% fabricated (49 citations, 5 fabricated)
  • GPT-3.5: 64.6% without DOIs (48 citations, 31 unverifiable)


The Test

We created 10 research prompts across academic disciplines, each requesting a literature review with citations:

  1. Computer Science - Transformer architectures in NLP
  2. Medicine - CRISPR gene therapy advances
  3. Psychology - Social media and adolescent mental health
  4. Economics - Universal basic income research
  5. Environmental Science - Microplastics in marine ecosystems
  6. Education - Online vs. classroom learning
  7. Physics - Quantum computing for optimization
  8. Sociology - Income inequality and social mobility
  9. Neuroscience - Neurobiological mechanisms of addiction
  10. Business - Remote work productivity

Each prompt asked for a 500-word literature review with 5+ citations including DOIs.
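
To make the setup concrete, here is an illustration of how such a request can be issued through the OpenAI API (the API used for the GPT models in this benchmark). The template wording and topic below are illustrative only; the exact prompt texts ship with the open-source benchmark described under "Reproduce It Yourself".

from openai import OpenAI

# Illustrative prompt template; the benchmark's actual prompt wording is in the repo
PROMPT_TEMPLATE = (
    "Write a ~500-word literature review on {topic}. "
    "Cite at least 5 peer-reviewed papers and include a DOI for every citation."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # repeated for each model under test
    messages=[{
        "role": "user",
        "content": PROMPT_TEMPLATE.format(topic="transformer architectures in NLP"),
    }],
)
print(response.choices[0].message.content)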

Verification Method

For each citation, we checked:

  1. CrossRef API - 130M+ indexed publications
  2. arXiv API - 2M+ preprints
  3. Semantic Scholar - 200M+ papers
  4. doi.org resolver - Fallback verification

If a cited DOI does not exist in any of these sources, the citation is marked as fabricated. Citations that cannot be checked at all (for example, because no DOI is given) are counted as unverifiable rather than fabricated.
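
For reference, the core of this check can be expressed in a few lines. The function below is a simplified sketch of the verification cascade (CrossRef first, then the doi.org resolver as a fallback); the actual verify_citations.py also queries arXiv and Semantic Scholar.

import requests

def doi_exists(doi: str) -> bool:
    """Simplified sketch: does this DOI resolve in CrossRef or at doi.org?"""
    # CrossRef returns HTTP 200 for any DOI it has indexed
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if r.status_code == 200:
        return True
    # Fallback: the doi.org resolver redirects (3xx) for registered DOIs
    r = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
    return 300 <= r.status_code < 400

print(doi_exists("10.1038/nature14539"))          # a real, indexed DOI -> True
print(doi_exists("10.9999/obviously.fake.2024"))  # not a registered DOI -> False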

Results Breakdown

Summary Comparison

Model            Citations   Verified      Fabricated   Unverifiable
OpenDraft        295         295 (100%)    0 (0%)       0 (0%)
GPT-5.2          47          39 (83%)      4 (8.5%)     4 (8.5%)
GPT-4o           49          44 (89.8%)    5 (10.2%)    0 (0%)
GPT-3.5 Turbo    48          16 (33.3%)    1 (2.1%)     31 (64.6%)

Note: GPT-3.5's fabrication rate looks low only because most of its citations lack DOIs entirely, so they are counted as unverifiable rather than fabricated.

OpenDraft - 8 Prompts (2 timed out)

Discipline              Citations   Verified      Fabricated
Computer Science        44          44            0
Medicine                34          34            0
Psychology              50          50            0
Economics               33          33            0
Environmental Science   36          36            0
Education               37          37            0
Physics                 33          33            0
Sociology               28          28            0
Total                   295         295 (100%)    0 (0%)

A Note on Volume

GPT models produced ~5 citations per prompt (what was requested). OpenDraft produced ~37 citations per prompt. This isn't a flaw in the comparison — it's an architectural difference:

  • GPT models generate the minimum citations needed
  • OpenDraft queries academic databases and returns all relevant papers

The fabrication rate is what matters: even OpenAI's newest GPT-5.2 fabricated 8.5% of its citations and GPT-4o fabricated 10.2%, while OpenDraft fabricated 0%.

Why the Difference?

The difference isn't prompt engineering. It's architecture.

The GPT Approach

  • Generates citations from training data
  • Has no real-time database access
  • Cannot verify whether a DOI exists
  • Produces plausible-looking but fake citations

The OpenDraft Approach

  • Queries CrossRef and Semantic Scholar in real time
  • Only includes papers that exist in those databases
  • A validation phase removes any unverifiable citations
  • Every DOI is checked before output (a minimal sketch of this retrieve-then-validate flow follows)
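
The practical consequence of this architecture can be shown in a few lines: instead of asking a language model to recall references, the pipeline searches a live index and keeps only papers that come back with an identifier. The sketch below uses the public Semantic Scholar search API and is an approximation of the idea, not OpenDraft's internal code.

import requests

def retrieve_citations(query: str, limit: int = 20) -> list[dict]:
    """Retrieve-then-validate sketch: search a real index, keep only papers with a DOI."""
    r = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit, "fields": "title,year,externalIds"},
        timeout=10,
    )
    r.raise_for_status()
    papers = r.json().get("data", [])
    # Validation step: drop anything that does not carry a resolvable DOI
    return [
        {"title": p["title"], "year": p.get("year"), "doi": p["externalIds"]["DOI"]}
        for p in papers
        if p.get("externalIds") and "DOI" in p["externalIds"]
    ]

for paper in retrieve_citations("microplastics in marine ecosystems")[:5]:
    print(paper["doi"], "-", paper["title"])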

Reproduce It Yourself

The entire benchmark is open source. Run it yourself:

git clone https://github.com/federicodeponte/opendraft
cd opendraft/benchmark

# Verify citations
python3 verify_citations.py responses/chatgpt/prompt_1.txt -o results/chatgpt_1.json
python3 verify_citations.py responses/opendraft/prompt_1.txt -o results/opendraft_1.json

All prompts, responses, and verification code are included. Check our work.

Try OpenDraft

Generate your own research draft with 100% verified citations.

Get Started Free

Methodology: Benchmark conducted December 2024. GPT-5.2 (gpt-5.2), GPT-4o (gpt-4o), GPT-3.5 Turbo (gpt-3.5-turbo) via OpenAI API. OpenDraft v1.0 with Gemini + CrossRef + Semantic Scholar.

Full data: Download raw JSON results