Tampere University study questions whether bigger AI models write better code

GPT-Lab research finds smaller language models can compete on code generation while using a fraction of the memory.

Researchers at GPT-Lab at Tampere University in Finland have published a peer-reviewed study examining whether larger language models consistently outperform smaller ones in code generation tasks. The findings suggest that while scale improves accuracy, the gains come with significant computational costs.

The paper, Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks, published in the Journal of Systems and Software, evaluates 20 open-source small language models ranging from 0.4 billion to 10 billion parameters. The study benchmarks them across established code generation tasks to measure accuracy, stability, and resource requirements.

The research was shared publicly by GPT-Lab, raising a direct question: do bigger models always mean better code?

Benchmarking 20 small models

The study evaluates models using widely recognized code generation benchmarks, including HumanEval and MBPP, testing their ability to generate correct and executable code across programming tasks.
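
The paper's exact evaluation harness is not reproduced in the article, but HumanEval-style benchmarking typically reduces to generating one candidate solution per prompt and running it against the task's unit tests. The sketch below is a simplified, hypothetical illustration of that pass@1 scoring loop; the `generate_code` callable, the `tasks` structure, and the test-running logic are assumptions rather than the study's code, and a real harness would sandbox execution of generated programs.

```python
# Illustrative pass@1 scoring loop for a HumanEval/MBPP-style benchmark.
# Hypothetical sketch only; not the study's harness. A production harness
# would sandbox execution, as generated code is untrusted.

def run_tests(candidate_src: str, test_src: str) -> bool:
    """Execute a generated solution against the task's unit tests."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the benchmark's assertions
        return True
    except Exception:
        return False                    # any error or failed assertion counts as incorrect


def pass_at_1(tasks, generate_code) -> float:
    """Fraction of tasks whose single generated sample passes all tests."""
    passed = sum(
        run_tests(generate_code(task["prompt"]), task["test"])
        for task in tasks
    )
    return passed / len(tasks)
```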

Across benchmarks, larger models generally achieved higher pass rates and more stable outputs. The gains, however, were incremental: according to the paper’s empirical results, increasing model size improved accuracy, but with diminishing returns relative to hardware demands.

A key finding highlighted in the study is that achieving roughly a 10 percentage point improvement in accuracy can require approximately four times more VRAM. This memory requirement has direct implications for deployment in constrained environments, including university labs, startups, and edge systems.
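
To make the scale of that trade-off concrete, the illustrative calculation below uses hypothetical figures, not the paper's reported numbers, to show how a roughly ten-point accuracy gain bought with four times the VRAM changes the accuracy delivered per gigabyte of memory.

```python
# Hypothetical back-of-envelope illustration of the trade-off described above.
# The figures are invented for illustration and are NOT the study's results.

small = {"accuracy": 0.55, "vram_gb": 8}    # hypothetical smaller model
large = {"accuracy": 0.65, "vram_gb": 32}   # hypothetical larger model (+10 points, 4x VRAM)

for name, m in (("small", small), ("large", large)):
    print(f"{name}: {m['accuracy'] / m['vram_gb']:.4f} accuracy per GB of VRAM")

# Under these assumed numbers, the smaller model delivers roughly three times
# more accuracy per gigabyte, even though the larger model is more accurate
# in absolute terms.
```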

The researchers also observed that smaller models demonstrated similar behavior patterns across multiple programming languages, suggesting that scaling effects were consistent rather than language-specific.

Accuracy versus computational cost

The paper provides detailed evaluation metrics comparing model size against hardware requirements. On lower-end GPUs, several smaller models delivered competitive results relative to their resource footprint.

The authors note that while larger models remain more accurate overall, smaller models can represent a more efficient trade-off when compute availability is limited. In scenarios where infrastructure costs, latency, or energy use are factors, the marginal accuracy gains of larger systems may not justify the additional resource demand.

This efficiency consideration is particularly relevant for educational institutions and research environments where GPU access is finite.

Implications for AI deployment in education and software engineering

For EdTech and computer science education contexts, the findings are material. AI-assisted coding tools are increasingly integrated into teaching environments, but infrastructure constraints often limit which models can be deployed at scale.

The study suggests that model selection should be use-case driven rather than size-driven. In classroom or institutional settings where dozens or hundreds of users rely on shared compute, smaller open-source models may offer a practical balance between capability and scalability.

The research also signals a broader industry shift. As AI systems mature, optimization and deployment efficiency are becoming as strategically important as leaderboard performance. In software engineering workflows, reliability, cost, and energy footprint are increasingly part of the equation.

Rather than reinforcing the assumption that scale alone defines capability, the Tampere study positions efficiency as a first-order consideration. When memory and compute matter, smaller models may often make more sense.

