Hugging Face releases ML Intern, the AI agent teaching itself to beat Claude Code on scientific reasoning

New open source tool autonomously reads papers, pulls datasets and runs GPU training jobs, with Hugging Face putting up $1,000 in compute and Anthropic credits for early users.

Hugging Face's ML Intern automates the full research loop, from paper discovery and dataset selection through to code execution, model training and evaluation.

Hugging Face has released ML Intern, an open source AI agent that autonomously researches, writes and runs machine learning code, with early benchmark results showing it outperforming Anthropic's Claude Code on scientific reasoning and OpenAI's Codex on a healthcare evaluation.

The project, built by Hugging Face's AI agents team, is being pitched as an automated version of the post-training research loop used by the company's own ML researchers, and is available today as a CLI and as a web app for mobile and desktop.

Aksel Joonas Reedi, who works on AI agents at Hugging Face, announced the release on LinkedIn and said the agent pulls papers from arXiv and hf.co/papers, walks citation graphs, retrieves and reformats datasets, and launches training jobs on Hugging Face Jobs when local GPUs are not available.
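Hugging Face has not published the agent's internal prompts or scripts, but the pull-and-reformat step Reedi describes maps onto the standard `datasets` workflow. A minimal sketch of that step, with GSM8K as an illustrative stand-in since the post does not name the datasets involved:

```python
# Minimal sketch of the "pull and reformat a dataset" step using the
# standard Hugging Face `datasets` library. GSM8K is an illustrative
# stand-in; the post does not say which datasets the agent pulled here.
from datasets import load_dataset

raw = load_dataset("openai/gsm8k", "main", split="train")

def to_chat(example):
    # Rewrite one row into the chat-message layout most SFT trainers expect.
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]},
        ]
    }

train_ds = raw.map(to_chat, remove_columns=raw.column_names)
print(train_ds[0]["messages"][0]["content"][:80])
```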

ML Intern beats Claude Code and Codex on benchmarks

Reedi said on LinkedIn that ML Intern was tasked with training the best LLM for scientific reasoning, and that it found NVIDIA research including OpenScience and Nemotron-CrossThink through citation searches before running 12 supervised fine-tuning passes on Qwen3-1.7B: "This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%," he posted.
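The post does not include the training scripts themselves, but a supervised fine-tuning pass of the kind described typically looks like the trl sketch below; the hyperparameters, output path and `train_ds` dataset (chat-format rows, as in the earlier sketch) are illustrative assumptions rather than ML Intern's actual settings:

```python
# Sketch of one supervised fine-tuning pass on Qwen3-1.7B with trl.
# Hyperparameters, output path and `train_ds` (a chat-format dataset,
# as in the earlier sketch) are assumptions, not ML Intern's settings.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="qwen3-1.7b-sft-pass-01",  # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",  # the model ID named in the post
    args=config,
    train_dataset=train_ds,   # chat-format rows from the reformat step
)
trainer.train()
```

Reedi's account has the agent running 12 such passes, presumably varying the data mix and settings between runs, before landing on the GPQA result he quoted.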

On a separate healthcare test, Reedi said the agent decided existing datasets were too low quality, wrote a script to generate 1,100 synthetic data points covering emergency, client and multilingual communication, then upsampled the data 50 times for training; it "Beat Codex on HealthBench by 60%," he said. For a competitive mathematics task, he added that the agent wrote a full GRPO training script, launched it on A100 GPUs via Hugging Face Spaces and, after initial rewards collapsed, ran ablations until training succeeded.
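That GRPO script is likewise unpublished, but trl's GRPOTrainer gives such runs a recognizable skeleton. In the sketch below, the toy prompt set, the exact-match reward and the model ID are stand-ins for whatever the agent actually generated and rewarded:

```python
# Skeleton of a GRPO training run with trl, of the kind the post
# describes. The prompt set, exact-match reward and model ID are toy
# stand-ins; ML Intern's actual reward design and data were not published.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt set: GRPO samples several completions per prompt and scores them.
prompts = Dataset.from_dict({
    "prompt": ["What is 17 * 24?", "What is 91 + 17?"],
    "target": ["408", "108"],
})

def exact_match_reward(completions, target, **kwargs):
    # Extra dataset columns (here `target`) are passed through to the
    # reward function; reward 1.0 when the expected answer appears.
    return [1.0 if t in c else 0.0 for c, t in zip(completions, target)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B",  # illustrative; the post doesn't name the math model
    reward_funcs=exact_match_reward,
    args=GRPOConfig(output_dir="grpo-math-run"),  # hypothetical output path
    train_dataset=prompts,
)
trainer.train()
```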

What developers and researchers get

According to the project's public documentation, ML Intern runs an agentic loop of up to 300 iterations per task, with a context manager that handles message history and auto-compaction, a tool router covering Hugging Face docs, datasets, jobs and papers, plus GitHub code search and sandboxed execution. The CLI installs via uv and accepts any inference provider model ID, with the default configuration pointing to Anthropic's Claude models.
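None of those components are exotic. As a rough mental model, a bounded agentic loop with auto-compaction reduces to something like the sketch below, in which every name is illustrative rather than taken from ML Intern's source; the open source repository has the real implementation:

```python
# Rough mental model of a bounded agentic loop with context compaction.
# Every name here is illustrative, not ML Intern's actual code.
MAX_ITERATIONS = 300  # iteration cap reported in the project's documentation

def run_task(task, llm, tools, compact, max_chars=120_000):
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_ITERATIONS):
        # Auto-compaction: summarize older turns when history grows too large.
        if sum(len(m["content"]) for m in messages) > max_chars:
            messages = compact(messages)
        reply = llm(messages)            # model decides: answer, or call a tool
        messages.append({"role": "assistant", "content": reply.text})
        if reply.tool_call is None:      # no tool requested -> task is done
            return reply.text
        # Tool router: dispatch to docs/datasets/jobs/papers/search/sandbox.
        result = tools[reply.tool_call.name](**reply.tool_call.args)
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("hit the iteration cap without finishing the task")
```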

Reedi said on LinkedIn that the agent "deeply embodies how researchers work and think" and "knows how data should look like and what good models feel like."

Hugging Face puts $1,000 in GPU credits behind launch

Reedi added that Hugging Face has provisioned $1,000 in GPU resources and Anthropic credits for "the quickest" early users of the tool, with the CLI and web app both live today. The incentive lands as universities, bootcamps and EdTech startups are under pressure to give students and staff hands-on access to model training without paying commercial cloud rates.

The immediate question is how ML Intern's autonomy holds up outside curated benchmarks, particularly on messy real-world education datasets where data quality, consent and licensing constraints all apply. Hugging Face says the agent is open source and built on its own ecosystem, which means the community will be able to test those limits in public.
