How to Build an AI Agent That Runs its Own LLM Experiments with autoresearch

**Today** at 02:15

Category: Tutorials | FreeCodeCamp Premium
Main Language: Portuguese (Technology Content)

Tutorial Content / Step-by-Step Guide:
-------------------------------------------------------------------------
A few months ago, Andrej Karpathy released autoresearch. It's an open-source Python tool that lets an AI agent run experiments on one GPU while you sit back and wait for the results.

I still see people on Twitter arguing about whether AI agents can build their "million dollar idea," or debating Openclaw. But here's a repo that lets you hand an agent a real GPT training setup and ask it to do the research itself.

In short, the agent edits the code, trains the model, reads the loss, decides whether to keep the result, and repeats. All of this happens while you sleep or work on something else. And, surprisingly, it actually works.

On a depth-12 nanochat baseline (more on what "depth" means later), Karpathy left it running for about two days. Over roughly 700 experiments, the agent found about 20 changes that genuinely improved the model, and those changes stacked on top of each other.

In this article, I'll walk through what autoresearch is, why the way it measures success is the whole trick, what each file in the repo actually does, and what the agent tends to discover. Then I'll give you a step-by-step guide to running it yourself. By the end, you should be able to point an agent at your own GPU and let it run.

Table of Contents

• Prerequisites

• What is autoresearch?

• Why This Matters

• What Exactly is val_bpb?

• What the Agent Actually Finds

• Final Thoughts

Prerequisites

This article is a complete walkthrough of this repo. The goal is that by the end, you'll understand what autoresearch is and how you can run it on your own machine.

No prior ML research experience is required, though if you have some, the deeper sections will be more meaningful to you. A basic familiarity with GPUs, VRAM, and cards like the H100, A100, or RTX 4090 is enough. And don't worry: I've added explanations below for every term I think a beginner needs to understand.

What is autoresearch?

Simply put, autoresearch is just one specific idea executed cleanly. You take a small but real LLM training setup, put it in a single Python file, and let an AI agent edit that file.

The agent runs the file and reads the loss. When you train a language model, "loss" is just a single number that scores how badly the model is predicting the next chunk of text. A high number means it's guessing poorly, and a number close to zero means it's predicting almost perfectly.
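To make "loss" concrete, here is a minimal sketch of next-token cross-entropy, the standard loss for language models. It's a toy with a made-up five-token vocabulary and hand-picked scores (logits), not code from autoresearch, and it only shows how the number behaves.

```python
import math

def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability the model assigns to the correct next token."""
    # Softmax turns raw scores (logits) into probabilities that sum to 1.
    # Subtracting the max first keeps exp() from overflowing.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    prob_target = exps[target] / sum(exps)
    return -math.log(prob_target)

# Toy vocabulary of 5 tokens; the correct next token is index 2.
target = 2
unsure = [0.0, 0.0, 0.0, 0.0, 0.0]      # uniform guess over all 5 tokens
confident = [0.0, 0.0, 10.0, 0.0, 0.0]  # strongly favors the right token
wrong = [10.0, 0.0, 0.0, 0.0, 0.0]      # strongly favors a wrong token

for name, logits in [("unsure", unsure), ("confident", confident), ("wrong", wrong)]:
    print(f"{name:>9}: loss = {cross_entropy(logits, target):.3f}")
```

The uniform guess scores ln 5 ≈ 1.609, the confident correct guess scores close to zero, and the confident wrong guess scores about 10, which matches the "high is bad, near zero is good" intuition.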

Training is the process of nudging the model's millions of internal weights to push that number down. So when I say the agent "reads the loss," I mean it looks at that score to judge whether the change it just made helped or hurt.
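The "nudging" is gradient descent. As a deliberately tiny illustration (one weight instead of millions, and a simple squared-error toy loss instead of cross-entropy), each step moves the weight a little in the direction that lowers the loss. This is a generic sketch of the idea, not how nanochat or autoresearch implement their optimizer.

```python
def loss(w: float) -> float:
    # Toy loss with its minimum at w = 3.0 (a stand-in for a real model's loss).
    return (w - 3.0) ** 2

def grad(w: float) -> float:
    # Derivative of the toy loss: d/dw (w - 3)^2 = 2 * (w - 3).
    return 2.0 * (w - 3.0)

w = 0.0              # starting weight
learning_rate = 0.1  # how big each nudge is

for _ in range(50):
    w -= learning_rate * grad(w)  # step against the gradient to push the loss down

print(f"w = {w:.4f}, loss = {loss(w):.6f}")
```

Each step shrinks the distance to the minimum by a constant factor, so after 50 steps the weight sits very close to 3.0 and the loss is essentially zero.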

Based on that score, the agent either keeps the change or reverts it, and then tries something else.
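The keep-or-revert rule is simple enough to write down. This sketch assumes the score is `val_bpb` (the metric named in the table of contents), where lower is better. The `should_keep` helper and its `min_improvement` tolerance are hypothetical illustrations, not taken from the autoresearch code, and the scores are made up.

```python
def should_keep(new_score: float, best_score: float, min_improvement: float = 0.0) -> bool:
    """Keep a change only if it lowers the score.

    Hypothetical helper: assumes lower is better (as with a loss or val_bpb)
    and that a change must beat the best score by more than `min_improvement`.
    """
    return new_score < best_score - min_improvement

best = 1.000  # illustrative starting score, not a real autoresearch result
for candidate in [0.998, 1.003, 0.995]:
    if should_keep(candidate, best):
        best = candidate
        print(f"{candidate:.3f}: keep (new best)")
    else:
        print(f"{candidate:.3f}: revert")
```

A nonzero `min_improvement` is one way you might guard against keeping changes that only look better because of run-to-run noise.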

The flow runs top to bottom like this: A human (you) writes the playbook (a Markdown file called program.md), which spells out the rules. An AI agent reads that playbook and starts an experiment loop.

In each pass of the loop, the agent edits the training code with a new idea, trains for five minutes, reads the resulting score, decides whether to keep or undo the change, and writes the outcome to a results file. Then it loops back and tries the next idea.
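Here is that loop as a self-contained Python sketch. Everything in it is hypothetical: `propose_idea`, `train_and_score`, `run_loop`, and the `results.tsv` file name stand in for what the agent actually does (editing the training file and running a five-minute training job), and the scores are random numbers so the sketch runs anywhere without a GPU.

```python
import csv
import random
import tempfile
from pathlib import Path

def propose_idea(i: int) -> str:
    # Hypothetical stand-in for the agent choosing an edit to the training code.
    return f"idea-{i}"

def train_and_score(idea: str, rng: random.Random) -> float:
    # Hypothetical stand-in for a 5-minute training run. Returns a fake score
    # (lower is better); a real run would train the model and evaluate it.
    return 1.0 + rng.uniform(-0.02, 0.02)

def run_loop(n_experiments: int, results_path: Path, seed: int = 0) -> float:
    rng = random.Random(seed)
    best = train_and_score("baseline", rng)
    with results_path.open("w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["experiment", "idea", "score", "status"])
        for i in range(n_experiments):
            idea = propose_idea(i)              # 1. edit the code with a new idea
            score = train_and_score(idea, rng)  # 2. train and read the score
            if score < best:                    # 3. keep it (new best) or discard it
                best, status = score, "keep"
            else:
                status = "revert"
            writer.writerow([i, idea, f"{score:.4f}", status])  # 4. log the outcome
    return best

out = Path(tempfile.mkdtemp()) / "results.tsv"
print("best score:", round(run_loop(12, out), 4))  # roughly one hour at ~12 runs/hour
print(out.read_text())
```

Logging every attempt, including the reverted ones, is what lets you read back the whole night's history in the morning.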

It does this on its own, around twelve times an hour. So a full night of sleep buys you roughly a hundred experiments and, with luck, a noticeably better model by morning.
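The "twelve times an hour" figure is just arithmetic on the five-minute training budget. In this small calculation, the `overhead_minutes` parameter is my own assumption (the article doesn't say how long the agent spends editing and evaluating between runs), and `experiments_in` is a hypothetical helper.

```python
def experiments_in(hours: float, train_minutes: float = 5.0, overhead_minutes: float = 0.0) -> int:
    """Number of complete experiments that fit in `hours`.

    Assumes each run takes `train_minutes` plus `overhead_minutes` for the agent
    to edit code, evaluate, and log; the overhead value is a hypothetical knob.
    """
    per_run = train_minutes + overhead_minutes
    return int(hours * 60 // per_run)

print(experiments_in(1))                        # 12 per hour with zero overhead
print(experiments_in(8))                        # a full night: 96, roughly a hundred
print(experiments_in(8, overhead_minutes=1.0))  # 80 if each run costs an extra minute
```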

The repo is lai

... [The tutorial continues at the link below] ...