Class 4: Agentic AI — From Zero to Hero#

The AI Coding Revolution#

In 2022, GitHub ran a controlled experiment: developers with access to an AI coding assistant completed a benchmark task about 55% faster than those without. Recent developer surveys report that a large majority of professional developers now use AI tools regularly. For scientists who write analysis code, the implications are just as large, and the risks are greater: a silent error in a data pipeline can do far more damage than a visible bug in a web app.

But here’s the crucial distinction that separates effective users from frustrated ones:

Knowing how these tools work tells you exactly when and why they fail — and that knowledge is what lets you use them productively.

This class gives you the mental model, the tooling setup, and the workflow to go from complete beginner to confident daily user.

Part 1: How LLMs Work (and Why They Fail)#

Tokens — The Building Blocks#

Before anything else: LLMs don’t read words, they read tokens. A token is a sub-word unit produced by a tokenizer. For most English text a token is roughly a word or a word fragment: common short words map to a single token, while longer or rarer words are split into several:

| Text | Tokens |
|------|--------|
| `cat` | `cat` |
| `neuroscience` | `neur` `oscience` |
| `prefrontal` | `pre` `frontal` |
| `np.zeros(10)` | `np` `.` `zeros` `(` `10` `)` |

Modern models have vocabularies of 100 000–200 000 tokens. This matters because:

  • The model’s context window (how much it can “see”) is measured in tokens

  • Token limits affect how much code, how many files, and how long a conversation the model can hold at once

  • Rare words (species names, lab-specific abbreviations) may tokenize poorly and be represented less reliably
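
If you want to see these splits yourself, a tokenizer library will show them directly. A minimal sketch, assuming the tiktoken package is installed (it implements OpenAI’s tokenizers; other vendors’ tokenizers differ in detail but behave similarly):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several recent OpenAI models
for text in ["cat", "neuroscience", "np.zeros(10)"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")

The exact splits vary between models, but the counts are a reasonable guide when estimating how much code or text fits in a context window.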

The Core Mechanism: Next-Token Prediction#

Everything these models do reduces to one operation:

Given all the tokens seen so far, output a probability distribution over every possible next token. Sample from that distribution. Repeat.

Input tokens:  "import numpy as"
               ────────────────
Model output:  P(" np")      = 0.89
               P(" numpy")   = 0.04
               P(" pd")      = 0.02
               P(" plt")     = 0.01
               P(everything else) ≈ 0.04
               ────────────────
Sampled token: " np"

Full output:   "import numpy as np"

That’s it. The model never “looks up” an answer in a database. It generates a statistically likely continuation of the text you gave it, one token at a time.
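
To make the sampling step concrete, here is a toy sketch in plain NumPy. The five-token vocabulary and its probabilities are invented for illustration; a real model computes this distribution over its full vocabulary at every step:

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical next-token distribution after the prompt "import numpy as"
vocab = [" np", " numpy", " pd", " plt", " torch"]
probs = [0.89, 0.04, 0.02, 0.01, 0.04]

# One generation step: sample a token according to its probability
next_token = rng.choice(vocab, p=probs)
print("import numpy as" + str(next_token))

Feeding the sampled token back in as input and repeating is all that text generation is.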

When that statistical likelihood aligns with correctness — as it usually does for well-represented patterns like Python syntax — the output is very useful. When it doesn’t — as it often does for obscure facts, recent events, or domain-specific reasoning — the output can be fluently, confidently wrong.

The architecture behind most LLMs is the Transformer, introduced in the landmark 2017 paper “Attention Is All You Need” by Vaswani et al. at Google. The key innovation is self-attention: a mechanism that lets the model simultaneously consider relationships between all pairs of tokens in its input window, rather than reading left-to-right sequentially.

This is why LLMs are so much better than earlier approaches at tasks like code completion, where a variable defined at line 1 needs to be tracked through to line 200.

The Training Pipeline#

Modern LLMs are trained in three stages:

┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 1: PRE-TRAINING                                                  │
│  Data: the entire readable internet + books + code + papers             │
│  Task: predict the next token (self-supervised — no labels needed)      │
│  Cost: months of compute, millions of dollars                           │
│  Result: a model that "knows" about language, facts, and code           │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 2: SUPERVISED FINE-TUNING (SFT)                                  │
│  Data: curated examples of good question→answer conversations           │
│  Task: predict the assistant's ideal response                           │
│  Result: the model learns to behave like a helpful assistant            │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 3: RLHF / RLAIF                                                  │
│  (Reinforcement Learning from Human / AI Feedback)                      │
│  Process: humans or another AI rate pairs of responses;                 │
│           a reward model learns these preferences;                      │
│           the LLM is trained via RL to maximize the reward              │
│  Result: helpful, polite responses — but also sycophancy (see below)    │
└─────────────────────────────────────────────────────────────────────────┘

Why this matters for you: The model’s “knowledge” comes from the pre-training corpus, which has a cutoff date. Anything published after that date is invisible to the model unless explicitly provided in the conversation. RLHF explains why the model is agreeable — sometimes overly so.

Failure Mode 1: Hallucination#

The model has no “I don’t know” state. When it has no reliable pattern to draw on, it generates whatever is most statistically likely — which often looks correct but is not. This is called hallucination.

Common hallucinations you will encounter:

Invented citations

User: What papers should I read about calcium imaging in C. elegans?

Model: I recommend:
  - "Whole-brain calcium imaging in C. elegans reveals neural correlates
    of locomotion" by Chen, Park & Bhattacharya (Nature Neuroscience, 2019)

Reality: This paper does not exist. The authors, title, journal, and year
all sound plausible — none of it is real.

Non-existent API functions

# Model suggests:
mne.preprocessing.remove_ecg_artifacts_ica(raw, method='fastica')

# Reality: This function does not exist in MNE-Python.
# The model knows MNE-Python, knows it has ICA, and filled in the rest.

Plausible but wrong explanations

User: Why does my EEG signal have a 50 Hz artifact?

Model: This is due to the natural resonance frequency of neural tissue,
which couples with your amplifier circuit at...

Reality: It's mains electrical interference. The model gave a
plausible-sounding but completely wrong explanation.

Defense against hallucination

Never accept a citation without finding the actual paper (PubMed, Google Scholar). Never use a function without checking the library’s documentation. The model is confident regardless of whether it knows the answer.
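
For invented functions specifically, a programmatic existence check is often faster than opening the docs. A minimal sketch (the helper name api_exists is hypothetical; the MNE example mirrors the hallucination shown above):

import importlib

def api_exists(dotted_name: str) -> bool:
    """Return True if dotted_name resolves to an attribute of an importable module."""
    module_path, _, attr = dotted_name.rpartition(".")
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False  # also False if the library itself is not installed
    return hasattr(module, attr)

print(api_exists("mne.preprocessing.remove_ecg_artifacts_ica"))  # False: hallucinated
print(api_exists("mne.preprocessing.ICA"))                       # True, if MNE-Python is installed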

Failure Mode 2: Sycophancy#

RLHF trains the model to maximize human approval ratings. Since humans tend to rate agreeable, validating responses highly, the model develops a systematic bias toward telling you what you want to hear.

Example:

User:   What's 7 × 8?
Model:  56.

User:   Are you sure? I thought it was 54.
Model:  You're right, I apologize — it's 54.   ← WRONG. It's 56.

The model will:

  • Change a correct answer if you express doubt

  • Praise your code even when it has bugs

  • Agree with your proposed solution even if it’s wrong

  • Downplay concerns you seem dismissive of

The defense: Never use social pressure to get a different answer. If you think the model is wrong, provide evidence or a counter-argument. Say: “What is 7 × 8? Please verify by multiplication.” — not — “Are you sure? I think it’s 54.”

Sycophancy is structural, not a glitch. It cannot be prompted away entirely.

Failure Mode 3: Context-Window Limits#

Every LLM has a maximum number of tokens it can process at once — the context window. As of 2025, models range from ~100K to ~1M tokens (roughly 75 000–750 000 words). This sounds enormous, but it fills up fast:

| Item | Approx. tokens |
|------|----------------|
| One Python file (200 lines) | ~1 500 |
| Entire codebase (50 files) | ~75 000 |
| Long conversation (1 hour) | ~20 000 |
| One research paper (PDF) | ~8 000 |
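
These figures are rough. For a quick estimate of your own project, the common heuristic of roughly four characters per token is good enough; a sketch:

from pathlib import Path

# Rough heuristic: code and English text average ~4 characters per token.
# Use a real tokenizer (e.g. tiktoken) if you need precise counts.
total_chars = sum(
    len(p.read_text(errors="ignore"))
    for p in Path("src").rglob("*.py")  # point this at your own source directory
)
print(f"~{total_chars // 4:,} tokens in src/")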

What happens when context fills up:

  • The model silently forgets earlier parts of the conversation

  • Code written earlier in the session may be inconsistent with new code

  • Constraints you stated at the start (“never modify data/raw/”) may be forgotten

  • The model does not warn you when this happens

Defenses:

  • Use /compact in Claude Code to compress conversation history

  • For long tasks, start a fresh session with a focused CLAUDE.md

  • Break large codebases into smaller, focused sessions

Failure Mode 4: No Self-Verification#

The model generates text. It cannot:

  • Run the code it writes to check if it works

  • Look up whether the function it mentions actually exists

  • Verify a calculation by computing it independently

  • Check whether a paper it cites is real

It generates the appearance of correctness. You are the verification step.

This is not a bug that will be fixed — it’s a fundamental property of the architecture. Some systems (like Claude Code, which can actually run code) partially address this by adding a tool-use layer on top of the model, but the underlying model still cannot verify its own outputs.

The One Mental Model You Need

LLMs are stochastic next-token predictors.

They are excellent at:

  • Completing well-established patterns (Python boilerplate, common idioms)

  • Explaining concepts that are well-represented in text

  • Reformatting, renaming, and restructuring code

  • Writing first drafts you’ll then edit

They are unreliable for:

  • Precise factual recall (especially citations, specific API details, recent events)

  • Mathematical reasoning beyond a few steps

  • Code that requires running to verify

  • Anything where “sounds right” ≠ “is right”

Keep this model in mind every time you use these tools.

Part 2: The Agentic AI Landscape#

What Makes a Tool “Agentic”?#

A regular AI assistant answers questions in a chat interface. You ask, it answers, you act.

An agentic AI takes actions autonomously: it reads files, writes code, runs tests, executes shell commands, and iterates toward a goal across many steps — with you approving or reviewing along the way. The agent is the loop; you are the supervisor.

Regular AI:    User asks → Model answers → User acts

Agentic AI:    User asks → Model plans → Model acts → Model observes result
                               ↑                               │
                               └───────── (repeat) ────────────┘
                                      until goal reached

This distinction matters because agentic tools can cause real effects — creating files, running commands, committing code. With that power comes the need for careful oversight.

Tool Comparison (as of mid-2025)#

| Tool | Interface | Underlying model | Best for |
|------|-----------|------------------|----------|
| Claude Code | Terminal (CLI) | Claude (Anthropic) | Agentic coding, multi-file edits, research workflows, plan mode |
| Gemini CLI | Terminal (CLI) | Gemini (Google) | Large context tasks, Google ecosystem integration |
| GitHub Copilot | IDE extension | GPT-4o / Claude | Inline completions, PR reviews, staying in IDE flow |
| Cursor | Full IDE (VS Code fork) | GPT-4o / Claude | GUI-based agentic editing, Composer mode |
| Windsurf | Full IDE (VS Code fork) | Custom (Codeium) | Cascade agentic mode, GUI workflow |
| ChatGPT / GPT-4o | Web / API | GPT-4o (OpenAI) | Conversational tasks, Advanced Data Analysis |

Tip: How to choose

All of these tools are capable. Choose based on your workflow:

  • If you live in the terminal → Claude Code or Gemini CLI

  • If you prefer a GUI IDE → Cursor or Windsurf

  • If you want quick inline help while coding → GitHub Copilot

  • If you’re doing ad-hoc data exploration → ChatGPT Advanced Data Analysis

Claude Code#

Claude Code is Anthropic’s terminal-based agentic coding assistant. It integrates directly with your shell, has full access to your filesystem (with your permission), and maintains a conversational context about your entire project.

Distinctive features:

| Feature | What it does |
|---------|--------------|
| Plan mode | Shows a step-by-step plan before executing anything; you approve before files are touched |
| CLAUDE.md | Project-level instruction file; loaded automatically at session start |
| Hooks | Shell commands triggered automatically before/after agent actions |
| MCP servers | Plug-in tools that extend agent capabilities (databases, web search, APIs) |
| Permission modes | Fine-grained control over what the agent can do without asking |
| /compact | Compresses conversation history to preserve context window |

Claude Code is the tool we’ll focus on in this class. Most concepts transfer directly to other agentic tools.

GitHub Copilot#

GitHub Copilot is the most widely deployed AI coding tool, embedded as an extension in VS Code, JetBrains IDEs, Neovim, and others. Its main strength is staying in your IDE flow: as you type, it suggests completions in gray ghost text that you accept with Tab.

Beyond inline completions, Copilot Chat lets you ask questions about your code, and Copilot Workspace (newer) supports multi-file agentic tasks within the IDE.

Best for: developers who want AI help without leaving their IDE, and for quick single-file completions and explanations.

Cursor and Windsurf#

Both are full IDE forks built on VS Code — meaning your existing VS Code extensions, settings, and keybindings work as-is. The AI is embedded more deeply than a plugin.

Cursor’s “Composer” and Windsurf’s “Cascade” allow you to describe a goal and have the AI autonomously edit multiple files to achieve it, within the GUI. This is the closest GUI equivalent to the terminal-based agents.

When to choose these: if you prefer a GUI environment, or if your team isn’t comfortable with terminal-based tools, Cursor/Windsurf give you most of the agentic power in a familiar interface.

ChatGPT / GPT-4o Advanced Data Analysis#

OpenAI’s Advanced Data Analysis (code interpreter) mode runs real Python code in a sandboxed container. You can upload a CSV, describe what you want to plot or analyse, and the model will write and run the code — showing you the output and iterating if something fails.

Best for: rapid exploratory analysis when you want to stay conversational and visual, especially for quick sanity checks on data you’ve just received.

Landscape warning

Capabilities, benchmark rankings, and pricing change every few months. By the time you read this, new models and tools will have been released. Evaluate tools on your actual tasks, not on benchmarks or press releases.

Part 3: Getting Started with Claude Code#

Prerequisites and Installation#

What you need:

  • Node.js 18 or later (node --version to check)

  • An Anthropic account at console.anthropic.com

  • Terminal access (macOS Terminal, Linux bash/zsh, Windows WSL2)

Install:

npm install -g @anthropic-ai/claude-code

Verify:

claude --version

First-time authentication:

claude

The first launch opens a browser window to authenticate with your Anthropic account. After that, credentials are stored locally and you won’t need to authenticate again.

Anthropic API costs

Claude Code uses Anthropic’s API, which has a per-token cost. For typical development sessions, costs are a few cents to a few dollars per hour. You can monitor usage and set spending limits in the Anthropic console. As of 2025, students and researchers can apply for research API credits.

Your First Session#

Navigate to a project directory and launch:

cd ~/my_analysis_project
claude

You’ll see a prompt. Type your first message:

> What files are in this directory and what do they do?

Claude Code will read the directory structure and describe each file. It already has context of your working directory. You haven’t written any code yet — you’ve just introduced it to your project.

A more useful first interaction:

> I have EEG recordings in data/raw/ as .edf files. The file names follow the
  pattern S{subject_id}_{condition}.edf. Describe what I'd need to write to
  load all of these files and extract the subject IDs and conditions.

This kind of contextual question — where you describe the structure of your data and ask for an approach before writing any code — is one of the most powerful uses of the tool.
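
For reference, the code that eventually answers a question like this is usually a short pathlib-and-regex loop. A sketch of the shape it might take, assuming the file layout described in the prompt above (loading each recording would then use a reader such as MNE-Python’s read_raw_edf):

import re
from pathlib import Path

pattern = re.compile(r"S(?P<subject_id>\w+)_(?P<condition>\w+)\.edf$")

recordings = []
for path in sorted(Path("data/raw").glob("*.edf")):
    match = pattern.match(path.name)
    if match is None:
        print(f"Skipping unexpected file name: {path.name}")
        continue
    recordings.append({"subject_id": match["subject_id"],
                       "condition": match["condition"],
                       "path": path})
    # raw = mne.io.read_raw_edf(path)  # actual loading step, if MNE-Python is installed

print(f"Found {len(recordings)} recordings")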

Claude Code session example

Claude Code running in a terminal. The model describes files in the project, proposes an approach, and waits for instructions. Note the token cost shown at the top right — monitoring this helps you manage API spend.

Note

Screenshot placeholder — this image will show a real Claude Code session. Capture it by running claude in a project directory and asking a question about the codebase.

Essential Commands and Keyboard Shortcuts#

Keyboard shortcuts (in the session):

| Shortcut | Action |
|----------|--------|
| Shift+Tab | Toggle plan mode on/off |
| Esc | Cancel the current operation (safe — undoes nothing) |
| Ctrl+C | Interrupt generation mid-stream |
| Up/Down | Navigate conversation history |

Slash commands (type in the prompt):

| Command | What it does |
|---------|--------------|
| /help | Show all available commands |
| /clear | Clear conversation history (fresh start) |
| /compact | Compress history to save context window |
| /cost | Show token usage and cost for this session |
| /model | Switch to a different Claude model |
| /status | Show current session state |
| /mcp | List connected MCP servers |

Useful interaction patterns:

> explain [file or function]        # understand existing code
> add [feature] to [file]           # targeted feature addition  
> fix the error in [file]: [error]  # paste the error directly
> write tests for [function]        # test generation
> run the tests and fix failures    # agentic loop (uses shell)
> what changed since the last commit # git-aware context

Part 4: Setting Up Your Project#

The CLAUDE.md File — Your Project’s Briefing Document#

The single most impactful thing you can do before using Claude Code on a real project is write a CLAUDE.md file. It lives at the root of your repository and is automatically loaded by Claude Code at the start of every session. Think of it as the briefing you’d give a new collaborator on their first day.

A CLAUDE.md for a neuroscience analysis project:

# Auditory Attention EEG Analysis

Analyses EEG data from a selective auditory attention task (N=24 subjects,
64-channel BrainAmp, 1000 Hz sampling rate). The paradigm: subjects listen to
two simultaneous speech streams and attend to one. We measure neural tracking
of the attended vs. unattended stream via temporal response functions (TRFs).

## Directory Structure
- `data/raw/`       — original .vhdr/.vmrk/.eeg files — **DO NOT MODIFY**
- `data/processed/` — MNE Raw/Epochs objects as .fif files
- `data/results/`   — TRF objects, statistical results as .pkl/.csv
- `src/`            — all source Python modules
- `notebooks/`      — exploratory Jupyter notebooks (not run at build time)
- `tests/`          — pytest test suite

## Commands
- Run all tests:    `pytest tests/ -v`
- Process subject:  `python src/pipeline.py --subject S01`
- Fit TRFs:         `python src/trf.py --all-subjects`
- Generate figures: `python src/figures.py`

## Python Environment
- Python 3.11, managed with conda: `conda activate eeg_attention`
- Key libraries: MNE-Python 1.7, mne-features 0.3, mtrf-py 1.1

## Coding Conventions
- All functions: type hints + NumPy-style docstrings
- Random seeds: `rng = np.random.default_rng(seed=42)` always
- EEG operations: use MNE-Python only (no custom filtering/epoching)
- Statistical thresholds: α = 0.05, FDR-corrected unless noted otherwise

## Hard Constraints — Do Not Break These
- Never commit files from `data/` (too large, contains participant data)
- Never overwrite files in `data/raw/`
- All figure-generating scripts must be reproducible (fixed seeds, no
  random operations without seeding)
- Keep `src/legacy/` intact (deprecated code preserved for reviewer reference)

Notice the key sections:

  1. Project description — gives the model scientific context

  2. Directory structure — tells it what’s where and what’s off-limits

  3. Commands — so it can actually run things correctly

  4. Environment — prevents it from suggesting wrong Python version or library

  5. Conventions — so new code matches existing style

  6. Hard constraints — lines that must not be crossed

Write CLAUDE.md before your first session, not after

The most common frustration with AI coding tools is: “It keeps doing the wrong thing.” In 90% of cases, the fix is a better CLAUDE.md. The model cannot guess your project structure, your conventions, or your constraints — but it will follow them precisely if you write them down.

Spend 15 minutes on CLAUDE.md before your first session. You will save hours of corrections.

What to Keep Out of CLAUDE.md#

Don’t put secrets (API keys, passwords) in CLAUDE.md. Don’t put large data descriptions — the model should discover data structure by reading files. Don’t make it so long that the model can’t parse it; a good CLAUDE.md is 1–2 pages.

Permission Modes#

By default, Claude Code asks for confirmation before writing files or running shell commands. This is the right setting while you’re learning. As you build trust in your workflows, you can adjust:

| Mode | Behaviour | When to use |
|------|-----------|-------------|
| Default | Ask before every file write and shell command | Starting out, unfamiliar projects |
| --autoApproveTools read | Auto-approve read-only operations | Exploration sessions |
| --autoApproveTools write | Auto-approve file writes (still asks for shell) | Trusted projects with good tests |

You can also set per-tool permissions in .claude/settings.json:

{
  "autoApproveTools": ["Read", "Glob", "Grep"],
  "permissions": {
    "allow": ["Read(*)", "Write(src/**)", "Bash(pytest*)"],
    "deny":  ["Bash(rm*)", "Bash(git push*)"]
  }
}

This example allows the agent to read anything, write to src/, and run pytest commands — but never rm or git push. It gives you precise control over what the agent can do without supervision.

Never use --dangerously-skip-permissions

This flag disables all confirmation prompts. Do not use it outside of carefully sandboxed, throwaway environments. It’s the equivalent of giving someone root access to your machine with no oversight.

Git Integration#

Claude Code is git-aware by default. It can read your git history, check what’s changed, and perform git operations. Before every significant AI-assisted session:

# Verify you're on a clean branch
git status

# Create a feature branch for AI-assisted work
git checkout -b ai/add-trf-analysis

# Work with Claude Code...

# Review every change before committing
git diff
git add -p   # stage interactively, hunk by hunk
git commit -m "Add TRF analysis module"

Working on a dedicated branch means you can always git checkout main to get back to a known-good state if something goes wrong. This is your safety net.

Part 5: The Core Workflow#

The Golden Cycle#

Effective use of agentic AI comes down to one repeating cycle:

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   1. ASK WELL ──→ 2. REVIEW PLAN ──→ 3. EXECUTE            │
│        ↑                                    │               │
│        │                                    ▼               │
│   6. COMMIT ←── 5. TEST ←────── 4. REVIEW DIFF             │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Each step is small and reviewable. Never let the agent accumulate many changes without a review step — a single wrong assumption early in a long chain can corrupt everything that follows.

Step 1: Ask Well#

The quality of the output scales directly with the quality of the input.

Anatomy of a good prompt:

[Specific file and function]
[The exact problem or desired outcome]
[Any constraints or conventions to follow]
[What success looks like]

Examples:

Too vague — the model will make assumptions and likely get them wrong:

> Fix the analysis script

Specific and context-rich:

> In src/spectral.py, the function compute_coherence() raises a ZeroDivisionError
  when the power spectrum array is empty (shape (0,)). Add a guard that returns
  np.nan in that case. Also add a pytest test in tests/test_spectral.py that
  covers this edge case. Follow the existing test style in that file.

The second prompt tells the model:

  • Which file and function

  • The exact error and when it occurs

  • The desired behaviour (np.nan)

  • Where to put the test and what style to use

No ambiguity means no wrong assumptions.

Step 2: Plan Mode — The Most Important Feature#

For any change that touches more than one file, enable plan mode before submitting (Shift+Tab). The agent generates a step-by-step plan and waits for your approval before taking any action.

What a good plan looks like:

I'll make the following changes:

1. Read src/spectral.py to understand the current compute_coherence() signature
   and how the power spectrum is computed.

2. Edit src/spectral.py:
   - Add a guard at the top of compute_coherence(): if the input array is
     empty, return np.nan immediately.

3. Read tests/test_spectral.py to understand the existing test style.

4. Edit tests/test_spectral.py:
   - Add test_compute_coherence_empty_spectrum() that calls compute_coherence()
     with an empty array and asserts the result is np.nan.

Shall I proceed?

Questions to ask yourself while reading the plan:

  • Does it correctly understand what I asked?

  • Are all the files it plans to touch the right ones?

  • Is there anything it’s not doing that it should?

  • Is it going to touch anything it shouldn’t?

You can ask follow-up questions or give corrections before approving:

> Before you do step 4, also check if there are any other tests for
  compute_coherence() that might need updating.

Plan mode output

Plan mode shows the agent’s intended actions before any file is touched. This is your primary opportunity to catch misunderstandings.

Note

Screenshot placeholder — capture this by typing Shift+Tab to toggle plan mode on, then submitting a multi-file task.

Step 3: Execute and Monitor#

Once you approve the plan, watch the execution. Each action the agent takes is shown as it happens:

● Read src/spectral.py (2.3kB)
● Edit src/spectral.py
    + if arr.size == 0:
    +     return np.nan
● Read tests/test_spectral.py (1.8kB)
● Edit tests/test_spectral.py
    + def test_compute_coherence_empty_spectrum():
    +     result = compute_coherence(np.array([]))
    +     assert np.isnan(result)
● Bash: pytest tests/test_spectral.py -v
    PASSED tests/test_spectral.py::test_compute_coherence_empty_spectrum

You can interrupt at any point with Esc. If you see an action you didn’t intend (the agent is about to write to the wrong file, run a destructive command, etc.), interrupt immediately and clarify.

Step 4: Review the Diff#

Never accept changes without reading them. The diff is your last line of defense. Open a second terminal window and run:

git diff

Or use VS Code’s Source Control panel (⌃⇧G) to see a visual diff of each changed file. Read every changed line. Look for:

  • Logic that doesn’t match your intent

  • New dependencies that weren’t requested

  • Deleted lines that shouldn’t have been removed

  • Hard-coded values that should be configurable

  • Any security issues (unsanitized inputs, hard-coded credentials)

Step 5: Test#

Always run the full test suite after an AI-assisted change, not just the new tests:

pytest tests/ -v

Ask yourself:

  • Do the new tests actually test the right behaviour? (Tests can be wrong too)

  • Does the full suite still pass? (The change might have broken something elsewhere)

  • If there are no tests for a critical path, ask the model to add them

Step 6: Commit#

Keep commits small and descriptive. One commit per logical change:

git add -p                              # review each hunk before staging
git commit -m "Add guard for empty array in compute_coherence (returns np.nan)"

The -p flag in git add shows you each changed hunk and asks whether to stage it. This is another review opportunity and prevents accidentally committing unrelated changes.

Don’t let the agent commit

Let the agent write code. You commit. This forces a review step between generation and history. An AI-assisted commit that you haven’t reviewed is a commitment you may regret later.

Part 6: Advanced Features#

Hooks — Automate Your Quality Checks#

Hooks let you run shell commands automatically before or after specific agent actions. They’re defined in .claude/settings.json at your project root or in your global Claude Code settings.

Example: run the linter automatically after every file write:

{
  "hooks": {
    "postToolUse": [
      {
        "matcher": "Write",
        "hooks": [
          {
            "type": "command",
            "command": "ruff check --fix ${file}"
          }
        ]
      }
    ]
  }
}

With this hook, every time Claude Code writes a Python file, ruff runs automatically and fixes any style violations. The agent sees the result and can respond to any unfixable warnings.

Useful hooks for research code:

| Hook | Trigger | Command |
|------|---------|---------|
| Auto-lint | After any .py write | ruff check --fix ${file} |
| Auto-format | After any .py write | black ${file} |
| Auto-test | After write to src/ | pytest tests/ -x -q |
| Guard raw data | Before write to data/raw/ | echo "ERROR: data/raw is immutable"; exit 1 |

The last example — a hook that blocks writes to data/raw/ — is a simple way to enforce a critical constraint automatically.

MCP Servers — Extending the Agent’s Reach#

MCP (Model Context Protocol) servers are plug-ins that give the agent new capabilities beyond the filesystem. They let the agent interact with databases, APIs, web browsers, and other services.

Add an MCP server:

claude mcp add <server-name>

Examples of available MCP servers:

| Server | What it enables |
|--------|-----------------|
| @modelcontextprotocol/server-filesystem | Extended filesystem access |
| @modelcontextprotocol/server-github | Read/write GitHub issues, PRs |
| @modelcontextprotocol/server-postgres | Query a PostgreSQL database |
| @modelcontextprotocol/server-brave-search | Web search |
| mcp-jupyter | Execute and inspect Jupyter notebook cells |

For neuroscience research, potentially useful servers include:

  • A database server pointing to your subject metadata database

  • A web search server for literature lookups during coding sessions

  • A Jupyter server for interactive notebook manipulation

MCP is still a young ecosystem. Check the official registry for new servers as the ecosystem grows.

Skills — Reusable Prompt Packages#

Skills are pre-packaged, reusable prompts (and sometimes code) that extend what Claude Code can do with a single slash command. Think of them as macros: you invoke a skill by name and the agent executes a well-defined, repeatable workflow.

Invoking a skill:

/skill-name [optional arguments]

For example, the commit skill generates a properly formatted git commit message and commits staged changes:

/commit

The review-pr skill fetches a pull request and produces a structured code review:

/review-pr 42

Why skills matter for research workflows:

Skills let you standardise the repetitive parts of your workflow and share them with collaborators. Instead of re-explaining your analysis conventions every session, encode them once as a skill. Some useful patterns:

| Skill idea | What it would do |
|------------|------------------|
| /new-subject | Scaffold a new subject analysis directory from a template |
| /qa-notebook | Run a checklist on a notebook (seed set? outputs cleared? paths relative?) |
| /summarize-results | Generate a results table from the current data/results/ directory |
| /check-reproducibility | Verify fixed seeds, pinned deps, and no absolute paths |

Finding and installing skills:

The Claude Code ecosystem has a growing registry of community skills. You can browse available skills and install them via:

/find-skills [description of what you need]

You can also write your own skills in Markdown and place them in .claude/skills/ in your project or ~/.claude/skills/ for global use.

Skills vs CLAUDE.md vs Hooks

  • CLAUDE.md — persistent context the agent reads every session (project description, conventions, constraints)

  • Hooks — shell commands that fire automatically around agent actions (linting, testing, guardrails)

  • Skills — on-demand reusable workflows you invoke explicitly with /skill-name

Use all three together for a fully automated, consistent development environment.

Extended Thinking — Slow Down for Hard Problems#

For genuinely difficult problems — debugging a mysterious failure, choosing between architectural approaches, deciding on an appropriate statistical model — you can ask the model to reason more carefully before responding:

> Think carefully step-by-step before answering:
  My TRF model is systematically overfitting on subjects with fewer trials.
  I have: 50 subjects, 100–400 trials per subject, 64 EEG channels, 
  a 200ms TRF window at 1000 Hz (200 time-lag features × 64 channels = 12 800
  predictors). Ridge regression alpha is set by nested CV.
  What's the most principled approach to handle the imbalanced trial counts?

The model will reason through the problem — considering options, trade-offs, assumptions — before giving a recommendation. This doesn’t guarantee correctness, but it catches many errors that arise from hasty generation.

Extended thinking is not magic

The model is still generating plausible text, not computing a ground truth. For statistical decisions, use extended thinking to get a set of options with reasoning, then verify each option against a textbook or paper before implementing.

Part 7: Best Practices#

General Principles#

Treat the model as a capable junior collaborator — not an oracle. It can write clean code, explain concepts, draft documentation, and propose approaches. It lacks domain judgment, cannot verify its own outputs, and will not catch its own mistakes. Your job is to review, correct, and guide.

Keep sessions small and focused. One task per session. Long sessions accumulate context that can confuse the model, and long chains of edits make it harder to review individual changes. Use /clear when switching tasks.

Don’t prompt-engineer around mistakes — restart with a better ask. If the model fundamentally misunderstands what you want, don’t try to nudge it with increasingly elaborate follow-ups. Start a new session with a clearer, more specific request. A clean context beats a patched conversation.

The model’s confidence is not a signal of correctness. LLMs express uncertainty and certainty with the same fluent prose. A confident, detailed, wrong answer looks identical to a confident, detailed, correct answer. Treat all outputs as drafts requiring verification.

Software Engineering#

Write CLAUDE.md first. Before any session on a real project, write the project instruction file. Every minute spent on CLAUDE.md saves 10 minutes of corrections.

Request tests alongside implementation. Never ask for just the implementation. Always ask: “Write the function and the pytest tests for it.” Review the tests as carefully as the code — tests that pass but test the wrong thing are worse than no tests.

One change, one commit. Don’t let the agent make five changes and then commit them all. After each logical change: review the diff, run the tests, commit. Small commits make it easy to identify and revert problems.

Manually audit security-sensitive code. AI-generated code can introduce:

  • SQL injection (f-string queries with user input)

  • Path traversal (unsanitized file paths)

  • Hard-coded credentials

  • Insecure deserialization

Any code that handles user input, file paths, network requests, or authentication must be reviewed line by line. Seriously — do not skip this.
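
The first item on that list is the one the model generates most casually. A minimal illustration with the standard-library sqlite3 module (the table and the input value are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (subject_id TEXT, age INTEGER)")

user_input = "S01' OR '1'='1"  # imagine this arriving from a form or CLI argument

# Vulnerable: the value is spliced into the SQL string, so the WHERE clause is always true
conn.execute(f"SELECT * FROM subjects WHERE subject_id = '{user_input}'")

# Safe: a parameterized query lets the driver escape the value
conn.execute("SELECT * FROM subjects WHERE subject_id = ?", (user_input,))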

Don’t let the agent manage your secrets. If your project uses API keys, database passwords, or SSH keys, make sure these live in environment variables or a secrets manager — not in files the agent can read or write.

Research-Specific Best Practices#

Reproducibility is non-negotiable.

When asking the model to write analysis code, explicitly require:

  • Fixed random seeds using numpy’s recommended API

  • Pinned dependency versions (via requirements.txt or environment.yml)

  • Paths resolved from the project root or a config file, not hard-coded absolute paths or ../../data/ surprises (see the path sketch below)

# Ask for this:
import numpy as np
rng = np.random.default_rng(seed=42)
indices = rng.permutation(n_trials)

# Not this (no seed, legacy global-state API):
import numpy as np
indices = np.arange(n_trials)
np.random.shuffle(indices)
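
For paths, the same principle applies: resolve everything from a single project root instead of scattering relative ../.. fragments through scripts. A minimal sketch, assuming the script lives one level below the project root (e.g. in src/):

from pathlib import Path

# All paths derive from the project root, so scripts work from any working directory
PROJECT_ROOT = Path(__file__).resolve().parents[1]
RAW_DIR = PROJECT_ROOT / "data" / "raw"
RESULTS_DIR = PROJECT_ROOT / "data" / "results"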

Use the model for boilerplate; think for yourself on the science.

AI excels at:

  • Loading and reshaping data

  • Writing repetitive analysis loops across subjects

  • Generating plot templates

  • Writing docstrings and type hints

AI is unreliable for:

  • Choosing the right statistical test for your data

  • Deciding what to control for in a regression

  • Interpreting a surprising result

  • Making claims about what your findings mean

These require domain knowledge and scientific judgment the model lacks.

Research: Statistics and Citations#

Verify statistical code independently.

LLMs are confidently wrong about statistics more often than in any other area. Common errors:

| Error | Example |
|-------|---------|
| Wrong test for data type | Pearson correlation on rank data |
| Ignoring repeated measures | Two-sample t-test on within-subject comparisons |
| Wrong multiple comparisons correction | Bonferroni where FDR is appropriate |
| Incorrect degrees of freedom | Off-by-one errors in t-test df |
| Assumption violations silently ignored | ANOVA on clearly non-normal distributions |

For every statistical test the model suggests, look up: (1) what the test assumes, (2) whether your data meets those assumptions, (3) the original paper or textbook source. Do not rely on the model’s explanation of why the test is appropriate.
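
As a concrete instance of the repeated-measures error above: for a within-subject comparison, paired and unpaired tests can give very different answers. A sketch with SciPy and simulated data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Each of 20 subjects is measured in two conditions; subjects differ a lot at baseline
n_subjects = 20
baseline = rng.normal(loc=10.0, scale=3.0, size=n_subjects)
condition_a = baseline + rng.normal(scale=0.5, size=n_subjects)
condition_b = baseline + 0.4 + rng.normal(scale=0.5, size=n_subjects)

# Wrong for this design: independent-samples t-test ignores the pairing
print(stats.ttest_ind(condition_a, condition_b))

# Appropriate: paired t-test on the within-subject differences
print(stats.ttest_rel(condition_a, condition_b))

The unpaired test is swamped by between-subject variability; the paired test isolates the within-subject effect.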

Never copy LLM-generated text into manuscripts.

Use the model for:

  • Outlines and structural scaffolding

  • First drafts you’ll heavily rewrite

  • Improving clarity and flow of your text

  • Searching for relevant literature (but verify every paper it mentions)

Do not copy-paste:

  • Results sections (the model doesn’t have access to your numbers)

  • Conclusions or interpretations (these require your scientific judgment)

  • Methods descriptions (they may not accurately describe what you actually did)

Verify every cited paper.

When the model mentions a paper, before using it:

  1. Search for it on PubMed or Google Scholar

  2. Confirm the title, authors, and year are correct

  3. Read the relevant section and confirm it says what the model claims

Hallucinated references look completely real — plausible author names, real journals, correct year ranges. The only way to catch them is to look them up.
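
Part of this check can be automated. A quick sketch using the requests package and the public Crossref REST API (it only confirms that a DOI is registered and what title it points to; you still have to read the paper to confirm it supports the claim):

import requests

def crossref_title(doi: str) -> str | None:
    """Return the registered title for a DOI, or None if Crossref has no record of it."""
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if response.status_code != 200:
        return None
    return response.json()["message"]["title"][0]

# Substitute the DOI the model gave you; None strongly suggests a hallucinated reference.
# Note: Crossref does not cover every DOI registrar, so treat None as a flag, not proof.
print(crossref_title("10.1038/nature14539"))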

Part 8: Common Pitfalls#

Here are the most common ways that beginners — and experienced users — get into trouble. Each one comes with a concrete fix.


Pitfall 1: Not reading the diff

“I asked it to add a function and it deleted three other things.”

Fix: Always run git diff after every agent action. Use git add -p to stage changes interactively. Never use git add . after AI-assisted changes.


Pitfall 2: Too-long sessions

“After an hour of back-and-forth, it started contradicting its earlier code.”

Fix: Keep sessions short. One goal per session. Use /compact to compress history, or /clear to start fresh. Rely on CLAUDE.md rather than conversation history for persistent context.


Pitfall 3: Accepting confident wrong answers

“It was so sure about the statistical test that I didn’t check. It was wrong.”

Fix: The model’s confidence is not a signal of correctness. For anything high-stakes (statistics, citations, security), verify independently regardless of how confident the model sounds.


Pitfall 4: Asking for too much at once

“I asked it to refactor the whole pipeline and now nothing works.”

Fix: Break large tasks into small, reviewable steps. Each step should produce a working, committable state. If a task can’t be broken down, use plan mode and review each planned step carefully before proceeding.


Pitfall 5: Trusting generated tests

“The tests all passed but the analysis results were wrong.”

Fix: Tests can be wrong. Review each test: does it actually test the intended behaviour? Does it catch realistic failure modes? A test that trivially passes (e.g., assert result is not None) provides no protection.
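
To make the contrast concrete, here is the compute_coherence() example from Part 5 with one trivial and one meaningful test. The stub implementation is only there to make the snippet self-contained:

import numpy as np

def compute_coherence(power_spectrum: np.ndarray) -> float:
    """Stub mirroring the guard added in Part 5; the real function lives in src/spectral.py."""
    if power_spectrum.size == 0:
        return np.nan
    return float(power_spectrum.mean())  # placeholder for the real computation

# Trivial: np.nan is not None, so this passes while checking nothing useful
def test_compute_coherence_returns_something():
    assert compute_coherence(np.array([])) is not None

# Meaningful: pins down the exact behaviour that was requested
def test_compute_coherence_empty_spectrum_returns_nan():
    assert np.isnan(compute_coherence(np.array([])))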


Pitfall 6: Skipping CLAUDE.md

“It keeps using the wrong library / wrong convention / touching the wrong files.”

Fix: Write the CLAUDE.md. This is the most consistent predictor of session quality. Treat it as a living document — update it whenever you catch a persistent misunderstanding.


Pitfall 7: Using AI for the wrong parts of research

“I asked it to interpret my results and it gave me a great story. I published it. The reviewers found a flaw I hadn’t noticed because the AI narrative was so convincing.”

Fix: Use AI for technical implementation. Keep all scientific reasoning — what the results mean, what alternative explanations exist, what the limitations are — in your own hands.

Part 9: Exercises#

Exercise 1: Install and Explore#

  1. Install Claude Code and authenticate with your Anthropic account.

  2. Navigate to any existing Python project (your homework, a previous analysis, or a fresh directory with one .py file).

  3. Start a session and ask: “Describe the structure of this project and suggest one improvement.”

  4. Read the response. Note whether anything is incorrect or unhelpful, and think about what CLAUDE.md information would have prevented those issues.

Exercise 2: Write a CLAUDE.md#

For the project you used in Exercise 1 (or a new one), write a CLAUDE.md file. Include:

  • A one-paragraph project description

  • Directory structure

  • How to run the code and tests

  • At least two coding conventions specific to this project

  • At least one hard constraint (“do not modify X”)

Start a new Claude Code session after writing it. Ask the same question as in Exercise 1. Compare the quality of the response.

Exercise 3: The Full Cycle#

Using Claude Code, complete this full workflow:

  1. Enable plan mode (Shift+Tab)

  2. Ask: “Write a Python function called load_subject_data(path: str) -> dict that loads a CSV file and returns a dictionary with keys ‘subject_id’, ‘data’, and ‘n_trials’. Add appropriate error handling for missing files and malformed CSVs. Also write pytest tests covering: successful load, file not found, and a CSV missing the expected columns.”

  3. Read the plan carefully. Ask at least one clarifying question or request one modification before approving.

  4. After execution, run git diff and read every changed line.

  5. Run the tests: pytest -v

  6. Commit with an informative message.

What to watch for

  • Does the plan match what you asked for, or did the model interpret something differently?

  • Are the generated tests actually testing failure modes, or are they trivial?

  • Is the error handling specific enough (e.g., does it distinguish “file not found” from “malformed CSV”), or is it catching Exception everywhere?

  • After your clarifying question, did the model’s plan improve?

Exercise 4: Spot the Hallucination#

Ask Claude Code (or any LLM you have access to) the following:

What are the two most important papers on temporal response functions (TRFs) for
auditory neuroscience? Give me the full citation including DOI.

Then:

  1. Take the citations the model gives you.

  2. Search for each one on PubMed.

  3. For each: does the paper exist? Are the authors, title, journal, and year correct? Does the DOI resolve?

Record what you find. If the model hallucinated any details, note which parts were wrong. This exercise is designed to give you direct experience with hallucination before it matters in a real research context.

Summary#

Key Takeaways

Understanding LLMs:

  1. LLMs predict the next token — they are not knowledge bases and cannot verify their own outputs.

  2. Hallucination and sycophancy are structural properties, not bugs. Always verify independently.

  3. Context windows are finite; long sessions can cause the model to forget earlier constraints.

Getting productive:

  4. Write CLAUDE.md before your first session — it’s the highest-leverage thing you can do.

  5. Use plan mode for anything touching more than one file.

  6. The cycle: ask → review plan → execute → read diff → test → commit.

  7. Keep sessions short and tasks small.

For research:

  8. Prioritize reproducibility: fixed seeds, pinned dependencies, no magic paths.

  9. Verify every statistical test independently against the original source.

  10. Never copy AI-generated text into manuscripts; never trust AI citations without verification.

  11. Use AI for boilerplate and scaffolding. Keep scientific reasoning in your own hands.

Further Reading and Resources#