Weekend Project: The Self-Correcting Coder
Notebook: week2_reasoning/project_self_correcting_coder.ipynb
The Big Question
Can a smaller, cheaper model outperform a giant model by "thinking" more?
In this project, we put this hypothesis to the test by building a Self-Correcting Coding Agent that doesn't just write code once—it runs, tests, and fixes its own code iteratively.
Project Overview
We compare three approaches:
| Approach | Model | Strategy |
|---|---|---|
| Naive GPT-4o | gpt-4o | One-shot: "Write this code" |
| Agentic GPT-4o-mini | gpt-4o-mini | Reflection: "Write, Test, Fix, Repeat" |
| Agentic Llama-3 | llama-3-8b | Same pattern with an open-weights model |
Architecture: Reflection Pattern for Code
The self-correcting coder applies the reflection pattern specifically to code generation: write code, execute it against known tests, feed any failure back to the model, and repeat until the tests pass or the retry budget runs out.
The Challenge Problem
To see the difference, we need a problem that is:
- Easy to get wrong on the first try
- Easy to verify (clear pass/fail tests)
Problem: Basic Calculator II
Implement a calculator to evaluate simple expressions like "3+2*2".
Constraints:
- Contains only non-negative integers, the operators `+`, `-`, `*`, `/`, and spaces
- Integer division truncates toward zero
- DO NOT use `eval()`: implement the parsing logic manually
- Handle operator precedence (`*` and `/` before `+` and `-`)
Why is this hard? Naive implementations often:
- Execute left-to-right (ignoring precedence)
- Mess up multi-digit number parsing
- Handle edge cases incorrectly
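For reference, here is one well-known way to get it right (a minimal sketch I'm adding, not the agent's output): a single left-to-right pass with a stack, resolving `*` and `/` immediately and deferring `+` and `-` until the end.

```python
def calculate(s: str) -> int:
    # Reference sketch: stack-based evaluation with operator precedence
    stack, num, op = [], 0, '+'
    for i, ch in enumerate(s):
        if ch.isdigit():
            num = num * 10 + int(ch)          # accumulate multi-digit numbers
        if ch in '+-*/' or i == len(s) - 1:   # flush on an operator or at end of input
            if op == '+':
                stack.append(num)
            elif op == '-':
                stack.append(-num)
            elif op == '*':
                stack.append(stack.pop() * num)
            else:  # '/': integer division truncating toward zero
                stack.append(int(stack.pop() / num))
            op, num = ch, 0
    return sum(stack)
```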
Test Cases
```python
TEST_CASES = [
    ("3+2*2", 7),        # Not 10! Precedence matters
    (" 3/2 ", 1),        # Truncate toward zero
    (" 3+5 / 2 ", 5),    # Spaces and mixed operators
    ("42", 42),          # Single number
    ("100*2+12", 212),   # Multi-digit numbers
    ("1-1+1", 1),        # Left-to-right for same precedence
    ("2*3*4", 24),       # Chained operators
]
```
Implementation
Step 1: The Naive Approach
One-shot generation with the strongest model:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def naive_solve(model_name, problem):
    """One-shot generation: ask once, take whatever comes back."""
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a Python expert. Output ONLY code."},
            {"role": "user", "content": problem},
        ],
    )
    return response.choices[0].message.content

# Run
code_gpt4o = naive_solve("gpt-4o", PROBLEM_DESCRIPTION)
```
Step 2: Testing Infrastructure
Execute code and verify against test cases:
```python
def run_and_test(code_str, test_cases):
    try:
        # Execute in an isolated namespace. A single dict is used for globals so that
        # any helper functions the model defines can still see each other.
        env = {}
        exec(code_str, env)
        solve_func = env.get("calculate")
        if not solve_func:
            return False, "Function 'calculate' not found."
        # Run test cases
        for inp, expected in test_cases:
            result = solve_func(inp)
            if result != expected:
                return False, f"Failed for '{inp}'. Expected {expected}, Got {result}"
        return True, "All tests passed!"
    except Exception as e:
        return False, f"Error: {e}"
```
Security Note: Only run `exec()` on code generated by your own agent, and only in a sandboxed environment. Never execute untrusted code.
Step 3: The Agentic Approach
Self-correcting loop with a smaller model:
```python
def clean_code(text):
    # Minimal version of the clean_code helper assumed by the loop below:
    # drop markdown code-fence lines that chat models often wrap code in.
    fence = "`" * 3
    lines = [ln for ln in text.splitlines() if not ln.strip().startswith(fence)]
    return "\n".join(lines).strip()

def agentic_solve(model_name, problem, max_retries=3):
    history = [
        {"role": "system", "content": "You are a Python expert. Output ONLY valid Python code. DO NOT USE eval()."},
        {"role": "user", "content": problem},
    ]
    for i in range(max_retries):
        print(f"🔄 Attempt {i+1} ({model_name})...")
        # 1. Generate
        response = client.chat.completions.create(
            model=model_name,
            messages=history,
        )
        code = clean_code(response.choices[0].message.content)
        # 2. Test
        success, feedback = run_and_test(code, TEST_CASES)
        print(f"   Result: {feedback}")
        if success:
            return code, i + 1  # Return the working code and the attempt count
        # 3. Reflect: add the failing code and the error back into the conversation
        history.append({"role": "assistant", "content": code})
        history.append({
            "role": "user",
            "content": f"Your code failed with: {feedback}\n\nPlease fix it.",
        })
    return None, max_retries
```
The Showdown: Results
Naive GPT-4o (One-Shot)
```
--- Naive GPT-4o ---
Pass? True (All tests passed!)
```
The powerful model often gets it right in one try... but not always.
Agentic GPT-4o-mini (Self-Correcting)
```
--- Agentic GPT-4o-mini ---
🔄 Attempt 1 (gpt-4o-mini)...
   Result: Failed for '3+2*2'. Expected 7, Got 10
🔄 Attempt 2 (gpt-4o-mini)...
   Result: All tests passed!
✅ Solved in 2 attempts!
```
The smaller model gets it wrong initially (ignoring precedence), but fixes itself when given the error feedback.
Cost Analysis
| Approach | Model | Calls | Est. Cost per Query |
|---|---|---|---|
| Naive | GPT-4o | 1 | ~$0.015 |
| Agentic | GPT-4o-mini | ≤3 | ~$0.0015 |
| Savings | | | ~10× cheaper |
Key Insight: Even with 3 retry attempts, the agentic approach with GPT-4o-mini is roughly 10× cheaper than a single GPT-4o call—and often more reliable because it can catch and fix errors.
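To measure this yourself instead of trusting the table, you can read token counts from the `usage` field of each chat completion response and multiply by your current rates. The prices below are placeholders I'm assuming for illustration; check the provider's pricing page before relying on them:

```python
# Cost-tracking sketch. The rates are assumptions for illustration only.
PRICE_PER_1K_TOKENS = {
    "gpt-4o":      (0.0025, 0.0100),   # (input, output) USD per 1K tokens, assumed
    "gpt-4o-mini": (0.00015, 0.0006),  # assumed
}

def call_cost(model_name, usage):
    # `usage` is the .usage field of a chat.completions response
    inp_rate, out_rate = PRICE_PER_1K_TOKENS[model_name]
    return (usage.prompt_tokens / 1000) * inp_rate + (usage.completion_tokens / 1000) * out_rate
```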
Why Self-Correction Works
- Error Signals: Execution errors provide concrete, actionable feedback
- Focused Context: The model knows exactly what went wrong
- Iterative Refinement: Each attempt builds on previous learnings
- Safety Net: Catches silly mistakes that even powerful models make
Extend the Project
Try these challenges:
- Add More Test Cases: create edge cases that break naive implementations
- Implement Timeout: add an execution timeout to catch infinite loops (the sandboxed runner above is one option)
- Multi-Language Support: extend to JavaScript, Rust, or other languages
- Use Local Models: try Ollama + Llama-3 for completely local execution (see the sketch after this list)
- Add Debugging Commentary: make the agent explain what it changed and why
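For the local-model challenge, one approach (a sketch, assuming Ollama is installed, running, and has a Llama-3 model pulled) is to point the same OpenAI client at Ollama's OpenAI-compatible endpoint, so `agentic_solve()` runs unchanged:

```python
# Local-model sketch: swap the client, keep the reflection loop as-is.
# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally

# Only the model name changes:
# code, attempts = agentic_solve("llama3", PROBLEM_DESCRIPTION)
```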
Advanced: Tree of Attempts
Instead of linear retry, try parallel exploration:
```python
def parallel_solve(problem, num_candidates=3):
    # Generate multiple solutions (sequentially here; the calls could be parallelized)
    candidates = [generate_code(problem) for _ in range(num_candidates)]
    # Test all candidates and return the first one that passes
    best_candidate = candidates[0]
    for code in candidates:
        success, _ = run_and_test(code, TEST_CASES)
        if success:
            return code
        best_candidate = code  # placeholder: rank failures and keep the most promising
    # If all fail, pick the best failure and iterate on it
    return iterative_fix(best_candidate)
```
Key Takeaways
- Agency beats raw power - A self-correcting loop often outperforms one-shot generation
- Verification is key - Clear pass/fail tests enable the reflection loop
- Cost efficiency - Smaller models + iteration can be cheaper AND more reliable
- Error context matters - Specific error messages drive better fixes
References & Further Reading
Academic Papers
- "Teaching Large Language Models to Self-Debug" - Chen et al., 2023 (arXiv:2304.05128). Foundation for self-debugging code agents.
- "Reflexion: Language Agents with Verbal Reinforcement Learning" - Shinn et al., 2023 (arXiv:2303.11366). General reflection framework applicable to coding.
- "CodeT: Code Generation with Generated Tests" - Chen et al., 2022 (arXiv:2207.10397). Using generated tests for code verification.
- "InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback" - Yang et al., 2023 (arXiv:2306.14898). Benchmark for interactive coding agents.
What's Next?
Congratulations on completing Week 2! You've learned:
- Chain of Thought prompting
- Plan-and-Execute agents
- Reflection and self-correction patterns
- Structured reasoning techniques
In Week 3, we'll explore Multi-Agent Systems—where multiple specialized agents collaborate to solve complex problems together.
Run the Notebook
```bash
jupyter notebook week2_reasoning/project_self_correcting_coder.ipynb
```