Weekend Project: The Self-Correcting Coder
Notebook: week2_reasoning/project_self_correcting_coder.ipynb
The Big Question
Can a smaller, cheaper model outperform a giant model by "thinking" more?
In this project, we put this hypothesis to the test by building a Self-Correcting Coding Agent that doesn't just write code once—it runs, tests, and fixes its own code iteratively.
Project Overview
We compare three approaches:
| Approach | Model | Strategy |
|---|---|---|
| Naive GPT-4o | gpt-4o | One-shot: "Write this code" |
| Agentic GPT-4o-mini | gpt-4o-mini | Reflection: "Write, Test, Fix, Repeat" |
| Agentic Llama-3 | llama-3-8b | Same pattern with an open-weights model |
Architecture: Reflection Pattern for Code
The self-correcting coder applies the reflection pattern specifically to code generation: write code, execute it against known tests, feed any failure back to the model, and repeat until the tests pass or the retry budget runs out.
The Challenge Problem
To see the difference, we need a problem that is:
- Easy to get wrong on the first try
- Easy to verify (clear pass/fail tests)
Problem: Basic Calculator II
Implement a calculator to evaluate simple expressions like "3+2*2".
Constraints:
- Contains only non-negative integers, the operators `+`, `-`, `*`, `/`, and spaces
- Integer division truncates toward zero
- DO NOT use `eval()`: implement the parsing logic manually
- Handle operator precedence (`*` and `/` before `+` and `-`)
Why is this hard? Naive implementations often:
- Execute left-to-right (ignoring precedence)
- Mess up multi-digit number parsing
- Handle edge cases incorrectly
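For reference, here is one well-known way to get it right (a minimal sketch I'm adding, not the agent's output): a single left-to-right pass with a stack, resolving `*` and `/` immediately and deferring `+` and `-` until the end.

```python
def calculate(s: str) -> int:
    # Reference sketch: stack-based evaluation with operator precedence
    stack, num, op = [], 0, '+'
    for i, ch in enumerate(s):
        if ch.isdigit():
            num = num * 10 + int(ch)          # accumulate multi-digit numbers
        if ch in '+-*/' or i == len(s) - 1:   # flush on an operator or at end of input
            if op == '+':
                stack.append(num)
            elif op == '-':
                stack.append(-num)
            elif op == '*':
                stack.append(stack.pop() * num)
            else:  # '/': integer division truncating toward zero
                stack.append(int(stack.pop() / num))
            op, num = ch, 0
    return sum(stack)
```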
Test Cases
```python
TEST_CASES = [
    ("3+2*2", 7),        # Not 10! Precedence matters
    (" 3/2 ", 1),        # Truncate toward zero
    (" 3+5 / 2 ", 5),    # Spaces and mixed operators
    ("42", 42),          # Single number
    ("100*2+12", 212),   # Multi-digit numbers
    ("1-1+1", 1),        # Left-to-right for same precedence
    ("2*3*4", 24),       # Chained operators
]
```
Implementation
Step 1: The Naive Approach
One-shot generation with the strongest model:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def naive_solve(model_name, problem):
    """One-shot generation: ask once, take whatever comes back."""
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a Python expert. Output ONLY code."},
            {"role": "user", "content": problem},
        ],
    )
    return response.choices[0].message.content

# Run
code_gpt4o = naive_solve("gpt-4o", PROBLEM_DESCRIPTION)
```
Step 2: Testing Infrastructure
Execute code and verify against test cases:
```python
def run_and_test(code_str, test_cases):
    try:
        # Execute in an isolated namespace. A single dict is used for globals so that
        # any helper functions the model defines can still see each other.
        env = {}
        exec(code_str, env)
        solve_func = env.get("calculate")
        if not solve_func:
            return False, "Function 'calculate' not found."
        # Run test cases
        for inp, expected in test_cases:
            result = solve_func(inp)
            if result != expected:
                return False, f"Failed for '{inp}'. Expected {expected}, Got {result}"
        return True, "All tests passed!"
    except Exception as e:
        return False, f"Error: {e}"
```
Security Note: Only run `exec()` on code generated by your own agent, and only in a sandboxed environment. Never execute untrusted code.
Step 3: The Agentic Approach
Self-correcting loop with a smaller model:
```python
def clean_code(text):
    # Minimal version of the clean_code helper assumed by the loop below:
    # drop markdown code-fence lines that chat models often wrap code in.
    fence = "`" * 3
    lines = [ln for ln in text.splitlines() if not ln.strip().startswith(fence)]
    return "\n".join(lines).strip()

def agentic_solve(model_name, problem, max_retries=3):
    history = [
        {"role": "system", "content": "You are a Python expert. Output ONLY valid Python code. DO NOT USE eval()."},
        {"role": "user", "content": problem},
    ]
    for i in range(max_retries):
        print(f"🔄 Attempt {i+1} ({model_name})...")
        # 1. Generate
        response = client.chat.completions.create(
            model=model_name,
            messages=history,
        )
        code = clean_code(response.choices[0].message.content)
        # 2. Test
        success, feedback = run_and_test(code, TEST_CASES)
        print(f"   Result: {feedback}")
        if success:
            return code, i + 1  # Return the working code and the attempt count
        # 3. Reflect: add the failing code and the error back into the conversation
        history.append({"role": "assistant", "content": code})
        history.append({
            "role": "user",
            "content": f"Your code failed with: {feedback}\n\nPlease fix it.",
        })
    return None, max_retries
```
The Showdown: Results
Naive GPT-4o (One-Shot)
```
--- Naive GPT-4o ---
Pass? True (All tests passed!)
```
The powerful model often gets it right in one try... but not always.
Agentic GPT-4o-mini (Self-Correcting)
```
--- Agentic GPT-4o-mini ---
🔄 Attempt 1 (gpt-4o-mini)...
   Result: Failed for '3+2*2'. Expected 7, Got 10
🔄 Attempt 2 (gpt-4o-mini)...
   Result: All tests passed!
✅ Solved in 2 attempts!
```
The smaller model gets it wrong initially (ignoring precedence), but fixes itself when given the error feedback.
Cost Analysis
| Approach | Model | Calls | Est. Cost per Query |
|---|---|---|---|
| Naive | GPT-4o | 1 | ~$0.015 |
| Agentic | GPT-4o-mini | ≤3 | ~$0.0015 |
| Savings | | | ~10× cheaper |
Key Insight: Even with 3 retry attempts, the agentic approach with GPT-4o-mini is roughly 10× cheaper than a single GPT-4o call—and often more reliable because it can catch and fix errors.
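To measure this yourself instead of trusting the table, you can read token counts from the `usage` field of each chat completion response and multiply by your current rates. The prices below are placeholders I'm assuming for illustration; check the provider's pricing page before relying on them:

```python
# Cost-tracking sketch. The rates are assumptions for illustration only.
PRICE_PER_1K_TOKENS = {
    "gpt-4o":      (0.0025, 0.0100),   # (input, output) USD per 1K tokens, assumed
    "gpt-4o-mini": (0.00015, 0.0006),  # assumed
}

def call_cost(model_name, usage):
    # `usage` is the .usage field of a chat.completions response
    inp_rate, out_rate = PRICE_PER_1K_TOKENS[model_name]
    return (usage.prompt_tokens / 1000) * inp_rate + (usage.completion_tokens / 1000) * out_rate
```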
Why Self-Correction Works
- Error Signals: Execution errors provide concrete, actionable feedback
- Focused Context: The model knows exactly what went wrong
- Iterative Refinement: Each attempt builds on previous learnings
- Safety Net: Catches silly mistakes that even powerful models make
Extend the Project
Try these challenges:
- Add More Test Cases: create edge cases that break naive implementations
- Implement Timeout: add an execution timeout to catch infinite loops (the sandboxed runner above is one option)
- Multi-Language Support: extend to JavaScript, Rust, or other languages
- Use Local Models: try Ollama + Llama-3 for completely local execution (see the sketch after this list)
- Add Debugging Commentary: make the agent explain what it changed and why
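For the local-model challenge, one approach (a sketch, assuming Ollama is installed, running, and has a Llama-3 model pulled) is to point the same OpenAI client at Ollama's OpenAI-compatible endpoint, so `agentic_solve()` runs unchanged:

```python
# Local-model sketch: swap the client, keep the reflection loop as-is.
# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally

# Only the model name changes:
# code, attempts = agentic_solve("llama3", PROBLEM_DESCRIPTION)
```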
Advanced: Tree of Attempts
Instead of linear retry, try parallel exploration:
```python
def parallel_solve(problem, num_candidates=3):
    # Generate multiple solutions (sequentially here; the calls could be parallelized)
    candidates = [generate_code(problem) for _ in range(num_candidates)]
    # Test all candidates and return the first one that passes
    best_candidate = candidates[0]
    for code in candidates:
        success, _ = run_and_test(code, TEST_CASES)
        if success:
            return code
        best_candidate = code  # placeholder: rank failures and keep the most promising
    # If all fail, pick the best failure and iterate on it
    return iterative_fix(best_candidate)
```
Key Takeaways
- Agency beats raw power - A self-correcting loop often outperforms one-shot generation
- Verification is key - Clear pass/fail tests enable the reflection loop
- Cost efficiency - Smaller models + iteration can be cheaper AND more reliable
- Error context matters - Specific error messages drive better fixes
References & Further Reading
Academic Papers
- "Teaching Large Language Models to Self-Debug" - Chen et al., 2023 (arXiv:2304.05128). Foundation for self-debugging code agents.
- "Reflexion: Language Agents with Verbal Reinforcement Learning" - Shinn et al., 2023 (arXiv:2303.11366). General reflection framework applicable to coding.
- "CodeT: Code Generation with Generated Tests" - Chen et al., 2022 (arXiv:2207.10397). Using generated tests for code verification.
- "InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback" - Yang et al., 2023 (arXiv:2306.14898). Benchmark for interactive coding agents.
What's Next?
Congratulations on completing Week 2! You've learned:
- Chain of Thought prompting
- Plan-and-Execute agents
- Reflection and self-correction patterns
- Structured reasoning techniques
In Week 3, we'll explore Multi-Agent Systems—where multiple specialized agents collaborate to solve complex problems together.
Run the Notebook
```bash
jupyter notebook week2_reasoning/project_self_correcting_coder.ipynb
```