Part 1

Agentic Development in the Wild

This project was purpose-built to study the emerging practice of agentic development—integrating AI assistance throughout the software lifecycle. The objective wasn't just to ship a working CLI tool, but to rigorously evaluate how modular architectures, shared configurations, and resilient design can support meaningful collaboration with large language models (LLMs).

The result was Tilecraft, a production-ready vector tile generator built with an AI-augmented workflow. Along the way, we codified concrete architectural patterns, documented code and test metrics, and assessed real-world LLM capabilities and limitations through direct observation.

Context: Building Tilecraft

Tilecraft ingests OSM data, transforms it into a vector tile schema, and renders it in a custom cartographic style using open standards. At its core, it's a CLI app built to be fast, modular, and human-friendly.

Agentic development refers to a modular, AI-augmented workflow that integrates LLM agents as co-workers across the software lifecycle—from design and config generation to real-time code fixes and test authoring.

Key Principles Discovered

1. Modularize Everything (for the Agent, Not Just the Human)

AI tools work better with boundaries. We quickly learned that long monolithic files or loosely scoped functions led to poor LLM performance and lower-quality suggestions. Refactoring our codebase into tight, single-responsibility modules improved LLM interactions dramatically.

2. Configuration is Collaboration

We started with human-readable YAML configs for data sources, layer schemas, and styling parameters. But midway through, we realized that these configs also served as a shared memory space for the AI agents.
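
As a concrete illustration, a config like this can be validated with Pydantic before either a human or an agent touches it. A minimal sketch, assuming hypothetical field names rather than Tilecraft's actual schema:

```python
# Sketch: validating the shared YAML config with Pydantic before a human or
# agent edits it. Field names are hypothetical, not Tilecraft's actual schema.
from pathlib import Path

import yaml
from pydantic import BaseModel


class LayerConfig(BaseModel):
    name: str
    source: str          # e.g. an OSM extract path or URL
    min_zoom: int = 0
    max_zoom: int = 14


class ProjectConfig(BaseModel):
    output_dir: Path
    layers: list[LayerConfig]


def load_config(path: Path) -> ProjectConfig:
    """Parse and validate the config that humans and agents both edit."""
    return ProjectConfig(**yaml.safe_load(path.read_text()))
```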

3. Use Progressive Enhancement, but for Intelligence

Rather than expect the AI to write perfect code the first time, we scaffolded workflows that let it propose partial solutions: writing function shells with TODOs, auto-inserting logging wrappers, and falling back to mocked outputs when real data fails.
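
A minimal sketch of what this pattern can look like, using a hypothetical feature-extraction step (the names and mock data are illustrative):

```python
# Sketch of progressive enhancement: an agent-written shell with TODOs, an
# auto-inserted logging wrapper, and a mocked output that keeps the pipeline
# running until the real implementation lands. Names are illustrative.
import functools
import logging

logger = logging.getLogger("tilecraft.agent")

MOCK_FEATURES = [{"type": "Feature", "geometry": None, "properties": {}}]


def logged(fn):
    """Logging wrapper the workflow inserts around agent-written functions."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logger.info("calling %s", fn.__name__)
        return fn(*args, **kwargs)
    return wrapper


@logged
def extract_features(path: str) -> list[dict]:
    # TODO(agent): parse the OSM extract at `path` into GeoJSON-like features.
    logger.warning("extract_features not implemented; returning mock data")
    return MOCK_FEATURES
```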

4. Graceful Degradation is Not Optional

LLMs will get things wrong. They will hallucinate field names, misinterpret schemas, or propose unparseable JSON. The key is to design with failure as a first-class citizen.
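
One way to do this is to treat malformed JSON and hallucinated fields as expected events with safe fallbacks. The sketch below is illustrative, not the project's actual handler:

```python
# Sketch: treating LLM failure modes (unparseable JSON, hallucinated fields)
# as expected events with a safe fallback. Names and defaults are illustrative.
import json
import logging

logger = logging.getLogger("tilecraft.agent")

ALLOWED_FIELDS = {"layer", "min_zoom", "max_zoom", "style"}
DEFAULT_PROPOSAL = {"layer": "default", "min_zoom": 0, "max_zoom": 14}


def parse_agent_proposal(raw: str) -> dict:
    """Parse an LLM response, dropping unknown fields and falling back on error."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("agent returned unparseable JSON; using default proposal")
        return dict(DEFAULT_PROPOSAL)

    if not isinstance(data, dict):
        logger.warning("agent returned non-object JSON; using default proposal")
        return dict(DEFAULT_PROPOSAL)

    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        logger.warning("dropping hallucinated fields: %s", sorted(unknown))
    kept = {k: v for k, v in data.items() if k in ALLOWED_FIELDS}
    return kept or dict(DEFAULT_PROPOSAL)
```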

Documented Results

  • 4,632 lines of production code
  • 27 automated tests written
  • 64% test coverage achieved
  • 100% pipeline functionality

Qualitative Observations

While we lack baseline measurements for precise quantification, the development process exhibited several notable characteristics:

  • Development velocity felt accelerated compared to similar past projects, though without controlled measurement
  • AI generated substantial boilerplate code for CLI scaffolding, error handling, and documentation
  • Test-first approach reduced iteration cycles by catching AI-generated errors early
  • Rich documentation was AI-assisted but required human review and refinement
  • Error handling was more comprehensive due to AI's systematic approach to edge cases

Part 2

Architecture Patterns for Human-AI Collaboration

Agentic development demands more than clever prompting. It requires intentional system design that accommodates partial automation, graceful failure, and clear task boundaries.

1. Dual Loop Architecture

We adopted a dual feedback loop model: one loop for human-driven development, and another for AI-generated artifacts. Both loops interacted with a shared config state and logging system.
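
In heavily simplified terms, the idea looks something like the sketch below; the state shape and loop bodies are placeholders, not the actual implementation:

```python
# Simplified sketch of the dual-loop idea: a human loop and an agent loop both
# read and write a shared config state and append to a shared log.
# Everything here is a placeholder for the real workflow.
from dataclasses import dataclass, field


@dataclass
class SharedState:
    config: dict
    log: list = field(default_factory=list)


def agent_loop(state: SharedState) -> None:
    """AI loop: propose artifacts based on the current config."""
    proposal = {"style": "auto-generated"}              # stand-in for an LLM call
    state.config.setdefault("pending", {}).update(proposal)
    state.log.append(f"agent proposed: {proposal}")


def human_loop(state: SharedState) -> None:
    """Human loop: review pending proposals, then accept or reject them."""
    pending = state.config.pop("pending", {})
    state.config.update(pending)                        # in practice: review first
    state.log.append(f"human accepted: {pending}")


state = SharedState(config={"layers": ["roads", "water"]})
agent_loop(state)
human_loop(state)
```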

2. Structured Prompts via YAML Templates

Rather than write prompts ad hoc, we stored them as parameterized YAML templates. These templates could be programmatically populated with context and included a comment field for human notes.
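
A minimal version of such a template and its rendering, assuming Jinja2 and an illustrative template body:

```python
# Sketch: a parameterized prompt stored as YAML and rendered with Jinja2.
# The template body, comment field, and layer names are illustrative.
import yaml
from jinja2 import Template

PROMPT_YAML = """
name: propose_schema
comment: "Human note: keep the layer list short to reduce drift."
template: |
  You are helping map OSM tags to a vector tile schema.
  Layers: {{ layers | join(", ") }}
  Return a JSON object mapping each layer to its OSM tag filter.
"""


def render_prompt(layers: list[str]) -> str:
    spec = yaml.safe_load(PROMPT_YAML)
    return Template(spec["template"]).render(layers=layers)


print(render_prompt(["roads", "water", "buildings"]))
```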

3. Agent-Writable Configs with Human Locks

We used config files as the shared working memory for the system, but guarded critical fields with human locks. LLMs could read the configs and suggest edits, but changes to locked fields were flagged for human approval.
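
One simple way to implement those locks is to compare proposed edits against a locked-field list before applying them. A sketch with hypothetical field names:

```python
# Sketch: applying agent-proposed config edits while routing locked fields
# through human approval. The locked field names are hypothetical.
LOCKED_FIELDS = {"output_dir", "data_source"}


def apply_agent_edits(config: dict, proposed: dict) -> tuple[dict, dict]:
    """Return (updated config, edits flagged for human approval)."""
    updated = dict(config)
    needs_approval = {}
    for key, value in proposed.items():
        if key in LOCKED_FIELDS:
            needs_approval[key] = value      # flag instead of applying
        else:
            updated[key] = value
    return updated, needs_approval


config, flagged = apply_agent_edits(
    {"output_dir": "tiles/", "style": "light"},
    {"style": "dark", "output_dir": "/tmp/tiles"},
)
# flagged == {"output_dir": "/tmp/tiles"} and waits for human sign-off
```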

4. Self-Scaffolding CLI Interface

We structured the CLI to emit AI-scaffoldable hints by default. Running tilecraft build would check for missing configs, emit markdown-style prompt suggestions on failure, and offer a retry with --with-agent to auto-fill the missing elements.
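
With Typer, the behavior described above might look roughly like the following; the config check, hint text, and agent hook are illustrative rather than the real tilecraft CLI:

```python
# Sketch of a self-scaffolding CLI command with Typer. The config check, hint
# text, and agent hook are illustrative, not the actual tilecraft CLI.
from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def build(
    config: Path = typer.Option(Path("tilecraft.yaml"), help="Project config."),
    with_agent: bool = typer.Option(False, "--with-agent", help="Let the agent fill gaps."),
):
    if not config.exists():
        typer.echo("Missing config. Suggested prompt:")
        typer.echo("> Generate a tilecraft.yaml with layers: roads, water, buildings")
        if with_agent:
            typer.echo("Invoking agent to draft the missing config...")  # agent call here
        raise typer.Exit(code=1)
    typer.echo(f"Building tiles from {config}")


if __name__ == "__main__":
    app()
```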

5. Observability Layer for AI Behavior

We added structured logging for prompt content and AI responses, execution time and error rates by agent task, and human overrides and final decisions. All logs were indexed and queryable via a local dashboard.
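
A minimal structured-logging helper for this purpose might write one JSON line per agent interaction. The field set below is an assumption, not the exact schema we logged:

```python
# Sketch: logging each agent interaction as one JSON line so prompts, responses,
# timings, and human overrides can be indexed later. Field names are assumptions.
import json
import time
from pathlib import Path

LOG_PATH = Path("logs/agent_events.jsonl")


def log_agent_event(task: str, prompt: str, response: str,
                    duration_s: float, human_override: bool = False) -> None:
    event = {
        "ts": time.time(),
        "task": task,
        "prompt": prompt,
        "response": response,
        "duration_s": round(duration_s, 3),
        "human_override": human_override,
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with LOG_PATH.open("a") as fh:
        fh.write(json.dumps(event) + "\n")
```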

Part 3

Testing and Quality Assurance for AI-Enhanced Systems

Testing AI-generated code adds complexity. But with deliberate patterns and workflows, we maintained stability across the pipeline and enforced standards that AI alone wouldn't catch.

What We Actually Did

  • Wrote 27 automated tests targeting critical path components
  • Achieved 64% test coverage on production modules
  • Implemented test-first design for many agentic features
  • Used LLMs to generate and explain tests—but always with human review
  • Built a lightweight QA checklist for verifying AI-assisted pull requests

Key Testing Patterns

Pattern 1: Test the Boundaries, Not Just the Happy Path

AI often generates plausible-looking code—but it's the edge cases where things break. We adopted a testing approach that emphasized boundary conditions: missing input files, unexpected schema fields, invalid config parameters, network timeouts.
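
As an example, boundary-focused pytest cases might exercise a missing input file and an invalid config explicitly. The load_config helper and its import path are assumptions for illustration:

```python
# Sketch: boundary-condition tests with pytest. The load_config helper and its
# import path are assumptions, not the actual Tilecraft test suite.
from pathlib import Path

import pytest
import yaml
from pydantic import ValidationError

from tilecraft.config import load_config  # hypothetical import path


def test_missing_input_file(tmp_path: Path):
    with pytest.raises(FileNotFoundError):
        load_config(tmp_path / "does_not_exist.yaml")


def test_invalid_config_parameters(tmp_path: Path):
    bad = tmp_path / "bad.yaml"
    bad.write_text(yaml.safe_dump({"output_dir": "tiles/", "layers": "not-a-list"}))
    with pytest.raises(ValidationError):
        load_config(bad)
```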

Pattern 2: Generate Test Skeletons, Then Refactor

We frequently asked the AI to write test scaffolds. The results were hit-or-miss but still useful for structure, fixtures, and naming; we always reviewed the logic and added context-specific assertions.

Pattern 3: Validate Prompt Chains and Agent Steps

Many components relied on prompt-chain logic. We tested these by running dry simulations, capturing intermediate outputs, and validating schema consistency across steps.
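
A dry-run harness can capture each intermediate output and validate it against that step's expected schema before the next step runs. The steps and schemas below are placeholders (Pydantic v2 API assumed):

```python
# Sketch: dry-running a prompt chain with canned step outputs and validating
# each intermediate result against that step's expected schema (Pydantic v2).
# The steps and schemas are placeholders.
from pydantic import BaseModel


class TagFilter(BaseModel):
    layer: str
    osm_filter: str


class StyleRule(BaseModel):
    layer: str
    color: str


def dry_run_chain(steps, initial_input):
    """steps: list of (step_fn, expected_schema) pairs; returns the captured trace."""
    trace, current = [], initial_input
    for step_fn, schema in steps:
        current = step_fn(current)          # dry run: the step returns canned output
        schema.model_validate(current)      # raises if the shape drifted between steps
        trace.append({"step": step_fn.__name__, "output": current})
    return trace


def propose_filter(_):
    return {"layer": "roads", "osm_filter": "highway=*"}


def propose_style(prev):
    return {"layer": prev["layer"], "color": "#888888"}


trace = dry_run_chain([(propose_filter, TagFilter), (propose_style, StyleRule)], {})
```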

Pattern 4: Use Logging as Test Evidence

Some AI-generated behavior wasn't easily testable with asserts. Instead, we relied on structured logs as verification: was the right fallback invoked? Did the agent use the correct schema template?
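
pytest's caplog fixture is one way to turn those logs into test evidence. The example assumes the fallback parser sketched earlier and a hypothetical import path:

```python
# Sketch: using pytest's caplog fixture as evidence that the right fallback ran.
# parse_agent_proposal is the helper sketched earlier; the import path is hypothetical.
import logging

from tilecraft.agent import parse_agent_proposal  # hypothetical import path


def test_fallback_logged_on_bad_json(caplog):
    with caplog.at_level(logging.WARNING, logger="tilecraft.agent"):
        result = parse_agent_proposal("not valid json {")
    assert "unparseable JSON" in caplog.text       # confirms the fallback path fired
    assert result == {"layer": "default", "min_zoom": 0, "max_zoom": 14}
```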

Pattern 5: Human-in-the-Loop QA Checklist

We adopted a simple checklist for reviewing agent-generated pull requests: CLI functionality, fallback behavior, realistic sample testing, prompt readability, and schema contract compliance.

Part 4

From 0 to Production: Technical Deep-Dive

This section offers a granular look at what it actually took to go from zero to production with an AI-augmented workflow focused on data science infrastructure—from ingestion to export.

The Stack

  • Python (3.11) for all pipeline components
  • Pydantic for data validation and configuration schemas
  • FastAPI for exposing processing steps as services
  • Tippecanoe and tilemaker for vector tile generation
  • DuckDB and GeoPandas for local data handling and ETL
  • Jinja2 for templating prompts and config files
  • Rich and Typer for building a modern CLI

Pipeline Overview

The data science workflow had five major stages (a condensed orchestration sketch follows the list):

  1. Data Ingestion: Download and stage OSM extracts, convert to GeoJSON or FlatGeobuf, extract relevant features
  2. Schema Mapping: Normalize attributes across layers, use LLM to propose style schemas, validate with Pydantic
  3. Tiling + Caching: Use Tippecanoe to generate .mbtiles, apply caching logic, compress outputs
  4. Export + Serving: Host tiles locally or upload to S3-compatible storage
  5. Documentation: Auto-generate Markdown docs, create zipped bundles with README and logs
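
A heavily condensed orchestration of these five stages might read as follows; every stage function here is a trivial placeholder, and only the Tippecanoe invocation reflects a real external tool:

```python
# Condensed sketch of the five-stage pipeline. Every stage function below is a
# trivial placeholder; only the Tippecanoe call reflects a real external tool.
import subprocess
from pathlib import Path


def ingest(extract_url: str, out_dir: Path) -> Path:        # 1. download + convert
    return out_dir / "features.geojson"                      # placeholder


def map_schema(geojson: Path) -> dict:                       # 2. normalize + validate
    return {"roads": "highway=*"}                            # placeholder


def export(mbtiles: Path, destination: str) -> None:         # 4. serve or upload
    pass                                                      # placeholder


def write_docs(out_dir: Path, schema: dict) -> None:          # 5. Markdown docs
    (out_dir / "README.md").write_text(f"Layers: {list(schema)}\n")


def run_pipeline(extract_url: str, out_dir: Path) -> None:
    geojson = ingest(extract_url, out_dir)
    schema = map_schema(geojson)
    mbtiles = out_dir / "tiles.mbtiles"
    subprocess.run(                                          # 3. tile with Tippecanoe
        ["tippecanoe", "-o", str(mbtiles), "--force", str(geojson)],
        check=True,
    )
    export(mbtiles, destination="s3://example-bucket/tiles/")
    write_docs(out_dir, schema)
```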

Where the AI Helped

  • Schema proposals: GPT-4 reliably generated proposed JSON schema mappings for common OSM tags
  • Error handling templates: AI helped generate boilerplate try/except scaffolds
  • Markdown documentation: Agent summarized config contents and output formats
  • Naming + Refactoring: AI suggested class and function names when given role-based prompts

Where the AI Struggled

  • Complex conditional logic: Multi-field exceptions made AI suggestions brittle
  • Multi-file context: Without clear scaffolding, LLMs lost track of dependencies
  • Caching and fingerprinting: Schema-aware deduplication had to be hand-coded

By the numbers:

  • 4,632 lines of code across 9 modules
  • 27 tests with 64% coverage
  • 100% of outputs versioned with logs

Part 5

Reflections + Recommendations: The Agentic Playbook

This final piece distills what we've learned into a practical playbook for developers and teams experimenting with AI-augmented workflows. This is not a hype piece—it's a synthesis grounded in actual practices, observations, and limitations.

🧭 Core Principles

  1. AI Is a Tool, Not a Teammate: Treat it like a powerful code-generation utility—not a replacement for design judgment, testing, or architecture.
  2. Agentic Workflows Require Structure: Modularize code, externalize configs, and version prompts. Reproducibility beats cleverness.
  3. Human-in-the-Loop Is Not Optional: Every LLM-generated artifact must pass through a human QA checkpoint.
  4. Failures Are Features, If You Design for Them: Expect errors, hallucinations, and ambiguity. Build fallback logic in from the start.
  5. Documentation Is Part of the System: Use AI to help write it—but make documentation part of the workflow.

🧱 What to Build

Build These:

  • ✅ Config schemas that AI can both read and write
  • ✅ A CLI that emits helpful error states and scaffoldable prompts
  • ✅ A prompt engine with context injection and variable substitution
  • ✅ Logging utilities that include AI provenance
  • ✅ Agent-aware test harnesses and fallback scripts

Avoid These:

  • ❌ Opaque prompt chains with no logging
  • ❌ Storing LLM outputs in production without human review
  • ❌ AI agents with access to destructive file operations

🧩 Where to Start

If you're just beginning to explore agentic development:

  1. Pick a simple CLI-based tool you want to build
  2. Define the core config schema and user inputs
  3. Write out your AI prompts as YAML templates
  4. Use AI to scaffold, but manually test everything
  5. Add structured logging, doc generation, and fallback behavior

The future of AI-augmented software development will not be about autonomous agents building software alone. It will be about designing systems where humans and AI interact fluidly, accountably, and effectively.

Executive Summary

Key Takeaways & Presentation Outline

This executive summary captures the essential insights from the Tilecraft project for teams and decision-makers evaluating agentic development approaches.

Why This Work Matters

  • Software development is changing: LLMs are becoming part of the standard toolchain
  • Developer teams need structured approaches, not just experimentation
  • Agentic development represents a measurable, reproducible pattern worth studying

What Is Agentic Development?

  • Definition: Modular, AI-augmented workflows with human oversight
  • Goal: Augment—not automate—the developer process
  • Structure: Human ↔ Config ↔ AI feedback loops with shared state

Case Study Results: Tilecraft

  • CLI tool to convert OSM → Vector Tiles
  • Built from scratch using AI throughout the lifecycle
  • Emphasis on reproducibility, modularity, and testability
  • Documented metrics: 4,632 LOC, 27 tests, 64% coverage
  • Qualitative outcome: Development felt accelerated with comprehensive AI assistance

Architecture Patterns That Worked

  • Dual loop model: Human + AI feedback cycles
  • YAML-driven prompt templates with version control
  • AI-writable config with human-locked critical fields
  • Self-scaffolding CLI + comprehensive observability layer

Testing & QA Approach

  • Test-first scaffolds with AI-generated test shells
  • Focus on edge cases and boundary conditions
  • Logs-as-verification for dynamic AI behavior
  • Human-in-the-loop QA checklist for all AI outputs

The Agentic Playbook Summary

  • 5 principles: AI is a tool, design for failure, human-in-loop mandatory
  • 5 things to build: versioned prompts, config schemas, observability
  • 3 things to avoid: opaque agents, destructive operations, skipping QA

What Still Needs Research

  • Quantitative productivity benchmarks across different project types
  • Prompt versioning and agent performance tracking standards
  • Long-term maintainability patterns for hybrid human/AI systems