Part 1

Agentic Development in the Wild

This project was purpose-built to study the emerging practice of agentic development—integrating AI assistance throughout the software lifecycle. The objective wasn't just to ship a working CLI tool, but to rigorously evaluate how modular architectures, shared configurations, and resilient design can support meaningful collaboration with large language models (LLMs).

The result was Tilecraft, a production-ready vector tile generator built with an AI-augmented workflow. Along the way, we codified concrete architectural patterns, documented code and test metrics, and assessed real-world LLM capabilities and limitations through direct observation.

Context: Building Tilecraft

Tilecraft ingests OSM data, transforms it into a vector tile schema, and renders it in a custom cartographic style using open standards. At its core, it's a CLI app built to be fast, modular, and human-friendly.

Agentic development refers to a modular, AI-augmented workflow that integrates LLM agents as co-workers across the software lifecycle—from design and config generation to real-time code fixes and test authoring.

Key Principles Discovered

1. Modularize Everything (for the Agent, Not Just the Human)

AI tools work better with boundaries. We quickly learned that long monolithic files or loosely scoped functions led to poor LLM performance and lower-quality suggestions. Refactoring our codebase into tight, single-responsibility modules improved LLM interactions dramatically.

2. Configuration is Collaboration

We started with human-readable YAML configs for data sources, layer schemas, and styling parameters. But midway through, we realized that these configs also served as a shared memory space for the AI agents.
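
As a concrete illustration, a config like this can be validated with Pydantic before either a human or an agent touches it. A minimal sketch, assuming hypothetical field names rather than Tilecraft's actual schema:

```python
# Sketch: validating the shared YAML config with Pydantic before a human or
# agent edits it. Field names are hypothetical, not Tilecraft's actual schema.
from pathlib import Path

import yaml
from pydantic import BaseModel


class LayerConfig(BaseModel):
    name: str
    source: str          # e.g. an OSM extract path or URL
    min_zoom: int = 0
    max_zoom: int = 14


class ProjectConfig(BaseModel):
    output_dir: Path
    layers: list[LayerConfig]


def load_config(path: Path) -> ProjectConfig:
    """Parse and validate the config that humans and agents both edit."""
    return ProjectConfig(**yaml.safe_load(path.read_text()))
```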

3. Use Progressive Enhancement, but for Intelligence

Rather than expect the AI to write perfect code the first time, we scaffolded workflows that let it propose partial solutions: writing function shells with TODOs, auto-inserting logging wrappers, and falling back to mocked outputs when real data fails.
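
A minimal sketch of what this pattern can look like, using a hypothetical feature-extraction step (the names and mock data are illustrative):

```python
# Sketch of progressive enhancement: an agent-written shell with TODOs, an
# auto-inserted logging wrapper, and a mocked output that keeps the pipeline
# running until the real implementation lands. Names are illustrative.
import functools
import logging

logger = logging.getLogger("tilecraft.agent")

MOCK_FEATURES = [{"type": "Feature", "geometry": None, "properties": {}}]


def logged(fn):
    """Logging wrapper the workflow inserts around agent-written functions."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logger.info("calling %s", fn.__name__)
        return fn(*args, **kwargs)
    return wrapper


@logged
def extract_features(path: str) -> list[dict]:
    # TODO(agent): parse the OSM extract at `path` into GeoJSON-like features.
    logger.warning("extract_features not implemented; returning mock data")
    return MOCK_FEATURES
```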

4. Graceful Degradation is Not Optional

LLMs will get things wrong. They will hallucinate field names, misinterpret schemas, or propose unparseable JSON. The key is to design with failure as a first-class citizen.
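
One way to do this is to treat malformed JSON and hallucinated fields as expected events with safe fallbacks. The sketch below is illustrative, not the project's actual handler:

```python
# Sketch: treating LLM failure modes (unparseable JSON, hallucinated fields)
# as expected events with a safe fallback. Names and defaults are illustrative.
import json
import logging

logger = logging.getLogger("tilecraft.agent")

ALLOWED_FIELDS = {"layer", "min_zoom", "max_zoom", "style"}
DEFAULT_PROPOSAL = {"layer": "default", "min_zoom": 0, "max_zoom": 14}


def parse_agent_proposal(raw: str) -> dict:
    """Parse an LLM response, dropping unknown fields and falling back on error."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("agent returned unparseable JSON; using default proposal")
        return dict(DEFAULT_PROPOSAL)

    if not isinstance(data, dict):
        logger.warning("agent returned non-object JSON; using default proposal")
        return dict(DEFAULT_PROPOSAL)

    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        logger.warning("dropping hallucinated fields: %s", sorted(unknown))
    kept = {k: v for k, v in data.items() if k in ALLOWED_FIELDS}
    return kept or dict(DEFAULT_PROPOSAL)
```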

Documented Results

  • 4,632 lines of production code
  • 27 automated tests written
  • 64% test coverage achieved
  • 100% pipeline functionality

Qualitative Observations

While we lack baseline measurements for precise quantification, the development process exhibited several notable characteristics:

  • Development velocity felt accelerated compared to similar past projects, though without controlled measurement
  • AI generated substantial boilerplate code for CLI scaffolding, error handling, and documentation
  • Test-first approach reduced iteration cycles by catching AI-generated errors early
  • Rich documentation was AI-assisted but required human review and refinement
  • Error handling was more comprehensive due to AI's systematic approach to edge cases

Part 2

Architecture Patterns for Human-AI Collaboration

Agentic development demands more than clever prompting. It requires intentional system design that accommodates partial automation, graceful failure, and clear task boundaries.

1. Dual Loop Architecture

We adopted a dual feedback loop model: one loop for human-driven development, and another for AI-generated artifacts. Both loops interacted with a shared config state and logging system.
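
In heavily simplified terms, the idea looks something like the sketch below; the state shape and loop bodies are placeholders, not the actual implementation:

```python
# Simplified sketch of the dual-loop idea: a human loop and an agent loop both
# read and write a shared config state and append to a shared log.
# Everything here is a placeholder for the real workflow.
from dataclasses import dataclass, field


@dataclass
class SharedState:
    config: dict
    log: list = field(default_factory=list)


def agent_loop(state: SharedState) -> None:
    """AI loop: propose artifacts based on the current config."""
    proposal = {"style": "auto-generated"}              # stand-in for an LLM call
    state.config.setdefault("pending", {}).update(proposal)
    state.log.append(f"agent proposed: {proposal}")


def human_loop(state: SharedState) -> None:
    """Human loop: review pending proposals, then accept or reject them."""
    pending = state.config.pop("pending", {})
    state.config.update(pending)                        # in practice: review first
    state.log.append(f"human accepted: {pending}")


state = SharedState(config={"layers": ["roads", "water"]})
agent_loop(state)
human_loop(state)
```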

2. Structured Prompts via YAML Templates

Rather than write prompts ad hoc, we stored them as parameterized YAML templates. These templates could be programmatically populated with context and included a comment field for human notes.
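
A minimal version of such a template and its rendering, assuming Jinja2 and an illustrative template body:

```python
# Sketch: a parameterized prompt stored as YAML and rendered with Jinja2.
# The template body, comment field, and layer names are illustrative.
import yaml
from jinja2 import Template

PROMPT_YAML = """
name: propose_schema
comment: "Human note: keep the layer list short to reduce drift."
template: |
  You are helping map OSM tags to a vector tile schema.
  Layers: {{ layers | join(", ") }}
  Return a JSON object mapping each layer to its OSM tag filter.
"""


def render_prompt(layers: list[str]) -> str:
    spec = yaml.safe_load(PROMPT_YAML)
    return Template(spec["template"]).render(layers=layers)


print(render_prompt(["roads", "water", "buildings"]))
```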

3. Agent-Writable Configs with Human Locks

We used config files as the shared working memory for the system, but guarded critical fields with human locks. LLMs could read the configs and suggest edits, but changes to locked fields were flagged for human approval.
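
One simple way to implement those locks is to compare proposed edits against a locked-field list before applying them. A sketch with hypothetical field names:

```python
# Sketch: applying agent-proposed config edits while routing locked fields
# through human approval. The locked field names are hypothetical.
LOCKED_FIELDS = {"output_dir", "data_source"}


def apply_agent_edits(config: dict, proposed: dict) -> tuple[dict, dict]:
    """Return (updated config, edits flagged for human approval)."""
    updated = dict(config)
    needs_approval = {}
    for key, value in proposed.items():
        if key in LOCKED_FIELDS:
            needs_approval[key] = value      # flag instead of applying
        else:
            updated[key] = value
    return updated, needs_approval


config, flagged = apply_agent_edits(
    {"output_dir": "tiles/", "style": "light"},
    {"style": "dark", "output_dir": "/tmp/tiles"},
)
# flagged == {"output_dir": "/tmp/tiles"} and waits for human sign-off
```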

4. Self-Scaffolding CLI Interface

We structured the CLI to emit AI-scaffoldable hints by default. Running tilecraft build would check for missing configs, emit markdown-style prompt suggestions on failure, and offer a retry with --with-agent to auto-fill the missing elements.
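
With Typer, the behavior described above might look roughly like the following; the config check, hint text, and agent hook are illustrative rather than the real tilecraft CLI:

```python
# Sketch of a self-scaffolding CLI command with Typer. The config check, hint
# text, and agent hook are illustrative, not the actual tilecraft CLI.
from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def build(
    config: Path = typer.Option(Path("tilecraft.yaml"), help="Project config."),
    with_agent: bool = typer.Option(False, "--with-agent", help="Let the agent fill gaps."),
):
    if not config.exists():
        typer.echo("Missing config. Suggested prompt:")
        typer.echo("> Generate a tilecraft.yaml with layers: roads, water, buildings")
        if with_agent:
            typer.echo("Invoking agent to draft the missing config...")  # agent call here
        raise typer.Exit(code=1)
    typer.echo(f"Building tiles from {config}")


if __name__ == "__main__":
    app()
```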

5. Observability Layer for AI Behavior

We added structured logging for prompt content and AI responses, execution time and error rates by agent task, and human overrides and final decisions. All logs were indexed and queryable via a local dashboard.
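
A minimal structured-logging helper for this purpose might write one JSON line per agent interaction. The field set below is an assumption, not the exact schema we logged:

```python
# Sketch: logging each agent interaction as one JSON line so prompts, responses,
# timings, and human overrides can be indexed later. Field names are assumptions.
import json
import time
from pathlib import Path

LOG_PATH = Path("logs/agent_events.jsonl")


def log_agent_event(task: str, prompt: str, response: str,
                    duration_s: float, human_override: bool = False) -> None:
    event = {
        "ts": time.time(),
        "task": task,
        "prompt": prompt,
        "response": response,
        "duration_s": round(duration_s, 3),
        "human_override": human_override,
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with LOG_PATH.open("a") as fh:
        fh.write(json.dumps(event) + "\n")
```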

Part 3

Testing and Quality Assurance for AI-Enhanced Systems

Testing AI-generated code adds complexity. But with deliberate patterns and workflows, we maintained stability across the pipeline and enforced standards that AI alone wouldn't catch.

What We Actually Did

  • Wrote 27 automated tests targeting critical path components
  • Achieved 64% test coverage on production modules
  • Implemented test-first design for many agentic features
  • Used LLMs to generate and explain tests—but always with human review
  • Built a lightweight QA checklist for verifying AI-assisted pull requests

Key Testing Patterns

Pattern 1: Test the Boundaries, Not Just the Happy Path

AI often generates plausible-looking code—but it's the edge cases where things break. We adopted a testing approach that emphasized boundary conditions: missing input files, unexpected schema fields, invalid config parameters, network timeouts.
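
As an example, boundary-focused pytest cases might exercise a missing input file and an invalid config explicitly. The load_config helper and its import path are assumptions for illustration:

```python
# Sketch: boundary-condition tests with pytest. The load_config helper and its
# import path are assumptions, not the actual Tilecraft test suite.
from pathlib import Path

import pytest
import yaml
from pydantic import ValidationError

from tilecraft.config import load_config  # hypothetical import path


def test_missing_input_file(tmp_path: Path):
    with pytest.raises(FileNotFoundError):
        load_config(tmp_path / "does_not_exist.yaml")


def test_invalid_config_parameters(tmp_path: Path):
    bad = tmp_path / "bad.yaml"
    bad.write_text(yaml.safe_dump({"output_dir": "tiles/", "layers": "not-a-list"}))
    with pytest.raises(ValidationError):
        load_config(bad)
```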

Pattern 2: Generate Test Skeletons, Then Refactor

We frequently asked the AI to write test scaffolds. The results were hit-or-miss but still useful for structure, fixtures, and naming; we always reviewed the logic and added context-specific assertions.

Pattern 3: Validate Prompt Chains and Agent Steps

Many components relied on prompt-chain logic. We tested these by running dry simulations, capturing intermediate outputs, and validating schema consistency across steps.
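
A dry-run harness can capture each intermediate output and validate it against that step's expected schema before the next step runs. The steps and schemas below are placeholders (Pydantic v2 API assumed):

```python
# Sketch: dry-running a prompt chain with canned step outputs and validating
# each intermediate result against that step's expected schema (Pydantic v2).
# The steps and schemas are placeholders.
from pydantic import BaseModel


class TagFilter(BaseModel):
    layer: str
    osm_filter: str


class StyleRule(BaseModel):
    layer: str
    color: str


def dry_run_chain(steps, initial_input):
    """steps: list of (step_fn, expected_schema) pairs; returns the captured trace."""
    trace, current = [], initial_input
    for step_fn, schema in steps:
        current = step_fn(current)          # dry run: the step returns canned output
        schema.model_validate(current)      # raises if the shape drifted between steps
        trace.append({"step": step_fn.__name__, "output": current})
    return trace


def propose_filter(_):
    return {"layer": "roads", "osm_filter": "highway=*"}


def propose_style(prev):
    return {"layer": prev["layer"], "color": "#888888"}


trace = dry_run_chain([(propose_filter, TagFilter), (propose_style, StyleRule)], {})
```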

Pattern 4: Use Logging as Test Evidence

Some AI-generated behavior wasn't easily testable with asserts. Instead, we relied on structured logs as verification: was the right fallback invoked? Did the agent use the correct schema template?
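
pytest's caplog fixture is one way to turn those logs into test evidence. The example assumes the fallback parser sketched earlier and a hypothetical import path:

```python
# Sketch: using pytest's caplog fixture as evidence that the right fallback ran.
# parse_agent_proposal is the helper sketched earlier; the import path is hypothetical.
import logging

from tilecraft.agent import parse_agent_proposal  # hypothetical import path


def test_fallback_logged_on_bad_json(caplog):
    with caplog.at_level(logging.WARNING, logger="tilecraft.agent"):
        result = parse_agent_proposal("not valid json {")
    assert "unparseable JSON" in caplog.text       # confirms the fallback path fired
    assert result == {"layer": "default", "min_zoom": 0, "max_zoom": 14}
```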

Pattern 5: Human-in-the-Loop QA Checklist

We adopted a simple checklist for reviewing agent-generated pull requests: CLI functionality, fallback behavior, realistic sample testing, prompt readability, and schema contract compliance.

Part 4

From 0 to Production: Technical Deep-Dive

This section offers a granular look at what it actually took to go from zero to production with an AI-augmented workflow focused on data science infrastructure—from ingestion to export.

The Stack

  • Python (3.11) for all pipeline components
  • Pydantic for data validation and configuration schemas
  • FastAPI for exposing processing steps as services
  • Tippecanoe and tilemaker for vector tile generation
  • DuckDB and GeoPandas for local data handling and ETL
  • Jinja2 for templating prompts and config files
  • Rich and Typer for building a modern CLI

Pipeline Overview

The data science workflow had five major stages (a condensed orchestration sketch follows the list):

  1. Data Ingestion: Download and stage OSM extracts, convert to GeoJSON or FlatGeobuf, extract relevant features
  2. Schema Mapping: Normalize attributes across layers, use LLM to propose style schemas, validate with Pydantic
  3. Tiling + Caching: Use Tippecanoe to generate .mbtiles, apply caching logic, compress outputs
  4. Export + Serving: Host tiles locally or upload to S3-compatible storage
  5. Documentation: Auto-generate Markdown docs, create zipped bundles with README and logs
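
A heavily condensed orchestration of these five stages might read as follows; every stage function here is a trivial placeholder, and only the Tippecanoe invocation reflects a real external tool:

```python
# Condensed sketch of the five-stage pipeline. Every stage function below is a
# trivial placeholder; only the Tippecanoe call reflects a real external tool.
import subprocess
from pathlib import Path


def ingest(extract_url: str, out_dir: Path) -> Path:        # 1. download + convert
    return out_dir / "features.geojson"                      # placeholder


def map_schema(geojson: Path) -> dict:                       # 2. normalize + validate
    return {"roads": "highway=*"}                            # placeholder


def export(mbtiles: Path, destination: str) -> None:         # 4. serve or upload
    pass                                                      # placeholder


def write_docs(out_dir: Path, schema: dict) -> None:          # 5. Markdown docs
    (out_dir / "README.md").write_text(f"Layers: {list(schema)}\n")


def run_pipeline(extract_url: str, out_dir: Path) -> None:
    geojson = ingest(extract_url, out_dir)
    schema = map_schema(geojson)
    mbtiles = out_dir / "tiles.mbtiles"
    subprocess.run(                                          # 3. tile with Tippecanoe
        ["tippecanoe", "-o", str(mbtiles), "--force", str(geojson)],
        check=True,
    )
    export(mbtiles, destination="s3://example-bucket/tiles/")
    write_docs(out_dir, schema)
```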

Where the AI Helped

  • Schema proposals: GPT-4 reliably generated proposed JSON schema mappings for common OSM tags
  • Error handling templates: AI helped generate boilerplate try/except scaffolds
  • Markdown documentation: Agent summarized config contents and output formats
  • Naming + Refactoring: AI suggested class and function names when given role-based prompts

Where the AI Struggled

  • Complex conditional logic: Multi-field exceptions made AI suggestions brittle
  • Multi-file context: Without clear scaffolding, LLMs lost track of dependencies
  • Caching and fingerprinting: Schema-aware deduplication had to be hand-coded

By the numbers:

  • 4,632 lines of code across 9 modules
  • 27 tests with 64% coverage
  • 100% of outputs versioned with logs

Part 5

Reflections + Recommendations: The Agentic Playbook

This final piece distills what we've learned into a practical playbook for developers and teams experimenting with AI-augmented workflows. This is not a hype piece—it's a synthesis grounded in actual practices, observations, and limitations.

🧭 Core Principles

  1. AI Is a Tool, Not a Teammate: Treat it like a powerful code-generation utility—not a replacement for design judgment, testing, or architecture.
  2. Agentic Workflows Require Structure: Modularize code, externalize configs, and version prompts. Reproducibility beats cleverness.
  3. Human-in-the-Loop Is Not Optional: Every LLM-generated artifact must pass through a human QA checkpoint.
  4. Failures Are Features, If You Design for Them: Expect errors, hallucinations, and ambiguity. Build fallback logic in from the start.
  5. Documentation Is Part of the System: Use AI to help write it—but make documentation part of the workflow.

🧱 What to Build

Build These:

  • ✅ Config schemas that AI can both read and write
  • ✅ A CLI that emits helpful error states and scaffoldable prompts
  • ✅ A prompt engine with context injection and variable substitution
  • ✅ Logging utilities that include AI provenance
  • ✅ Agent-aware test harnesses and fallback scripts

Avoid These:

  • ❌ Opaque prompt chains with no logging
  • ❌ Storing LLM outputs in production without human review
  • ❌ AI agents with access to destructive file operations

🧩 Where to Start

If you're just beginning to explore agentic development:

  1. Pick a simple CLI-based tool you want to build
  2. Define the core config schema and user inputs
  3. Write out your AI prompts as YAML templates
  4. Use AI to scaffold, but manually test everything
  5. Add structured logging, doc generation, and fallback behavior

The future of AI-augmented software development will not be about autonomous agents building software alone. It will be about designing systems where humans and AI interact fluidly, accountably, and effectively.

Executive Summary

Key Takeaways & Presentation Outline

This executive summary captures the essential insights from the Tilecraft project for teams and decision-makers evaluating agentic development approaches.

Why This Work Matters

  • Software development is changing: LLMs are becoming part of the standard toolchain
  • Developer teams need structured approaches, not just experimentation
  • Agentic development represents a measurable, reproducible pattern worth studying

What Is Agentic Development?

  • Definition: Modular, AI-augmented workflows with human oversight
  • Goal: Augment—not automate—the developer process
  • Structure: Human ↔ Config ↔ AI feedback loops with shared state

Case Study Results: Tilecraft

  • CLI tool to convert OSM → Vector Tiles
  • Built from scratch using AI throughout the lifecycle
  • Emphasis on reproducibility, modularity, and testability
  • Documented metrics: 4,632 LOC, 27 tests, 64% coverage
  • Qualitative outcome: Development felt accelerated with comprehensive AI assistance

Architecture Patterns That Worked

  • Dual loop model: Human + AI feedback cycles
  • YAML-driven prompt templates with version control
  • AI-writable config with human-locked critical fields
  • Self-scaffolding CLI + comprehensive observability layer

Testing & QA Approach

  • Test-first scaffolds with AI-generated test shells
  • Focus on edge cases and boundary conditions
  • Logs-as-verification for dynamic AI behavior
  • Human-in-the-loop QA checklist for all AI outputs

The Agentic Playbook Summary

  • 5 principles: AI is a tool, design for failure, human-in-loop mandatory
  • 5 things to build: versioned prompts, config schemas, observability
  • 3 things to avoid: opaque agents, destructive operations, skipping QA

What Still Needs Research

  • Quantitative productivity benchmarks across different project types
  • Prompt versioning and agent performance tracking standards
  • Long-term maintainability patterns for hybrid human/AI systems