Arcaneum RDR Postmortem
How to measure improvement in a process.
One of my early goals for the RDR process was to enable postmortems so I could measure the drift between the original prompt and the final stable code. That way I could try to improve the process by identifying issues that repeat across RDRs.
You can see the postmortem docs in the Arcaneum repo, but I asked Claude to summarize the results.
From Claude:
Key Findings
17 post-mortems. 208 drift instances. One meta-lesson.
The three biggest sources of plan-vs-reality drift were authoring-process problems, not domain-knowledge gaps — meaning they’re fixable by changing how plans are written.
Code samples in plans are waste. Over-specified code was the #1 drift category (35 instances, 17%). Every single RDR included full class implementations that were completely rewritten. YAML schemas and config formats were never implemented. Code for deferred features was never used. The fix: specify interfaces and behaviors (signatures, contracts, data flow) rather than implementation code.
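One way to make "interfaces and behaviors, not implementation code" concrete in a plan is a typed protocol: signature, types, and behavioral notes, with no implementation body. A minimal sketch, assuming Python; the names `Embedder` and `embed_batch` are hypothetical, not from the Arcaneum codebase:

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Contract only: what a plan would specify instead of a class body.

    Behavior: returns one vector per input text, in input order.
    Errors: implementations should raise ValueError on an empty batch.
    """

    def embed_batch(self, texts: Sequence[str]) -> list[list[float]]: ...


# Any concrete implementation is left to the implementer; this fake
# exists only to show that conformance is checkable.
class FakeEmbedder:
    def embed_batch(self, texts: Sequence[str]) -> list[list[float]]:
        if not texts:
            raise ValueError("empty batch")
        return [[0.0] for _ in texts]


assert isinstance(FakeEmbedder(), Embedder)  # structural conformance check
```

The plan ships the protocol and the behavior notes; the eventual class body, the part that was "completely rewritten" in every RDR, is never written down.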
30-minute spikes would have prevented the biggest surprises. Unvalidated assumptions and framework API errors accounted for 45 drift instances across 13 RDRs. Documentation-only research was consistently wrong in detail — method signatures, filter syntax, model coverage all differed from docs. Searching dependency source code (not just reading docs) would have caught the majority.
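A spike of this kind can be very small. As one sketch of the idea, Python's `inspect` module can confirm a dependency's actual signature, and locate its source for reading, before a plan commits to an API. The stdlib's `json.dumps` stands in here for any installed dependency:

```python
import inspect
import json

# Check the real parameters, not the remembered ones.
sig = inspect.signature(json.dumps)
assert "indent" in sig.parameters      # the keyword the plan relies on exists
assert "pretty" not in sig.parameters  # the misremembered one does not

# And find the actual source to read, rather than trusting docs alone.
print(inspect.getsourcefile(json))
```

The same two calls work on any importable dependency, which is most of what "searching dependency source code" requires.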
GPU/memory management was the largest unplanned effort in the project. 8 of 17 RDRs required GPU acceleration or OOM recovery that wasn’t anticipated. One RDR described GPU integration as “low complexity: just add device='mps'” — the implementation required OOM recovery, GPU poisoning detection, NaN/Inf validation, Metal timeout handling, and systematic garbage collection.
“Day 2 operations” are not optional and not future work. List, delete, verify, backup, and info commands for every resource created were consistently deferred to “Future Enhancements” — then all were needed immediately. Day 2 ops represent 30-50% of the actual implementation effort for any feature that creates persistent resources.
Scope was systematically underestimated by 3-5x. CLI commands grew from 9 planned to 30+ built. A 50-line embedding wrapper became a 1,600-line GPU-aware system. A simple sync module became 140KB. The recurring cause: focusing on the happy path without counting error handling, Day 2 operations, and cross-cutting integrations.
Speculative performance targets were universally ignored. 13 of 17 RDRs included specific numeric targets (e.g., “100 PDFs/min”, “query latency < 50ms”). No benchmark infrastructure was created. No target was validated. They consumed authoring effort and created false precision.
CLI namespaces designed in isolation conflict with each other. 8 RDRs made independent naming decisions that required renames (fulltext→indexes, find→search semantic, inject→store), breaking backwards compatibility.
Existing infrastructure was routinely duplicated. Multiple RDRs designed new components when equivalent modules already existed in the codebase. A 5-minute audit of existing code would have prevented redundant designs.
The Meta-Lesson
Specify interfaces and behaviors, not implementations. Verify APIs against source code, not documentation. Plan for the full scope — Day 2 ops, error handling, and cross-cutting concerns — not just the happy path.
— Summarized from the Arcaneum RDR post-mortem synthesis, Claude Code conversation
With this feedback, and postmortems from other projects, I updated my templates in the RDR repo.
Over time I may be able to come up with some sort of index that represents the drift, and then see whether it decreases.
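The simplest version of such an index would just be drift instances per RDR. A minimal sketch; the function name is mine, and the only real numbers in it are the 17 post-mortems and 208 drift instances from the synthesis above:

```python
def drift_index(drift_instances: int, rdr_count: int) -> float:
    """Average drift instances per RDR; lower is better over time."""
    if rdr_count <= 0:
        raise ValueError("rdr_count must be positive")
    return drift_instances / rdr_count


# Current baseline from the Arcaneum synthesis: 208 instances over 17 RDRs.
print(round(drift_index(208, 17), 1))  # → 12.2
```

A refinement would weight instances by category (over-specified code vs. scope underestimate, say), but even the raw average gives a baseline to compare future projects against.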
I would also be interested in determining whether there are multiple types of RDRs that require different templates. For example, RDRs that specify supporting subsystems (like build tools) might use one template, while RDRs that define application features would use another. Currently I’m using one one-size-fits-all template.