Case Studies
Four systems from the same platform, each a different engineering challenge. Five years of compounding ownership at Roof Maxx Connect.
Idempotent Batch Processing
Leads Invoice Pipeline · Laravel · Redis · QuickBooks API
Automated weekly invoice generation for 200-300 franchise dealers. One job per dealer, 4-layer idempotency, zero duplicates when accounting accidentally ran billing twice in production.
The Problem
Accounting was manually creating invoices in spreadsheets and uploading them to QuickBooks. It took hours and was error-prone. I automated weekly leads billing for the entire dealer network, but the constraints were real: an inherited database schema I couldn't modify, a live sync layer constantly reading and writing to the same tables, and years of manually-created QuickBooks data with no uniqueness enforcement on invoice numbers.
Job Boundaries: One Per Dealer
Deals are grouped by dealer before job creation. Each job receives all billable deals for a single dealer and produces a single invoice. Not one job per deal (too granular, you'd need coordination to assemble shared invoices). Not one giant batch (a single failure kills everything).
Dealer is the right boundary because it matches the business domain (one invoice per dealer per week), provides fault isolation (bad data for dealer A doesn't block dealer B), and enables natural parallelism with no shared state during creation. Five concurrent workers process 200-300 dealer jobs in minutes.
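The grouping step itself is simple. A minimal sketch in Go (the v2.0 stack) rather than the production Laravel code; the types and names here are illustrative:

```go
package main

import "fmt"

// Deal is a billable lead; DealerID determines which invoice it lands on.
type Deal struct {
	ID       int
	DealerID int
}

// groupByDealer partitions deals so each dealer gets exactly one job.
// Bad data for one dealer stays inside that dealer's job and never
// blocks another dealer's invoice.
func groupByDealer(deals []Deal) map[int][]Deal {
	jobs := make(map[int][]Deal)
	for _, d := range deals {
		jobs[d.DealerID] = append(jobs[d.DealerID], d)
	}
	return jobs
}

func main() {
	deals := []Deal{{1, 100}, {2, 200}, {3, 100}}
	jobs := groupByDealer(deals)
	fmt.Println(len(jobs)) // two dealers, two jobs
}
```

Each map entry becomes one queued job, which is what makes the parallelism "natural": no two jobs share a dealer, so no two jobs share state during creation.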
4-Layer Idempotency
Every layer of the system is re-runnable. Run billing twice and you get the same result. Each layer exists because of a specific constraint, not because I read a blog post about idempotency patterns.
Line Item Existence Check
Before creating a line item, the job checks if one already exists for that deal and dealer. Application-level dedup because the inherited schema had no unique constraint, and the table was shared across reporting, sync, and UI systems.
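The check-before-insert pattern, sketched in Go with an in-memory map standing in for the shared table (the production system does this with a Laravel query against the real schema; names are illustrative):

```go
package main

import "fmt"

type lineItemKey struct{ DealID, DealerID int }

// LineItemStore stands in for the shared line_items table, which has no
// unique constraint, so deduplication happens in application code.
type LineItemStore struct {
	existing map[lineItemKey]bool
}

func NewLineItemStore() *LineItemStore {
	return &LineItemStore{existing: make(map[lineItemKey]bool)}
}

// CreateIfMissing is the idempotency layer: re-running the job for the
// same deal/dealer pair is a no-op. Returns true only on first creation.
func (s *LineItemStore) CreateIfMissing(dealID, dealerID int) bool {
	k := lineItemKey{dealID, dealerID}
	if s.existing[k] {
		return false // already billed; skip
	}
	s.existing[k] = true
	return true
}

func main() {
	s := NewLineItemStore()
	fmt.Println(s.CreateIfMissing(42, 7)) // first run: true
	fmt.Println(s.CreateIfMissing(42, 7)) // re-run: false, no duplicate
}
```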
Invoice Number Counter with Row-Level Lock
Invoice numbers are generated from a separate counter table with `lockForUpdate()`. Why a separate table? Locking rows on the main invoices table would block the sync layer (webhooks and polling) that's constantly writing to it. The counter table is purpose-built: locking a row there only blocks other invoice number generation, not reads or writes to invoices.
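The shape of the allocation, sketched in Go with a mutex playing the role of the row-level lock (the production code uses `lockForUpdate()` on the counter row; this is the pattern, not the implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// InvoiceCounter stands in for the dedicated counter table. The mutex
// plays the role of lockForUpdate(): the critical section covers only
// number allocation, never the line-item inserts that follow.
type InvoiceCounter struct {
	mu   sync.Mutex
	next int
}

// Allocate hands out the next number and releases the lock immediately.
// If the job later finds nothing to bill, the number is burned: a gap
// in the sequence, in exchange for dealer jobs running in parallel.
func (c *InvoiceCounter) Allocate() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.next++
	return c.next
}

func main() {
	c := &InvoiceCounter{next: 1000}
	var wg sync.WaitGroup
	seen := make(chan int, 5)
	for i := 0; i < 5; i++ { // five concurrent dealer jobs
		wg.Add(1)
		go func() {
			defer wg.Done()
			seen <- c.Allocate()
		}()
	}
	wg.Wait()
	close(seen)
	unique := make(map[int]bool)
	for n := range seen {
		unique[n] = true
	}
	fmt.Println(len(unique)) // 5 — no collisions
}
```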
QuickBooks Send Guard
Before pushing an invoice to the QB API, the job checks if a `quickbooks_id` already exists. If QB already has it, the send is skipped. This prevents duplicate invoices in the accounting system.
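The guard is a one-line check; sketched in Go with illustrative field names:

```go
package main

import "fmt"

// Invoice mirrors the local record; QuickBooksID is set once QB has it.
type Invoice struct {
	Number       int
	QuickBooksID string
}

// shouldSend is the guard: an invoice that already has a QuickBooks ID
// was delivered on a previous run, so the send is skipped.
func shouldSend(inv Invoice) bool {
	return inv.QuickBooksID == ""
}

func main() {
	fmt.Println(shouldSend(Invoice{Number: 1001}))                        // true: not yet in QB
	fmt.Println(shouldSend(Invoice{Number: 1001, QuickBooksID: "qb-88"})) // false: skip
}
```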
Sync-Side Counter Reconciliation
Every QuickBooks sync operation also bumps the invoice number counter. Even when the system can't fully process an incoming invoice (dealer not found locally, for example), it still protects the sequence. Manually-created invoices in QB can't cause collisions because the counter self-heals on every sync.
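The self-healing rule reduces to "never let the counter fall behind a number QB has seen." A minimal sketch (function name is illustrative):

```go
package main

import "fmt"

// reconcileCounter bumps the local counter past any invoice number
// observed during a QuickBooks sync, including manually created
// invoices the system can't otherwise process. Numbers the generator
// hands out can then never collide with numbers already in QB.
func reconcileCounter(counter, seenNumber int) int {
	if seenNumber > counter {
		return seenNumber
	}
	return counter
}

func main() {
	counter := 1000
	for _, seen := range []int{998, 1203, 1050} { // numbers arriving via sync
		counter = reconcileCounter(counter, seen)
	}
	fmt.Println(counter) // 1203: sequence protected
}
```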
The Burned Invoice Number Trade-off
Invoice numbers are allocated in a short critical section and committed immediately. If the job later finds no valid line items for a dealer, the empty invoice is deleted, but the number is gone. A gap in the sequence.
The alternative was wrapping everything in a single transaction: generate the number, create the invoice, insert all line items, then commit or rollback. But the lock on the counter row would be held for the entire duration of N line item inserts. Every other dealer job running in parallel would block on that same row.
The cost: invoice numbers have occasional gaps. The gain: five workers run in parallel instead of queuing behind one lock. Stakeholders accepted the trade-off.
Production Validation
The idempotency design was validated in production when an accounting team member accidentally triggered the billing batch twice. The TDD test for this exact scenario was written before it happened.
> **Accounting:** Help me huge issue. Batch is running again can you stop it. It already ran through once.
>
> **Accounting:** A market was accidentally deleted from the pricing spreadsheet. Only affects one dealer, but I already need to reprice and re-run. Some invoices already went out.
>
> **Me:** Clicking the job twice shouldn't be an issue, but if there are price adjustments, it's probably best for me to delete the invoices before they are sent to QB, and then you can remake the invoices from step #1 again.
>
> **Me:** If you delete just the invoices that were affected and re-run the batch, it will not duplicate any line items that have already been made, it will re-create what was deleted.
>
> **Me:** I built the system so we could technically re-run billing for past weeks if there were any data issues when creating the leads. So that required making sure that nothing would be duplicated!
>
> **Accounting:** THAT IS SO AWESOME!
>
> **Accounting:** It shows 69 jobs processing but only one dealer should have changes, correct?
>
> **Me:** Check "Pending QB Send" — looks like there are 4 invoices there now. It shows a job for each dealer, but would only recreate missing line items like I said.
>
> **Accounting:** Wow that is so cool
Hindsight
- If I could modify the schema, I'd add a composite unique constraint to make the idempotency guarantee database-enforced rather than application-enforced. Application-level worked, but database-level is stronger.
- I'd add a circuit breaker on QB API calls rather than relying purely on retry with exponential backoff.
- The upsert logic uses last-write-wins. The system already stores `SyncToken` and uses it for deletes and sparse updates, but the reconciliation path doesn't compare it before overwriting. Adding that check would prevent unnecessary overwrites during the webhook-plus-polling race condition.
Billing Pipeline
$155M+ Processed · 82K Invoices · QuickBooks · Laravel
Rebuilt billing from CSV uploads to webhook-driven QuickBooks sync. $155M+ processed, 82K invoices, three invoice models. Found a campaign attribution design flaw during code research, months before it surfaced in production.
What I Inherited
The billing "system" was a CSV upload flow. Someone exported invoice data from QuickBooks, uploaded it through an admin page, and records were created in the local database. No OAuth, no API calls, no real-time sync, no way to push invoices back to QuickBooks. Payment status was a mystery unless someone re-uploaded a newer CSV.
What I Built
I built the QuickBooks OAuth2 integration from scratch, then layered on bidirectional API sync: invoice creation, webhook-driven status updates, payment reconciliation through Armatic, and automated accounting reports. The system evolved through three invoice models as the business grew: corporate dealer billing, consumer invoicing, and per-dealer franchise QuickBooks connections for field service operations.
Two independent fields on every invoice. QB sync status tracks delivery; payment status tracks collection. Both are updated via webhooks from QuickBooks.
Armatic operates independently against QuickBooks — RMC never calls Armatic directly. All status updates flow back through QB webhooks.
The Campaign Pricing Bug
I had inherited the repricing tool, and while doing research for a pricing update I read through it and realized it was built on an assumption that would break.
The pricing engine looks up every zone a deal's zipcode belongs
to, finds active campaigns in each zone, calculates the
cost-per-lead for each campaign, and bills the deal at the highest
CPL. Then it stores a single
billing_campaign_id
on the deal. This works when there's one campaign per DMA.
When marketing runs two campaigns in the same DMA, billing is still correct (the deal gets priced once at the higher CPL, dealers pay the right amount), but attribution breaks. Only one campaign gets the foreign key. Reports query by geography, not by attribution, so both campaigns "claim" every deal in overlapping zipcodes. Lead counts are inflated. CPL calculations show both campaigns as cheaper than reality.
- Correct attribution: 10 leads in the overlap, $150 total spend, $150 ÷ 10 = $15 CPL. Each deal is attributed to the TV campaign ($100 spend, higher CPL).
- What reports show: because they query by geography, both campaigns claim all 10 deals, so the DMA report counts 20 leads and $150 ÷ 20 = $7.50 CPL.
- Reality: 10 deals, $150 total, true CPL = $15.
I discovered this during code research and documented the risk months before it became a production problem. When someone in marketing eventually created overlapping campaigns, the exact scenario I'd documented played out: budget allocation decisions were being made on flawed CPL data, while invoices remained correct. A real fix requires multi-campaign attribution with a pivot table, fractional weights, and reports rewritten to query attribution records instead of geography. It's one of the reasons the 2.0 architecture exists.
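The core of that fix is summing fractional attribution weights instead of counting each deal once per overlapping campaign. A sketch of the idea in Go; the pivot-table row shape and the even 50/50 split are illustrative assumptions, not the shipped design:

```go
package main

import "fmt"

// Attribution is a row in a hypothetical deal/campaign pivot table:
// one deal can be split across the campaigns that overlap its zipcode.
type Attribution struct {
	CampaignID string
	Weight     float64 // fractional share of the deal
}

// attributedLeads sums fractional weights per campaign, so a deal in
// two campaigns' territory counts as one lead in total, not two.
func attributedLeads(rows []Attribution) map[string]float64 {
	leads := make(map[string]float64)
	for _, r := range rows {
		leads[r.CampaignID] += r.Weight
	}
	return leads
}

func main() {
	// 10 deals in the overlap, split evenly between TV and digital.
	var rows []Attribution
	for i := 0; i < 10; i++ {
		rows = append(rows,
			Attribution{"tv", 0.5},
			Attribution{"digital", 0.5})
	}
	leads := attributedLeads(rows)
	// A geographic query would count 10 leads per campaign (20 total);
	// the attribution query sums back to the 10 deals that exist.
	fmt.Println(leads["tv"] + leads["digital"]) // 10
}
```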
Trade-Offs and Hindsight
- Dual invoice models. I built separate models for corporate billing and franchise dealer billing rather than extending the existing one. The legacy model had a mountain of reporting code on top of it, and the franchise system had completely different requirements (per-dealer QB connections, estimate-to-invoice workflows). In a greenfield, I'd use a single polymorphic invoice table. Given the constraints, the split was the right call.
- Config-driven pricing. Week-based price adjustments and QB product IDs lived in PHP config files. Version-controlled and auditable, but it meant random mid-day deployments for pricing changes that should have been a database update. I was early in my career and deferred to business judgment on tooling priorities. I'd push back harder now.
- Integration contracts. The QB, Armatic, and HubSpot integrations all do bidirectional sync with varying reliability. Each system can update the others, leading to sync drift when any link fails silently. In a redesign, I'd define clear data ownership: one authoritative source per entity, with other systems subscribing via events and explicit conflict resolution.
Async-First Architecture
20K Jobs/Day · 8 Supervisors · Laravel Horizon · Redis
Single-server monolith crashing under load. Migrated to async-first with dedicated worker infrastructure. 8 specialized supervisors, 20K jobs/day, zero peak-hour crashes. Each supervisor was justified by a specific production failure, not upfront design.
The Problem
RMC ran on a single DigitalOcean server: web, database, and all background work on one box. Controller actions handled everything synchronously in the request thread. A controller calling three services would hold the connection open for all three. If any service was slow, the request timed out. When enough requests timed out simultaneously, the server ran out of connections and the whole application went down.
Diagnosing these crashes was nearly impossible. The actual root cause of several outages turned out to be artisan commands that a teammate had built for reporting. They were run manually in production with no proper termination logic. They continued running as worse-than-zombie processes: still actively consuming memory with no kill mechanism. The single-server architecture made it impossible to isolate which process type was causing resource exhaustion.
The Decision
I evaluated three options. Optimizing the synchronous path would treat symptoms without solving resource isolation or the zombie process problem. Running a basic queue on the same server would still let a runaway worker take down the web server. The third option, async-first with dedicated worker infrastructure, provided full resource isolation and independent scaling, but added real complexity: queue topology, worker management, progress tracking for user-facing operations, and team pushback about "overcomplicating things."
I chose the third option. The team saw queued jobs as overcomplication. They were right that it adds complexity. They were wrong that the alternative was simpler. The alternative was timeouts, zombie processes, and cascading crashes that were harder to debug and harder to fix. The complexity was already there. The queue system made it visible and manageable.
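The production system is Laravel queues on Horizon, but the principle is stack-agnostic: the request handler enqueues and returns, and a fixed pool of workers drains the backlog at its own pace. A minimal Go sketch of that decoupling (names illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Job is any unit of work that used to run inline in the request thread.
type Job func()

// Queue decouples accepting work from doing it: Dispatch returns
// immediately, and a bounded worker pool drains the channel.
type Queue struct {
	jobs chan Job
	wg   sync.WaitGroup
}

func NewQueue(workers, buffer int) *Queue {
	q := &Queue{jobs: make(chan Job, buffer)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for job := range q.jobs {
				job()
			}
		}()
	}
	return q
}

func (q *Queue) Dispatch(j Job) { q.jobs <- j }

// Shutdown stops accepting work and waits for the backlog to drain.
func (q *Queue) Shutdown() {
	close(q.jobs)
	q.wg.Wait()
}

func main() {
	var mu sync.Mutex
	done := 0
	q := NewQueue(3, 10)
	for i := 0; i < 10; i++ {
		q.Dispatch(func() {
			mu.Lock()
			done++
			mu.Unlock()
		})
	}
	q.Shutdown()
	fmt.Println(done) // 10
}
```

The bounded pool is also what gives you resource isolation: a runaway job exhausts its worker, not the web server's connection pool.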
Queue Topology
I added priority tiers incrementally as problems surfaced. Each supervisor exists because of a specific production incident, not because I designed 8 supervisors on a whiteboard.
| Supervisor | Workers | Memory | Timeout | Why It Exists |
|---|---|---|---|---|
| default | 10 | 768MB | 90s | General purpose work |
| webhook | 5 | 768MB | 90s | QB webhooks were backing up behind batch jobs |
| billing | 2-5 | 256MB | 120s | User-initiated billing was waiting behind background pricing |
| communications | 5 | 256MB | 120s | Individual messages were delayed by bulk SMS campaigns |
| integrations | 3 | 512MB | 360s | Complex multi-step QB syncs needed dedicated memory |
| sequential | 1 | 256MB | 120s | Data consistency issues from concurrent operations |
| exports | 3 | 4GB | 4 hours | Large report generation was OOM-crashing shared workers |
The Memory Leak Crisis
In April 2025, Redis latency spiked to 25,000ms. Queues backed up across all supervisors. The root cause was Horizon's cache tracker consuming unbounded memory, which starved Redis, which caused queue operations to time out, which caused more backups. A cascading failure starting from a monitoring feature.
Removed the cache tracker. Redis latency dropped to 0.5ms. The lesson: observability infrastructure needs the same resource discipline as production workloads. Your monitoring can be the thing that takes you down.
Hindsight
- The incremental approach was right. Designing 8 supervisors upfront would have been speculative. Each one was added in response to a real production problem, which meant configuration was driven by actual workload characteristics, not guesses.
- I should have pushed for server separation earlier. The single-server architecture masked the zombie process problem for months. If web, worker, and database had been isolated from the start, those runaway commands would have only affected their own box.
- The artisan-command-in-controller pattern was a code smell I should have fought harder. Calling CLI commands from web request handlers bypasses the queue system entirely and creates processes invisible to Horizon. Eliminating it was part of the migration. In a greenfield, this would be a team standard from day one.
Platform Migration
Monolith → Microservices · Go · GCP · Pub/Sub
Migrating a Laravel monolith to Go microservices on GCP. The team originally planned a full rewrite, but business pressure forced a pivot: scope down to Lead Gen as the first domain, reuse v1.0 where it works, and prove the new architecture before committing the rest of the platform. I lead the Lead Gen domain.
The System We Were Replacing
RMC v1.0 was a Laravel monolith built fast in the company's early days. It worked. It powered 340+ dealerships and processed $155M in billing. But five years of feature pressure had left the architecture with real problems: controllers calling other controllers, business logic scattered across jobs, listeners, and artisan commands, no clear service boundaries, and a single MySQL database where every feature shared every table.
The issues I described in the async case study above (zombie processes, cascading crashes, the single-server resource exhaustion) were symptoms of the same root cause. The platform had outgrown its architecture. Async workers and server separation bought time, but the underlying coupling meant that changes in one domain still risked breaking another.
The Stack Decision
A new CTO joined and brought on a couple of developers the team had worked with before. They spent time on the v1.0 codebase first, then we all sat down to discuss what the next version should look like. Everyone agreed the architecture needed to change. The question was what to rebuild with.
The two serious options were Go + React or staying in the Laravel ecosystem with React on the frontend. The team was roughly split on experience. Nobody had a strong objection to Go, and a few factors tipped the decision: we were planning to lean heavily on AI-assisted development, and at the time Go and React had significantly more representation in AI training data than PHP. The tooling was noticeably better for generating idiomatic Go than idiomatic Laravel.
The honest truth is that the v1.0 architecture problems had gotten associated with Laravel itself, which wasn't entirely fair. You can build well-structured microservices in Laravel, and packages like Cashier would have handled most of our Stripe integration out of the box instead of rebuilding it from an SDK. With how much AI tooling has improved since, both approaches look viable to me today. But the team committed to Go and React, and the decision has held up.
The Big-Bang Plan (and Why We Abandoned It)
The original plan was a full rewrite: break the monolith into Go microservices on GCP, event-driven communication through Pub/Sub and Cloud Tasks, each service owning its own database schema, the whole picture. We wrote ADRs, defined service boundaries, and mapped out every domain: billing, CRM, lead routing, territory management, integrations, notifications.
Then business reality intervened. Marketing costs were increasing and dealers were frustrated with the existing lead billing model. The company needed a new lead generation system, one that shifted financial risk to dealers and eliminated the CPL disputes that were eating up operational time. They needed it on a timeline that a full platform rewrite couldn't deliver.
The Pivot: Lead Gen as Proving Ground
The team and stakeholders made a pragmatic decision: instead of rewriting everything, scope down to just the services needed for the new Lead Gen product. Build those as v2.0 microservices. Let v1.0 keep running for everything else. Where v1.0 already worked well (user management, the dealer portal, HubSpot sync), keep using it with minimal additions to communicate with the new services.
This did three things at once. It dramatically reduced scope to something the team could ship on the timeline the business needed. It gave us a real production environment to validate the new architecture (service boundaries, event-driven communication, the deployment pipeline) before committing the rest of the platform. And it delivered actual business value: a new revenue model for the company, not just a technical migration for its own sake.
I lead the Lead Gen domain. That meant owning the routing service design, coordinating with the billing and marketing services, and figuring out where v1.0 ends and v2.0 begins for each feature in the lead lifecycle.
How We Built It
The planning started outside the engineering team. For Lead Gen, we worked with stakeholders on a preliminary document to gather business requirements: the routing logic, payment flows, dealer contribution models, use cases, and open questions for operations, finance, and marketing. Many of the first design questions got answered there. As those details refined into formal specs, new questions surfaced in the documents themselves, and it was on the developer writing the spec to go back to stakeholders for answers. Not just within tech, but anyone in the company who had context we needed to get the design right.
That process produced ADRs for major architectural decisions, product requirements documents for each domain, and technical specs for implementation. The CTO, the developers, and I all contributed and reviewed each other's work. This was a deliberate departure from v1.0, where we followed agile's preference for working software over thorough documentation. In practice, that meant v1.0 docs were incomplete or outdated because systems changed faster than anyone could keep them current. With v2.0, the documents serve as both the implementation spec and the reference guide for each service. A non-technical stakeholder can read a PRD and understand what the system does. The documentation is treated like written code, not an afterthought.
Every service follows the same hexagonal architecture: business logic in an action layer with no framework dependencies, HTTP handlers as thin adapters, infrastructure concerns isolated behind interfaces. Shared Go packages handle the cross-cutting concerns (structured logging, correlation ID propagation, event publishing, authentication) so each service starts with consistent patterns instead of reinventing them.
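The layering looks roughly like this; a condensed Go sketch, with the repository interface, action function, and fake adapter all illustrative names rather than the real service code:

```go
package main

import "fmt"

// LeadRepo is the port: the action layer depends on this interface,
// never on a concrete database driver or HTTP framework.
type LeadRepo interface {
	Save(dealerID int, source string) (int, error)
}

// RouteLead is the action: business logic with no framework imports,
// testable against any LeadRepo implementation.
func RouteLead(repo LeadRepo, dealerID int, source string) (int, error) {
	if dealerID <= 0 {
		return 0, fmt.Errorf("invalid dealer %d", dealerID)
	}
	return repo.Save(dealerID, source)
}

// memRepo is an adapter: in production this would wrap the service's
// own database; in tests it's an in-memory fake like this one.
type memRepo struct{ nextID int }

func (m *memRepo) Save(dealerID int, source string) (int, error) {
	m.nextID++
	return m.nextID, nil
}

func main() {
	repo := &memRepo{}
	id, err := RouteLead(repo, 42, "web")
	fmt.Println(id, err)
}
```

An HTTP handler in this scheme does nothing but decode the request, call `RouteLead`, and encode the response, which is what keeps the adapters thin.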
When the team needed to work on services in parallel, everyone was building from the same specs. If someone wanted to know why we chose Stripe over QuickBooks Payments for card processing, the ADR explained the reasoning and the alternatives we rejected.
Where It Stands
The CRM service is stood up and the billing and lead services are in active development with the event-driven infrastructure wired and tested. The shared Go packages are stable and used across all services. Lead Gen is on track for production.
v1.0 keeps running alongside v2.0. That's by design, not by accident. The bridge approach means new services go live and start handling their domain while the legacy system continues to serve everything else. There's no big-bang cutover date. Each domain migrates when its v2.0 service is ready, validated, and the team is confident in the handoff.
Hindsight
- We all knew the big-bang plan was wrong. That's not hindsight. Everyone acknowledged the risk when we agreed to it. What made it feel acceptable was the document-driven approach: you can review and validate specs the way you test code, and because our documents were so close to implementation detail, we had more confidence in what could get done than you'd typically have in a rewrite plan. That confidence was real. But scoping to one domain was still a strictly better path. It proved the architecture, tooling, and team workflow with a fraction of the risk, and delivered business value at the same time.
- Go means rebuilding things Laravel gives you for free. Laravel Cashier handles Stripe customers, subscriptions, payment methods, invoices, and webhooks with a few config lines. In Go, we're building all of that against the Stripe SDK directly. It's the right trade-off for the type safety, performance, and concurrency model the team wanted, but the cost is real and worth acknowledging.
- The doc-driven process front-loaded effort but eliminated rework. Writing ADRs and specs before code felt slow at first. But when three developers are building services that need to talk to each other, having the contracts defined upfront meant we weren't discovering integration mismatches in code review. The specs became the source of truth, not the code.
- Both stacks could have worked. I still think Laravel microservices with React would have been a valid approach, especially given how much AI tooling has improved for PHP since we made the call. What mattered more than the stack choice was that the team aligned on one direction and committed to it. The worst outcome would have been half the team building in Go and half wishing they were in Laravel.
Let's talk
If these are the kinds of systems your team is building, I'd like to hear about it.
Get in Touch