The Missing Layer in AI Workflows: Context Architecture

Rudger de Groot
Founder & CRO

To start with a disclaimer: we are not in the business of selling software. We are a CRO agency that dove head-first into competing with AI, and we can help you do the same.

Like many of you, my colleagues and I are self-educated. As such, we can rely on tons of practice-based knowledge and hard-earned experience. Throughout our careers in coding and A/B testing, we invested in staying at the forefront of our field.

But at the beginning of 2025, after years of waiting in the wings, AIs suddenly took off in a big way. There they were, highly skilled and usable for everyone with an IP address. It would be an understatement to say that it rattled our community around the globe.

We had a ton of questions. The most urgent: how quickly can AIs catch up to us? And will they make us obsolete?

A few months into this AI boom, we started thinking: why don’t we make ourselves obsolete? Why don’t we create a tool that can do large parts of our work, at the same (here comes the boasting 😁) award-winning level as us? An AI of our own that can analyse A/B experiment data and create a detailed, thorough report.

Not the smallest of goals, but with the use of AIs, it took just a month or two to create a basic prototype. Another month or two later, we were ready for the first production tests. At this point, we were at approximately 90% completion. Then we started our ‘walk of nines’.

This walk – the road to reach 99.9% completion – took the better part of five months.

Not talking the talk but walking the walk

For those of you who aren’t in the software-creating business: the ‘walk of nines’ is that incremental push from 90% quality, to 99% quality, to near perfection at 99.9%. During this walk, each step gets exponentially harder, and each success is exponentially more satisfying. Let me highlight three of the milestones we reached along the way:

Adding structured output parsers so responses are consistently formatted and easy to automate.

Implementing sub-agents, so each task gets its own clean context window.

Creating skillsets that ensure AIs use the right scripts in the correct order.
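To make the first milestone concrete, here is a minimal sketch of a structured output parser. The field names and validation rules are hypothetical, not our production schema; the point is that the model’s raw text is checked against a fixed shape before anything downstream touches it:

```python
import json
from dataclasses import dataclass

@dataclass
class ExperimentReport:
    # Hypothetical report fields; a real schema would be richer.
    experiment_id: str
    winner: str
    uplift_pct: float

# Expected field names and types for the model's JSON output.
REQUIRED = {"experiment_id": str, "winner": str, "uplift_pct": float}

def parse_report(raw: str) -> ExperimentReport:
    """Validate raw model output against the schema, so automation
    never receives a malformed or partially filled report."""
    data = json.loads(raw)
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            # In practice you would feed this error back to the model for a retry.
            raise ValueError(f"missing or mistyped field: {field}")
    return ExperimentReport(**{k: data[k] for k in REQUIRED})

report = parse_report(
    '{"experiment_id": "exp-42", "winner": "variant_b", "uplift_pct": 4.2}'
)
```

The win is less about catching errors than about making every downstream step assume a guaranteed shape instead of re-parsing free text.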

Why we chase that last 0.1% (and it’s never actually the last)

About nine months into the project – after a lot of trial and error, too many late nights, a few big scares (‘Oh my God, the whole setup deleted itself!’), some setbacks, an accidental insight, and several celebrated breakthroughs – we were this close to excellence.

But something critical was still off.

It wasn’t the analysis our AI generated; that was accurate each time. The tone was also just right. The formatting was consistent. But occasionally, our AI got crucial details wrong.

It would emphasise the wrong metric for a B2B client. Or use casual language in a premium product context. Or recommend a strategy that made sense in general, but ignored a client-specific constraint we had documented elsewhere.

This last 0.1% was killing us, because it didn’t save us time. We just spent our time differently. While the AI derived an elaborate analysis from the experiment data (it did our job), we still had to comb through every report, looking for small but significant mistakes and what caused them.

We hoped to find something these faulty details had in common so we could create an overall solution. And eventually, we did: our setup lacked the proper context to reach perfection. Or, more accurately: to reach perfection until the next challenge occurs 😆.

Context gets you the last 0.1% – unless it is scattered

So, what is this ‘context’ exactly? As an agency that serves multiple clients, we need to keep our workflows and prompts generic and scalable. It is a ‘one size fits all’ approach, because our AI has the intelligence to derive actionable insights from each data set we feed it.

To translate these insights into advice that is perfectly tailored to each client, we built in context at the client level:

  • Client A’s brand voice guidelines are embedded in workflow 1.
  • The same client’s strategic priorities are hard-coded in workflow 2.
  • Their approval process is documented in workflow 3.
  • Et cetera.

In this architecture, we were fixing context gaps locally within individual workflows, which seemed okay because it worked. Outputs improved.

But we discovered that this – the location of our context – was what kept us from achieving that last 0.1% success; the context was scattered across workflows.

Every time we optimised one workflow’s context, we would later discover the same gaps in others. And each time a client’s variables changed, we had to hunt down every workflow that referenced them.

So, we may have built technically excellent workflows, but our context architecture was a mess. This made onboarding, maintenance, and offboarding a nightmare: impossible to sustain.

We needed to centralise context. Manage it actively. Strengthen the complete architecture of our AI.

Our architecture approach

The principle is: generic workflows + generic agents + the proper context = consistently excellent, client-specific output.
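That principle can be sketched in a few lines. Everything here is illustrative (the function, context fields, and client records are hypothetical), but it shows the shape: one generic workflow, with the client-specific part injected from a single context record rather than hard-coded:

```python
def build_prompt(task: str, client_ctx: dict) -> str:
    """Generic workflow step: merge one central context record into a
    generic task template instead of hard-coding client details."""
    return (
        f"Tone of voice: {client_ctx['brand_voice']}\n"
        f"Primary metric: {client_ctx['primary_metric']}\n"
        f"Task: {task}"
    )

# Hypothetical central context store, one record per client.
CLIENTS = {
    "client_a": {"brand_voice": "formal, B2B", "primary_metric": "lead quality"},
    "client_b": {"brand_voice": "playful, B2C", "primary_metric": "conversion rate"},
}

# The same workflow produces two tailored prompts:
prompt_a = build_prompt("Summarise experiment exp-42", CLIENTS["client_a"])
prompt_b = build_prompt("Summarise experiment exp-42", CLIENTS["client_b"])
```

When a client’s variables change, you edit one record, not every workflow that mentions them.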

The key was understanding where different types of context belong. So let’s get a bit more technical and explore the three pillars of our architecture.

Imagine you have hundreds of commands running, each capable of various tasks (a reality for us). This can make it very hard for the AI to find and use the proper commands. You help your AI navigate and utilise all this information by creating sub-agents with context, skills with context, and a database with context. Now you can specify which commands are relevant for which task and how they should be executed, using good and bad examples.

In other words, you create detailed instructions for numerous possible situations and ensure an autonomous workflow with consistent, reliable results.

Sub-agents with context – these execute a specific task, and because of this compartmentalisation, they can use all of their memory for that single task. The sub-agent is also equipped with the relevant skills for the task it needs to perform within a workflow. This means the same sub-agent can be used across different workflows, but in each workflow it can have different skills with a specified context. It’s how Neo in the Matrix instantly becomes a kung-fu master, simply by having the operator upload that skill. (Claude Code Skills are practically the same concept as skills within the Matrix.)
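A toy sketch of that idea, with hypothetical names throughout: each sub-agent starts with an empty, private context and can only use the skills the enclosing workflow hands it, so the same agent class is reusable across workflows with different skill sets:

```python
class SubAgent:
    """Minimal sub-agent sketch: a fresh context window per agent,
    plus only the skills assigned by the enclosing workflow."""

    def __init__(self, name: str, skills: dict):
        self.name = name
        self.skills = skills   # task name -> callable implementing the skill
        self.context = []      # private context window, empty at start

    def run(self, task: str, payload):
        if task not in self.skills:
            # The agent cannot improvise outside its assigned skills.
            raise KeyError(f"{self.name} has no skill for task: {task}")
        self.context.append(task)  # only this agent's own task history
        return self.skills[task](payload)

# The same class, equipped differently per workflow:
analyst = SubAgent("analyst", {"summarise": lambda d: f"summary of {len(d)} rows"})
result = analyst.run("summarise", [1, 2, 3])
```

The real mechanics in an agent framework are richer, but the separation is the point: the agent’s memory holds one task, nothing else.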

Skills with context – the significant advantage of skills is progressive command discovery: instructions are loaded as they become relevant, rather than as one huge 1600-token dump upfront. This again saves us a lot of valuable context-window memory without sacrificing performance.

  • System prompts (the “skills” level in Claude Code) should contain generic variables:
    • Quality standards that apply universally, e.g., experiment runtime between 2–4 weeks, default start days Tuesday, Wednesday, Thursday, or Friday, or sample-size thresholds.
    • The how of task execution.
    • Definition of team member roles and responsibilities.
  • User prompts (the execution level) should contain specific variables:
    • Which client is it for?
    • Which team members are on the team, and what is their role?
    • Which industry context applies?
    • The what and for whom of this task.

This separation ensures skills are reusable while outputs remain personalised.
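As a sketch of that separation (the prompt wording and the quality-standard values below are illustrative, not our production prompts): the generic standards sit once in the system prompt, while everything client-specific arrives per request in the user prompt:

```python
# Generic, reusable layer: the same for every client.
SYSTEM_PROMPT = (
    "You are an A/B test analyst.\n"
    "Quality standards: experiment runtime 2-4 weeks; start on "
    "Tuesday-Friday; respect sample-size thresholds.\n"
    "Always explain how each conclusion follows from the data."
)

def user_prompt(client: str, industry: str, team: dict) -> str:
    """Execution layer: the specific variables for this one request."""
    roles = ", ".join(f"{name} ({role})" for name, role in team.items())
    return (
        f"Client: {client}\n"
        f"Industry: {industry}\n"
        f"Team: {roles}\n"
        "Task: analyse the attached experiment data and draft the report."
    )

msg = user_prompt("Client A", "B2B SaaS", {"Rudger": "strategist"})
```

Swap the user prompt and the skill stays untouched; tighten a quality standard and every client benefits at once.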

Database with context data – as we wrote earlier, we have a database at the client level. This is where we store all relevant context, including company information, terminology, tooling configuration, and policies. Around this database, we have created (and continue to create) tooling to make maintenance as simple as possible. The great thing about this centralised database is that workflows can remain generic, and it’s easier to build them because all the context is in one place. Workflows become multi-client and, through configuration, tailored.
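One way such a client record might be shaped (the structure and field names below are illustrative; the real thing lives in a database with maintenance tooling around it), with each generic workflow pulling only the slices it needs:

```python
# Hypothetical centralised client-context record.
client_record = {
    "client_id": "client_a",
    "company": {"name": "Example B.V.", "industry": "B2B SaaS"},
    "terminology": {"conversion": "qualified demo request"},
    "tooling": {"ab_platform": "GrowthBook"},
    "policies": {"approval_required": True, "tone": "formal"},
}

# Which context slices each workflow type is allowed to read.
WORKFLOW_NEEDS = {
    "report": ["terminology", "policies"],
    "analysis": ["tooling"],
}

def context_for(workflow: str, record: dict) -> dict:
    """Hand a workflow only the context it needs, straight from the
    central record, so nothing is hard-coded inside the workflow."""
    return {k: record[k] for k in WORKFLOW_NEEDS[workflow]}

ctx = context_for("report", client_record)
```

Onboarding a new client then means writing one record; offboarding means deleting it, with no workflow surgery either way.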

2026 can be the year

Several trends are converging as AI evolves:

1. AI tools are commoditising – the model isn’t the differentiator anymore.
2. Workflow builders are maturing – n8n, Make, and Zapier are ready for context-aware architectures.
3. Organisations are scaling AI – ‘copy-paste-modify’ doesn’t work at 50+ workflows.
4. Quality expectations are rising – users spot generic AI output immediately.
5. Costs are shifting – engineering time now outweighs token costs.

Organisations that recognise this convergence will gain leverage. Those that don’t are destined to drown in workflow maintenance.

If you are serious about levelling up your experimentation program – whether that’s scaling your A/B testing, building smarter analysis workflows, or just getting your AI to finally understand your business – we’d love to talk.
