Field notes
How to Implement AI in Your Organization Without Lighting Money on Fire
My DMs are full of "look what Claude did in 10 seconds" posts. And honestly, it can. I've watched AI handle drafts, mockups, data cleanup, and first passes on almost anything I throw at it.
But every time I take that into a real project, the same questions come up. Does it work in my environment? Can I ship this? How do I know it didn't make something up?
The AI part is fast. The building, deploying, and testing still take time. Most of the effort goes into the parts nobody posts about.
The reality behind the demos
McKinsey's State of AI 2025 puts the share of companies using AI in at least one function at 88 percent. Only about a third have scaled past pilots. Just 21 percent have redesigned a workflow around AI. Only 39 percent report any measurable EBIT impact from AI in the past year, and most of those say AI accounts for under 5 percent of EBIT. Six percent qualify as AI high performers, the ones capturing 5 percent or more of their EBIT from AI use, and they look almost nothing like the rest.
Translation: everyone is using AI. Almost no one is getting paid for it yet.
The gap is not the model. The gap is the system around the model.
Klarna is the cautionary tale every CEO should keep on their desk. Their OpenAI-powered assistant handled 2.3 million conversations in its first month, did the work of about 700 full-time agents, cut average resolution time from 11 minutes to under 2, and ran in 35-plus languages across 23 markets. By mid-2025, Klarna was rehiring humans. The CEO admitted they had pushed too hard on efficiency at the cost of quality. The AI is now the front line, with human escalation underneath.
Both halves of that story are true. AI moved enormous volume. AI also created a brand problem that took human labor to fix. The lesson is not that AI failed. The rollout treated AI as a replacement instead of a system component, and that is what cost them.
The pattern repeats across industries. A bank deploys an internal copilot, watches usage spike for two weeks, and then watches it die because nobody put real reference data behind it and the answers were too generic to trust. A retailer launches a marketing AI, generates hundreds of off-brand ads, and shelves the whole project for a year. A law firm pilots contract review on a single deal, gets useful output, never builds the verification step, and quietly stops using it the day a partner finds a hallucinated citation. Different industry, same root cause: the AI is fine. The system around it does not exist.
The system is the boring part. The system is the point.
Here is the approach I use across proptech, fintech, and marketing teams. None of it is exotic. It comes down to the same five steps any operations leader would work through before rolling out new software. AI just makes the consequences faster.
Before any of the five steps below, three questions need an answer in writing. Who owns AI inside the company. Which problem we are solving first. What "done" looks like in numbers. If those answers do not exist, you do not have an AI initiative. You have a Slack channel.
On ownership: in companies under 200 people, this is usually the COO, the head of product, or a strong operator the CEO trusts. In larger orgs, it is a small cross-functional council. Either way, one named person is accountable for the rollout, the policy, and the budget. AI without an owner becomes everybody's side project and nobody's P&L.
1. Load your brand, your context, and your data first
Before anyone in the org touches a model, your tone of voice, product positioning, brand guidelines, customer policies, and reference materials need to live somewhere the AI can read them. Otherwise the output sounds like a chatbot, not your company.
The simplest version is a structured project knowledge base inside whatever tool the team already uses. Claude Projects, custom GPTs, and Gemini Gems all let you upload reference files and define a system prompt once. For larger contexts that span Slack, Notion, Drive, email, and a dozen SaaS tools, an internal knowledge layer like Glean or Mem0 starts paying off.
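To make the "define a system prompt once" idea concrete, here is a minimal Python sketch that assembles a system prompt from a folder of reference files. It is not how Claude Projects, custom GPTs, or Gemini Gems work internally. The function names, the folder layout, and the bracketed section format are all my assumptions for illustration.

```python
from pathlib import Path


def load_reference_docs(reference_dir: str) -> dict[str, str]:
    """Read every .md or .txt file in a folder into a {title: text} mapping."""
    docs = {}
    for path in sorted(Path(reference_dir).iterdir()):
        if path.is_file() and path.suffix in {".md", ".txt"}:
            docs[path.stem] = path.read_text(encoding="utf-8")
    return docs


def build_system_prompt(intro: str, docs: dict[str, str]) -> str:
    """Join an intro and the reference documents into one system prompt."""
    sections = [intro.strip()]
    for title, body in docs.items():
        if body.strip():  # skip empty files
            sections.append(f"[{title}]\n{body.strip()}")
    return "\n\n".join(sections)
```

The point of the sketch is the habit, not the code: brand voice, policies, and positioning live in versioned files, and the prompt is rebuilt from them instead of being retyped by each person.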
The bigger shift is the Model Context Protocol. Anthropic released MCP in late 2024 as an open standard for connecting models to your tools and data. By April 2026, 78 percent of enterprise AI teams report at least one MCP-backed agent in production, and Anthropic has donated the protocol to the new Agentic AI Foundation under the Linux Foundation, co-founded with Block and OpenAI and backed by Google, Microsoft, AWS, and Cloudflare. If you are picking AI tooling now, MCP support is no longer a tiebreaker. It is table stakes.
There is also a cultural piece the tech does not solve. Ethan Mollick at Wharton calls the people who quietly automated their jobs secret cyborgs. They use AI every day. They will not tell you. They are afraid of the rules. The fix is to make sharing safe and rewarded. Publish a clear policy, run a weekly internal lab, ask people to demo what they figured out. The best AI ideas in your company already exist. They are hiding.
2. Set the guardrails before the team plugs in
Compliance filters and data-handling rules, decided up front. Not after someone pastes a confidential deal memo into a public tool. Define what AI is allowed to see, what it is allowed to do, and where the off-switch lives.
Microsoft's internal Copilot rollout is a good template even if you do not use Microsoft. They built a zoned governance model: separate environments for experimentation, pilot, and production, with approval workflows in between. They lean on labels and data classification (Microsoft Purview in their case) so sensitive content never enters a Copilot prompt by accident. They identify champions, train in waves, and measure adoption before they go wide.
For tooling, three categories matter. Input and output filters like Lakera Guard and Guardrails AI catch sensitive data and policy violations before content moves in either direction. Agent guardrails like NVIDIA NeMo Guardrails restrict what an LLM is allowed to do at runtime. Policy frameworks like the NIST AI RMF or your industry equivalent give your security team a vocabulary to evaluate the rest.
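To show the shape of an input filter, here is a toy Python sketch that screens a prompt before it leaves the building. It does not use the Lakera Guard, Guardrails AI, or NeMo Guardrails APIs. The patterns, the blocked markers, and the function name are hypothetical, and a real deployment would rely on a proper data-classification service rather than two regular expressions.

```python
import re

# Hypothetical patterns for illustration only; real filters are far broader.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # 13 to 16 digits
}

# Hypothetical document labels that should never reach a public tool.
BLOCKED_MARKERS = ("CONFIDENTIAL", "INTERNAL ONLY")


def screen_prompt(text):
    """Return (safe_text, findings). safe_text is None when the prompt is blocked."""
    upper = text.upper()
    for marker in BLOCKED_MARKERS:
        if marker in upper:
            return None, [f"blocked:{marker}"]

    findings = []
    for label, pattern in PATTERNS.items():
        # Replace each match with a placeholder and count the replacements.
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            findings.append(f"redacted:{label}x{count}")
    return text, findings
```

For example, `screen_prompt("CONFIDENTIAL deal memo")` returns `None` as the safe text, so the caller can refuse to send anything, while a prompt containing an email address comes back with the address replaced by `[EMAIL]`.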
Simon Willison's framing helps here. He calls the dangerous combination the lethal trifecta: an agent that has access to private data, processes untrusted content, and can take external actions. Avoid giving any single agent all three at once and most of the catastrophic failure modes get smaller. Audit your agents against that test every quarter.
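The quarterly audit can start as something this small. The Python sketch below assumes you keep an inventory of agents with three yes/no capability flags. The `AgentProfile` fields and the example agents are hypothetical, and the check is only as honest as the inventory behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    name: str
    reads_private_data: bool
    processes_untrusted_content: bool
    takes_external_actions: bool


def has_lethal_trifecta(agent: AgentProfile) -> bool:
    """True when one agent combines all three risky capabilities."""
    return (
        agent.reads_private_data
        and agent.processes_untrusted_content
        and agent.takes_external_actions
    )


def audit(agents: list[AgentProfile]) -> list[str]:
    """Return the names of agents that need redesign or extra controls."""
    return [agent.name for agent in agents if has_lethal_trifecta(agent)]


# Hypothetical inventory for illustration.
inventory = [
    AgentProfile("inbox-triage", True, True, True),
    AgentProfile("kpi-digest", True, False, False),
    AgentProfile("web-research", False, True, False),
]
flagged = audit(inventory)  # only "inbox-triage" combines all three
```

An agent that gets flagged does not have to be shut down. Removing any one of the three capabilities, for example putting a human approval in front of external actions, takes it off the list.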
3. Build a verification step into every workflow
Hallucination happens. Your only defense is process. Every output that leaves the building gets a human checkpoint. The reviewer changes by use case: legal for contracts, an analyst for numbers, a designer for visuals. For high-stakes work, two reviewers.
This is also where eval tooling earns its keep. Anthropic's Building effective agents essay (December 2024) is worth reading carefully and giving to your engineering lead. The core advice is to start with simple prompts, build evaluations first, and only add multi-step agentic systems when simpler solutions fall short. They name five workflow patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. The last one is exactly the shape of automated verification, where one model checks the work of another in a loop.
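Here is a minimal Python sketch of the evaluator-optimizer shape, assuming the generator and the evaluator are plain functions you supply. In practice both would wrap model calls. The toy versions below are stand-ins so the loop runs on its own, and the `Review` type and the round limit are my assumptions, not something taken from the essay.

```python
from typing import Callable, NamedTuple


class Review(NamedTuple):
    approved: bool
    feedback: str


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Review],
    max_rounds: int = 3,
) -> tuple[str, int, bool]:
    """Loop generate -> evaluate until approved or max_rounds is reached.

    Returns (last_draft, rounds_used, approved).
    """
    feedback = ""
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        review = evaluate(task, draft)
        if review.approved:
            return draft, round_number, True
        feedback = review.feedback  # feed the critique into the next draft
    return draft, max_rounds, False


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_generate(task: str, feedback: str) -> str:
    draft = f"Draft for: {task}"
    if "source" in feedback:
        draft += " (source: internal policy doc)"
    return draft


def toy_evaluate(task: str, draft: str) -> Review:
    if "source:" in draft:
        return Review(True, "")
    return Review(False, "Add a source for every claim.")


result = evaluator_optimizer("refund policy summary", toy_generate, toy_evaluate)
```

When the loop ends without approval, the third value is `False`. That is the signal to route the draft to the human checkpoint rather than ship it.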
For tooling, three platforms cover most enterprise needs, plus one lightweight CLI for quick checks. Braintrust gives you a unified data model where production traces, offline experiments, and CI tests share the same SDK and scorers, which is what you want when prompts are real production code. LangSmith ships from the LangChain team and integrates tightly if you are already on LangChain or LangGraph. Langfuse is the open-source choice (MIT licensed), self-hostable, and SOC 2 and HIPAA compliant on the cloud version. Promptfoo is the lightweight CLI for fast prompt regression tests. Pick one platform, instrument every workflow, and set quality gates before you ship.
4. Pick three workflows. Ship those properly.
The biggest mistake I see is teams trying to AI-enable everything in month one. Pick three workflows where the time savings are obvious and the failure modes are recoverable. Drafting, summarizing, internal research, data cleanup. Once those run cleanly, expand.
The data on focused, well-instrumented rollouts is strong. The GitHub Copilot study found developers with Copilot completed a coding task 55.8 percent faster than the control group. Follow-on studies showed 85 percent of developers felt more confident in code quality and code reviews completed 15 percent faster. That is not magic. That is a single, well-defined workflow with a competent verification step (the developer reviews the code) and tight feedback loops.
The Anthropic research point applies twice. Start with a simple workflow. Wait until you have measurable wins before adding agentic complexity. A cron job that summarizes yesterday's support tickets and emails the team is not glamorous. It is also more useful than three months of agent demos.
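The ticket digest is small enough to sketch. This Python version assumes tickets arrive as dictionaries with a `subject` and a `created_at` datetime, and that `summarize` is whatever function wraps your model call. Both are hypothetical. Scheduling and email delivery are left out on purpose.

```python
from datetime import date, datetime, timedelta
from typing import Callable


def yesterdays_tickets(tickets: list[dict], today: date) -> list[dict]:
    """Keep only tickets created on the calendar day before `today`."""
    yesterday = today - timedelta(days=1)
    return [t for t in tickets if t["created_at"].date() == yesterday]


def build_digest(
    tickets: list[dict],
    today: date,
    summarize: Callable[[str], str],
) -> str:
    """Build the digest body; `summarize` stands in for a model call."""
    recent = yesterdays_tickets(tickets, today)
    if not recent:
        return "No support tickets were opened yesterday."
    subjects = "\n".join(f"- {t['subject']}" for t in recent)
    prompt = f"Summarize these support tickets in five bullets:\n{subjects}"
    return f"{len(recent)} tickets opened yesterday.\n\n{summarize(prompt)}"
```

Run it from cron or any scheduler the team already has. The human checkpoint here is light: the digest goes to the team, and the team reads it with the ticket list one click away.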
If you want a starting menu, here is what tends to work by department in my own portfolio.
- Marketing: first-draft brand-voice copy, social-post variants from a single brief, weekly content calendars, ad-creative iteration.
- Sales: meeting prep briefs from CRM data, proposal drafts, objection-response coaching, outbound personalization.
- Operations and finance: invoice and contract data extraction, weekly KPI digests, anomaly flagging in spend, vendor-onboarding paperwork.
- Real estate, which is most of my world: listing descriptions, comp analyses, multilingual client summaries, document review for due diligence.
- Product and engineering: PR and design-doc drafts, code review assist, release-notes generation, on-call ticket triage.
None of these replace a person. All of them remove forty to seventy percent of the typing involved in their owner's job.
5. Measure what changed
Hours saved per week. Drafts produced. Time from request to first version. Customer-facing CSAT before and after. Cost per ticket. Pipeline volume. If the metrics do not move, the system is theater.
This is where most rollouts quietly fail. McKinsey's 2025 data shows fundamental workflow redesign is the single behavior most correlated with EBIT impact. AI high performers are nearly three times as likely as everyone else to have fundamentally redesigned workflows: 55 percent versus roughly 20. Twenty-three percent of all companies are scaling agentic systems somewhere, another 39 percent are still experimenting, but the gap between "we use AI" and "AI moves our P&L" is almost entirely explained by whether the work itself was redesigned around the new tool.
Pick metrics your CFO would care about, not metrics your AI vendor would. Hours saved is fine for a single team, but on its own it is the AI version of vanity revenue. Pair it with throughput (tickets resolved, deals worked, posts published), with quality (CSAT, error rate, revision count), and with cost (cost per output, all-in including the model bill, the seat licenses, and the human review time). If two of those three numbers move in the right direction, you have something real. If only one moves, you have a demo.
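The two-of-three rule and the all-in cost are easy to write down as arithmetic. The Python sketch below is one reading of them, with assumptions stated plainly: "two of three" is treated as "at least two", higher is better for throughput and quality, lower is better for cost, and the field names and verdict labels are mine.

```python
from dataclasses import dataclass


def all_in_cost_per_output(
    model_bill: float,
    seat_licenses: float,
    review_hours: float,
    reviewer_hourly_rate: float,
    outputs: int,
) -> float:
    """All-in cost per output: model bill + seats + human review time."""
    if outputs <= 0:
        raise ValueError("outputs must be positive")
    total = model_bill + seat_licenses + review_hours * reviewer_hourly_rate
    return total / outputs


@dataclass(frozen=True)
class Snapshot:
    throughput: float       # e.g. tickets resolved per week (higher is better)
    quality: float          # e.g. CSAT (higher is better)
    cost_per_output: float  # all-in cost (lower is better)


def verdict(before: Snapshot, after: Snapshot) -> str:
    """Classify a rollout by how many of the three numbers improved."""
    improved = sum([
        after.throughput > before.throughput,
        after.quality > before.quality,
        after.cost_per_output < before.cost_per_output,
    ])
    if improved >= 2:
        return "real"
    if improved == 1:
        return "demo"
    return "theater"
```

As a worked example, a 1,200 model bill, 800 in seat licenses, and 40 review hours at 50 per hour across 1,000 outputs comes to (1,200 + 800 + 2,000) / 1,000 = 4.00 per output. The review time is half the cost, which is why leaving it out flatters the numbers.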
Andrej Karpathy's framing is useful here. He calls this Software 3.0: programming in English, with the LLM as a kind of operating system, the context window as RAM, and your prompts and tools as the program. If that is true, then workflow design is the architecture. Doing AI without redesigning the workflow is the same as buying a database and never restructuring your data.
Five mistakes I see almost every week
These are the patterns I have watched derail otherwise smart rollouts.
- Buying a platform before defining a workflow. Tools matter, but a tool with no defined job to do becomes a license sitting on someone's desk. Pick the workflow first.
- Hiring a "Head of AI" instead of redesigning a workflow. Titles do not move the P&L. Redesigned work does. McKinsey's 2025 data shows workflow redesign is the single behavior most correlated with EBIT impact.
- Letting every team pick their own model and tools. You will end up with five vector databases, three eval platforms, and no shared knowledge. One stack, one owner, exceptions in writing.
- Treating AI like a feature instead of a system. A demo is not a deployment. A deployment without monitoring, evals, and an owner is a demo that happens to be in production.
- Underestimating the change-management work. Mollick's "secret cyborgs" are using AI in your company right now. They will share their best ideas the day it stops feeling risky to do so. Make that day Tuesday.
A starter stack you can copy
If you want a concrete reference, here is a stack that gets a mid-size company to "AI is a real part of how we work" inside two quarters. None of these are paid placements. They are the tools I keep coming back to.
- Foundation models: Claude Sonnet 4.6 as the daily driver, Claude Opus 4.7 for the hardest agentic coding and reasoning (Anthropic), GPT-5.5 (OpenAI) for general utility, broad tool support, and long agentic runs, and Gemini 3.1 Pro (Google) for grounded search, 1M-token context, and native Workspace integration. Run them behind a router or proxy so you can swap as the leaderboard moves every few months.
- Knowledge and context: Glean or Mem0 for cross-app retrieval, Notion AI for team docs, MCP servers for the rest.
- Vector store: pgvector if you already run Postgres and have under ~50M vectors, Pinecone for fully managed scale, Weaviate when you need hybrid search.
- Orchestration: LangGraph for agent state machines, Temporal for durable long-running workflows, n8n for low-code business automation.
- Coding agents: Claude Code or Cursor for engineering teams, GitHub Copilot for embedded IDE use.
- Evaluation and observability: Braintrust, LangSmith, or Langfuse, picked once and used everywhere.
- Guardrails: Lakera Guard, Guardrails AI, or NeMo Guardrails on top of whatever model you serve.
This stack is not the answer. It is a starting point that will be wrong in 18 months. The discipline is having one, written down, owned by a person, and reviewed every quarter, instead of a different stack per team.
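The "router or proxy" in the foundation-models bullet can begin as a single lookup table. In this minimal Python sketch, the provider and model strings are placeholders, not real API identifiers, and the task types are hypothetical. A production proxy would also handle authentication, retries, and logging.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str
    model: str


# Placeholder names: swap in whatever your vendors actually call their models.
ROUTES = {
    "default": Route("provider-a", "daily-driver-model"),
    "hard_reasoning": Route("provider-a", "frontier-model"),
    "grounded_search": Route("provider-b", "search-grounded-model"),
}


def pick_route(task_type: str, routes: dict[str, Route] = ROUTES) -> Route:
    """Return the route for a task type, falling back to the default."""
    return routes.get(task_type, routes["default"])
```

Because every workflow asks the table instead of hard-coding a model name, swapping a model when the leaderboard moves is a one-line change in one file with one owner.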
What this gets you
Once the scaffolding exists, the 10-second outputs become real. Not slide-deck demos for the next board meeting. Working tools your team trusts and uses on a Tuesday morning to ship faster, with fewer revisions, and without the "wait, did the AI make this up?" panic.
The shift in 2026 is from chatbots to agents: software that holds context for hours, calls dozens of tools, and finishes a multi-step job without babysitting. That shift only pays off if the rails underneath (the data, the guardrails, the verification, the workflow redesign) are already in place. Agents amplify whatever system you have. If the system is sloppy, the agent ships sloppy faster.
I spend more time on scaffolding than on prompting. That tradeoff is the whole game. The boring system is what makes the exciting outcomes possible.
If you are a CEO thinking about AI implementation, build the rails first. The wins follow.