دسته: other

Visual understanding: Unlocking the next frontier in AIVisual understanding: Unlocking the next frontier in AI

Visual understanding: Unlocking the next frontier in AI

At the NYC AIAI Summit, Joseph Nelson, CEO & Co-Founder of Roboflow, took the stage to spotlight a critical but often overlooked frontier in AI: vision.

In a field dominated by breakthroughs in language models, Nelson argued that visual understanding – or how machines interpret the physical world – is just as essential for building intelligent systems that can operate in real-world conditions.

From powering instant replay at Wimbledon to enabling edge-based quality control in electric vehicle factories, his talk offered a grounded look at how visual AI is already transforming industries – and what it will take to make it truly robust, accessible, and ready for anything.

Roboflow now supports a million developers. Nelson walked through what some of them are building: real-world, production-level applications of visual AI across industries, open-source projects, and more. These examples show that visual understanding’s already being deployed at scale.

Three key themes in visual AI today

Nelson outlined three major points in his talk:

  1. The long tails of computer vision. In visual AI, long-tail edge cases are a critical constraint. These rare or unpredictable situations limit the ability of models, including large vision-language models, to fully understand the real world.
  2. What the future of visual models looks like. A central question is whether one model will eventually rule them all, or whether the future lies in a collection of smaller, purpose-built models. The answer will shape how machine learning is applied to visual tasks going forward.
  3. Running real-time visual AI at the edge. Nelson emphasized the importance of systems that run on your own data, in real-time, at the edge. This isn’t just a technical detail; it’s foundational to how visual AI will be used in the real world.

Where AI meets the real world: The role of computer vision

Joseph Nelson framed computer vision as the point where artificial intelligence intersects directly with the physical world. “At Roboflow, we think that computer vision is where AI meets the real world,” he explained.

He emphasized that vision is a primary human sense, predating even language, and pointed out that some civilizations thrived without a written system, relying instead on visual understanding. That same principle applies when building software systems: giving them vision is like giving them read access to the physical world.

This visual “read access” enables software to answer practical questions:

  • How many people are in a conference room?
  • Were a set of products manufactured correctly?
  • Did objects make their way from point A to point B?

Nelson illustrated this with a range of real-world examples:

  • Counting candies produced in a shop
  • Validating traffic flows and lane usage
  • Evaluating a basketball player’s shot
  • Measuring field control in a soccer match

Despite the diversity of these applications, the common thread is clear: each uses visual understanding to generate actionable insights.

How Roboflow powers visual AI at scale

Roboflow’s mission is to provide the tools, platform, and solutions that enable enterprises to build and deploy visual AI. According to Nelson, users typically approach Roboflow in one of two ways:

  • By creating open-source projects, much like publishing code on GitHub
  • By building private projects for internal or commercial use

On the open-source front, Roboflow has become the largest community of computer vision developers on the web. The scale is significant:

  • Over 500 million user-labeled and shared images
  • More than 200,000 pre-trained models available

This ecosystem provides Roboflow with a unique insight into how computer vision is being applied, where it encounters challenges, and where it delivers the most value.

Roboflow also serves a wide range of enterprise customers. Nelson shared that more than half of the Fortune 100 have built with Roboflow, especially in domains grounded in the physical world, such as:

  • Retail operations
  • Electric vehicle manufacturing
  • Product logistics and shipping

Backed by a strong network of investors, Roboflow continues to grow its platform and support the expanding needs of developers and businesses working at the intersection of AI and the real world.

AI-powered healthcare, with Archie Mayani [Video]
How AI is tackling healthcare supply chain disruptions—featuring GHX’s Archie Mayani on predictive tools, alternatives, and patient care.
Visual understanding: Unlocking the next frontier in AI

Enabling the future of visual AI

Joseph Nelson outlined Roboflow’s vision for the future of visual AI: empowering developers to build models tailored to the specific scenes and contexts they care about. For this future to materialize, he emphasized two critical factors:

  • Making it easy to build models that understand the right context
  • Making it just as easy to deploy those models into larger systems

“When you do that,” Nelson said, “you give your software the ability to have read access to the real world.”

Open source as the backbone of progress

Roboflow’s commitment to open source is a core part of that strategy. One of their flagship open-source tools is a package called Supervision, designed to help developers process detections and masks and integrate them into broader systems.

“We’re big believers in open source AI,” Nelson explained, “as a way by which the world understands everything this technology can do.”

To illustrate adoption, he compared Supervision’s usage with that of PyTorch Vision, a widely recognized set of tools in the machine learning community. “The purple line is Supervision. The red line is PyTorch Vision. And we’re proud that some of the tools that we make are now as well recognized across the industry.”

A growing ecosystem of visual AI tools

Beyond Supervision, Roboflow maintains a suite of open-source offerings, including:

  • Trackers for following objects across scenes
  • Notebooks for visual understanding tasks
  • Inference tools for deployment
  • Maestro, a package for fine-tuning domain-specific language models (DLMs)

Together, these tools aim to simplify the entire visual AI pipeline, from training and understanding to deployment and fine-tuning, anchored in a philosophy of accessibility and openness.

The range and reality of visual AI in action

Joseph Nelson’s perspective on visual AI is grounded in scale. Roboflow has visibility into hundreds of millions of images, hundreds of thousands of datasets, and hundreds of thousands of models. This massive footprint offers deep insight into how visual AI is actually being applied in the real world.

These applications span industries and creativity levels alike, ranging from aerial understanding to pill counting in pharmaceutical settings and defect detection during product manufacturing. “And a whole host of things in between,” Nelson added.

From silly to serious: A spectrum of use cases

Roboflow, as an open-source platform, supports a wide range of users: from hobbyists working on whimsical experiments to enterprises solving mission-critical problems.

Nelson highlighted this spectrum with a few vivid examples:

  • Dice roll trackers for D&D (Dungeons and Dragons): Systems that visually detect dice results during tabletop gaming sessions.
  • Roof damage detection: Insurance companies use drone imagery to assess hailstorm-related roof damage.
  • Vehicle production monitoring: Manufacturers inspect parts, such as those produced by stamping presses on car doors, to detect defects in real-time.

Flamethrowers and laser pointers: A creative approach

Among the more eccentric community creations:

  • Flamethrower weed-killing robots. Two different developers utilized Roboflow to design robots that identify and eliminate weeds using flamethrowers. “Why pull them out,” Nelson recounted, “when I can use a flamethrower to eliminate said weeds?”
  • Cat exercise machine. A developer, during COVID lockdowns, built an exercise tool for his cat by mounting a laser pointer on a robotic arm. A computer vision model tracked the cat’s position and aimed the laser ten feet away to keep the cat moving. “This one does come with a video,” Nelson joked, “and I’m sure it’s one of the most important assets you’ll get from this conference.”

These projects highlight accessibility. “A technology has really arrived,” Nelson noted, “when a hacker can take an idea and build something in an afternoon that otherwise could have seemed like a meme or a joke.”

Nelson continued with more creative and practical applications of Roboflow-powered models:

  • Drone-based asset detection. Models identify swimming pools and solar panels in aerial imagery, which is useful for projects like LEED (Leadership in Energy and Environmental Design) certification that measure the renewable energy contributions to buildings.
  • Zoom overlays via OBS. By integrating Roboflow with Open Broadcast Studio (OBS), users can overlay effects during video calls. For instance, specific hand gestures can trigger on-screen animations, like “Lenny,” Roboflow’s purple raccoon mascot, popping into view.
  • Rickroll detection and prevention. For April Fools’ Day, the Roboflow team built a browser-based model that detects Rick Astley’s face and automatically blacks out the screen and mutes the sound, preventing the infamous “Rickroll.” This demo uses Inference JS, a library that runs models entirely client-side in the browser. It even works offline.

While playful, this approach also has commercial applications, such as content moderation, page navigation, and web agent vision for interpreting and interacting with on-page elements.

Some of Roboflow’s early inspiration came from augmented reality. Nelson shared an example of an app that uses Roboflow tools to solve AR Sudoku puzzles. Here, the front end is AR, but the back end is computer vision, which detects puzzle edges, identifies number placements, and performs a breadth-first search to solve the board.

“It’s an example of using models in production in a fun way,” Nelson said.

How large language models are transforming pediatric healthcare
Discover how the NHS’s GOSH is using large language models and AI to transform pediatric healthcare, improving outcomes for rare disease treatments.
Visual understanding: Unlocking the next frontier in AI

From demos to deployment: Roboflow in the enterprise

On the enterprise side, Roboflow supports large-scale deployment at companies like Rivian, the electric vehicle manufacturer. Over the last three years, Rivian has used Roboflow to deploy approximately 300 models into production at its Normal, Illinois, general assembly facility.

The use cases are essential to ensuring product quality:

  • Verifying the correct number of screws in a battery
  • Detecting bolts left on the factory floor (which can cause vehicles to jam or collide)
  • Checking that paint is applied evenly and correctly

“There are hundreds and hundreds of checks that need to take place,” Nelson explained, “to ensure that a product is made correctly before it makes its way to an end user.”

Roboflow’s tools are designed not just for experimentation, but for deployment at scale. Joseph Nelson emphasized how companies like Rivian are taking advantage of this flexibility by pairing manufacturing engineers with technology teams. The result? Rapid iteration:

“I see a problem, I capture some data, I train the model, and I deploy it in an afternoon.”

This agility enables teams to solve a wide variety of small but critical manufacturing problems without waiting for long development cycles.

Self-checkout at scale: Visual AI at Walmart

Roboflow also powers instant checkout kiosks at Walmart. Describing the company playfully as an “up-and-coming retailer,” Nelson explained that Walmart’s kiosks use cameras to detect every item in a shopping cart as it enters a designated scanning zone. The system then generates a receipt automatically.

This challenge is far from trivial. Walmart stores carry hundreds of thousands of SKUs, making this a true test of model generalization and accuracy in real-world conditions. Despite the complexity, Nelson noted with pride that these systems have made it into production at multiple locations, with photo evidence to prove it.

Instant replay at major tennis events

Another high-profile deployment: instant replay at Wimbledon, the US Open, and the French Open (Roland Garros). Roboflow’s computer vision models are used to:

  • Detect the positions of tennis players and the ball
  • Map out the court in real time
  • Enable automated camera control and clip selection for ESPN+ replays

This use case reinforces the versatility of visual AI. Nelson even drew a humorous parallel:

“Remember that cat workout machine? Turns out it’s actually not so different from the same things that people enjoy at one of the world’s most popular tennis events.”

From hacks to healthcare: More unconventional use cases

Nelson wrapped this section with two more examples that demonstrate visual AI’s creative range, from civic life to scientific research.

  • Tennis court availability alerts. A hacker in San Francisco set up a camera on his windowsill to monitor public tennis courts. Using a custom-trained model, the system detects when the court is empty and sends him a text, so he can be the first to play.
  • Accelerating cancer research. At a university in the Netherlands, researchers are using computer vision to count neutrophils, a protein reaction observed after lab experiments. Previously, this required manual analysis by graduate students. Now, automation speeds up the review process, allowing faster iteration and contributing, albeit in a small way, to accelerating treatments for critical diseases.

What comes next for visual AI?

With a wide array of applications, ranging from retail to robotics to research, visual AI is already shaping various industries. As Joseph Nelson transitioned to the next part of his talk, he posed the central question:

“So what gives? What’s the holdup? What happens next, and where is the field going?”

That’s the conversation he takes on next.

As visual AI moves from research labs into production, it must contend with the messy, unpredictable nature of the real world. Joseph Nelson explained this challenge using a familiar concept: the normal distribution.

When it comes to visual tasks, the distribution is far from tidy. While some common categories, like people, vehicles, and food, fall within the high-density “center” of the curve (what Nelson calls “photos you might find on Instagram”), there’s a very long tail of edge cases that AI systems must also learn to handle.

These long tails include:

  • Hundreds of thousands of unique SKUs in retail environments like Walmart
  • Novel parts are produced in advanced manufacturing settings like Rivian’s facilities
  • Variable lighting conditions for something as niche as solving a Sudoku puzzle
AI and the future of international student outreach
How AI can help UK universities compete globally by streamlining admissions and supporting international students more effectively.
Visual understanding: Unlocking the next frontier in AI

Why visual data is uniquely complex

Nelson also emphasized a fundamental difference between visual and textual data: density and variability. While all of the text can be encoded compactly in Unicode, even a single image contains vastly more data, with RGB channels, lighting nuances, textures, and perspective all influencing interpretation.

“From a very first principles view, you can see that the amount of richness and context to understand in the visual world is vast.”

Because of this complexity, fully reliable zero-shot or multi-shot performance in visual AI remains aspirational. To achieve dependable results, teams must often rely on:

  • Multi-turn prompting
  • Fine-tuning
  • Domain adaptation

Put simply, Nelson said, “The real world’s messy.”

When small errors become big problems

The stakes for accuracy grow when visual AI is used in regulated or high-risk environments. Nelson shared an example from a telepharmacy customer using Roboflow to count pills.

This customer is accountable to regulators for correctly administering scheduled substances. A count that’s off by just one or two pills per batch may seem minor, but when that error scales across all patients, it creates major discrepancies in inventory.

Worse, Nelson noted a surprising discovery: telepharmacists with painted long nails occasionally caused the model to misclassify fingers as pills. This kind of visual anomaly wouldn’t appear in a clean training dataset, but it’s common in real-world deployments.

This example illustrates a critical principle in visual AI:

Real-world reliability requires robust systems trained on representative data and a readiness for edge cases that defy textbook distributions.

Why existing datasets aren’t enough

Joseph Nelson pointed to a key limitation in visual AI: many models are trained and evaluated on narrow, conventional datasets that don’t reflect the full diversity of the real world.

A common reference point is the COCO (Common Objects in Context) dataset, created by Microsoft between 2012 and 2014. It includes 80 familiar classes such as person, food, and dog. Nelson described this as data that falls “right up the center of the bell curve.”

But the world is far messier than that. To push the field forward, Nelson argued, we need more than common object detection; we need better evaluation frameworks that test whether models can adapt to novel domains.

Introducing RF100VL: A new benchmark for visual AI

To address this, Roboflow introduced RF100VL, short for Roboflow 100 Vision Language. While the name may not be flashy, the purpose is clear: provide a more accurate view of model performance in the real world.

RF100VL consists of:

  • 100 datasets
  • 100,000 images
  • Covering diverse domains, such as:
    • Microscopic and telescopic imagery
    • Underwater environments
    • Document analysis
    • Digital worlds
    • Industrial, medical, aerial, and sports contexts
    • Flora and fauna categories

This dataset reflects how developers and companies are actually using computer vision. Because Roboflow hosts the largest community of computer vision developers, RF100VL is built from real-world production data and maintained by the community itself.

“If COCO is a measurement of things you would commonly see,” Nelson said, “RF100VL is an evaluation of everything else, the messy parts of the real world.”

Testing zero-shot performance

Nelson highlighted the limitations of current models by sharing evaluation results. For example, the best-performing zero-shot model on RF100VL (Grounding DINO) achieved just 15.7% accuracy. That’s with no fine-tuning, meant to simulate a model’s raw ability to generalize across unseen tasks.

“Even the best models in the world today still lack the ability to do what’s called grounding and understanding of relatively basic things.”

RF100VL aims to fill that gap. It allows developers to hold large models accountable and measure whether they generalize well across domains.

Scaling like language models

So how can we improve visual understanding? Nelson pointed to the same approach used in language AI:

  • Pre-training at a large scale
  • Expanding datasets to include the long tail
  • Providing rich, diverse context for model adaptation

In the same way large language models improved through massive, inclusive training sets, visual AI must follow suit. Nelson emphasized that context is king in this process.

Tools for diagnosing model gaps

To make evaluation accessible, Roboflow launched visioncheckup.com, a playful but practical tool for visual model assessment. “It’s like taking your LLM to the optometrist,” Nelson said. The site simulates a vision test, showing where a model struggles and where it succeeds.

While some visual tasks, such as counting or visual question answering, still trip up models, Nelson noted that progress is accelerating. And he made his bet clear:

“I would bet on the future of open science.”

Why real-time performance at the edge matters

As the talk turned toward deployment, Nelson highlighted one major difference between language and vision models: where and how they run.

Language tasks often benefit from:

  • Cloud hosting
  • Access to extensive compute resources
  • Room for delayed, test-time reasoning (e.g., web lookups)

But visual systems frequently need to operate:

  • In real time
  • On edge devices
  • With no reliance on cloud latency or large compute pools

“Models need to run fast, real time, and often at the edge,” Nelson said.

This makes deployment requirements for visual AI fundamentally different and much more constrained than those for language-based systems.

Why edge performance matters

Joseph Nelson closed his talk by returning to a central constraint in visual AI: latency.

Unlike many language-based AI tasks, visual systems often have to operate under real-time conditions, especially in settings like live sports broadcasts. For example, Wimbledon doesn’t have the luxury of cloud processing, even with high-speed internet. Every frame must be processed live, with sub–10 millisecond latency.

This creates a pressing requirement: models that run efficiently on constrained hardware, without sacrificing accuracy.

“You need to have models that can perform in edge conditions… on smaller or constrained compute and run in real time.”

Edge deployment isn’t just for industrial hardware, like NVIDIA Jetsons. Nelson highlighted another vision: empowering individual developers to run models on their own machines, locally and independently.

“We can actually kind of power the rebels so that anyone can create, deploy, and build these systems.”

Introducing RF-DETR: Transformers at the edge

To address the challenge of bringing large-model performance to real-time use cases, Roboflow developed RF-DETR, short for Roboflow Detection Transformer. This model is designed to combine the contextual strength of transformers with the speed and deployability needed at the edge.

Nelson contrasted it with existing models, like the YOLO family, which are built around CNNs and optimized for speed. RF-DETR aims to bring the pre-training depth of transformers into the real-time performance zone.

“How do we take a large transformer and make it run in real time? That’s what we did.”

RF-DETR was benchmarked on both:

  • Microsoft COC for conventional object detection
  • RF100VL to measure real-world adaptability
Edge AI vs Cloud AI — which one’s better for business?
The differences between Edge AI and Cloud AI come into play primarily for machine learning and deep learning use cases.
Visual understanding: Unlocking the next frontier in AI

Bridging the gap to visual AGI

In wrapping up, Nelson tied together the major themes of his talk:

  • Better datasets (like RF100VL)
  • Better models (like RF-DETR)
  • Deployment flexibility, including on constrained and local hardware

Together, these advancements move us beyond the metaphor of “a brain in a jar.” Instead, Nelson described the vision for a true visual cortex, a key step toward real-world AI systems that can see, reason, and act.

“When we build with Roboflow… you’re a part of making sure that AI meets the real world and delivers on the promise of what we know is possible.”

Final thoughts

Joseph Nelson closed his talk at the NYC AIAI Summit with a clear message: for AI to meet the real world, it must see and understand it. That means building better datasets, such as RF100VL, creating models that generalize across messy, real-world domains, and ensuring those models can run in real-time, often at the edge.

From live sports broadcasts to pharmaceutical safety checks, and from open-source cat toys to advanced vehicle assembly lines, the breadth of visual AI’s impact is already vast. But the work is far from over. As Nelson put it, we’re still crossing the bridge, from large models as “brains in a jar” to intelligent systems with a working visual cortex.

By contributing to open-source tools, adapting models for deployment in the wild, and holding systems accountable through realistic evaluations, developers and researchers alike play a crucial role in advancing visual understanding. Roboflow’s mission is to support that effort, so that AI not only thinks, but sees.

How Agentic AI is transforming healthcare deliveryHow Agentic AI is transforming healthcare delivery

How Agentic AI is transforming healthcare delivery

In the Agents of Change podcast, host Anthony Witherspoon welcomes Archie Mayani, Chief Product Officer at GHX (Global Healthcare Exchange), to explore the vital role of artificial intelligence (AI) in healthcare.

GHX is a company that may not be visible to the average patient, but it plays a foundational role in ensuring healthcare systems operate efficiently. As Mayani describes it, GHX acts as “an invisible operating layer that helps hospitals get the right product at the right time, and most importantly, at the right cost.”

GHX’s mission is bold and clear: to enable affordable, quality healthcare for all. While the work may seem unglamorous, focused on infrastructure beneath the surface, it is, in Mayani’s words, “mission critical” to the healthcare system.

Pioneering AI in the healthcare supply chain

AI has always been integral to GHX’s operations, even before the term became a buzzword. Mayani points out that the company was one of the early adopters of technologies like Optical Character Recognition (OCR) within healthcare supply chains, long before such tools were formally labeled as AI.

This historical context underlines GHX’s longstanding commitment to innovation.

Now, with the rise of generative AI and agentic systems, the company’s use of AI has evolved significantly. These advancements are being harnessed for:

  • Predicting medical supply shortages
  • Enhancing contract negotiations for health systems
  • Improving communication between clinicians and supply chain teams using natural language interfaces

All of these tools are deployed in service of one goal: to provide value-based outcomes and affordable care to patients, especially where it’s needed most.

Building resilience into healthcare with “Resiliency AI”

GHX builds resilience. That’s the ethos behind their proprietary system, aptly named Resiliency AI. The technology isn’t just about automation or cost-savings; it’s about fortifying healthcare infrastructure so it can adapt and thrive in the face of change.

Mayani articulates this vision succinctly: “We are not just building tech for healthcare… we are building resilience into healthcare.”

Anthony, the podcast host, highlights a key point: AI’s impact in healthcare reaches far beyond business efficiency. It touches lives during their most vulnerable moments.

The episode highlights a refreshing narrative about AI: one not focused on threats or ethical concerns, but rather on how AI can be an instrument of positive, human-centered change.

The imperative of responsible AI in healthcare

One of the core themes explored in this episode of Agents of Change is the pressing importance of responsible AI; a topic gaining traction across industries, but particularly crucial in healthcare. Host Anthony sets the stage by highlighting how ethics and responsibility are non-negotiable in sectors where human lives are at stake.

Archie Mayani agrees wholeheartedly, emphasizing that in healthcare, the stakes for AI development are dramatically different compared to other industries. “If you’re building a dating app, a hallucination is a funny story,” Mayani quips. “But in [healthcare], it’s a lawsuit; or worse, a life lost.” His candid contrast underscores the life-critical nature of responsible AI design in the medical field.

Why responsible AI will need to be your new USP
We explore the benefits of AI governance, its key components, and best practices for minimal overhead and maximal trust.
How Agentic AI is transforming healthcare delivery

Transparency and grounding: The foundation of ethical AI

For GHX, building responsible AI begins with transparency and grounding. Mayani stresses that these principles are not abstract ideals, but operational necessities.

“Responsible AI isn’t optional in healthcare,” he states. It’s embedded in how GHX trains its AI models, especially those designed to predict the on-time delivery of surgical supplies, which are crucial for patient outcomes.

To ensure the highest level of reliability, GHX’s AI models are trained on a diverse range of data:

  • Trading partner data from providers
  • Fulfillment records
  • Supplier reliability statistics
  • Logistical delay metrics
  • Historical data from natural disasters

This comprehensive data approach allows GHX to build systems that not only optimize supply chain logistics but also anticipate and mitigate real-world disruptions, delivering tangible value to hospitals and, ultimately, patients.

Explainability is key: AI must justify its decisions like a clinician

One of the most compelling points Archie Mayani makes in the discussion is that AI must explain its logic with the clarity and accountability of a trained clinician. This is especially important when dealing with life-critical healthcare decisions. At GHX, every disruption prediction produced by their AI system is accompanied by a confidence score, a criticality ranking, and a clear trace of the data sources behind the insight.

“If you can’t explain it like a good clinician would, your AI model is not going to be as optimized or effective.”

This standard of explainability is what sets high-functioning healthcare AI apart. It’s not enough for a model to provide an output; it must articulate the “why” behind it in a way that builds trust and enables action from healthcare professionals.

Avoiding AI hallucinations in healthcare

Mayani also reflects on historical missteps in healthcare AI to highlight the importance of data diversity and governance. One case he references is early AI models for mammogram interpretation. These systems produced unreliable predictions because the training data lacked diversity across race, ethnicity, and socioeconomic background.

This led to models that “hallucinated”, not in the sense of whimsical errors, but with serious real-world implications. For example, differences in breast tissue density between African American and Caucasian women weren’t properly accounted for, leading to flawed diagnostic predictions.

To counteract this, GHX emphasizes:

  • Inclusive training datasets across demographic and physiological variables
  • Rigorous data governance frameworks
  • A learning mindset that adapts models based on real-world feedback and outcomes

This commitment helps ensure AI tools in healthcare are equitable, reliable, and aligned with patient realities, not just technical possibilities.

The conversation also touches on a universal truth in AI development: the outputs of any model are only as good as the inputs provided. As Anthony notes, AI doesn’t absolve humans of accountability. Instead, it reflects our biases and decisions.

“If an AI model has bias, often it’s reflective of our own societal bias. You can’t blame the model; it’s showing something about us.”

This reinforces a central thesis of the episode: Responsible AI begins with responsible humans; those who train, test, and deploy the models with intention, transparency, and care.

Earning confidence in AI-driven healthcare

As AI becomes more embedded in healthcare, public fear and discomfort are natural reactions, particularly when it comes to technologies that influence life-altering decisions. Anthony captures this sentiment, noting that any major innovation, especially in sensitive sectors like healthcare, inevitably raises concerns.

Archie Mayani agrees, emphasizing that fear can serve a constructive purpose. “You’re going to scale these agents and AI platforms to millions and billions of users,” he notes. “You better be sure about what you’re putting out there.” That fear, he adds, should drive greater diligence, bias mitigation, and responsibility in deployment.

The key to overcoming this fear? Transparency, communication, and a demonstrable commitment to ethical design. As Mayani and Anthony suggest, trust must be earned, not assumed. Building that trust involves both technical rigor and emotional intelligence to show stakeholders that AI can be both safe and valuable.

The challenge of scaling agentic AI in healthcare

With a strong foundation in ethical responsibility, the conversation shifts to a pressing concern: scaling agentic AI models in healthcare environments. These are AI systems capable of autonomous decision-making within predefined constraints, highly useful, but difficult to deploy consistently at scale.

Mayani draws an apt analogy: scaling agentic AI in healthcare is like introducing a new surgical technique.

“You have to prove it works, and then prove it works everywhere.”

This speaks to a fundamental truth in health tech: context matters. An AI model trained on datasets from the Mayo Clinic, for example, cannot be transplanted wholesale into a rural community hospital in Arkansas. The operational environments, patient demographics, staff workflows, and infrastructure are vastly different.

Key barriers to AI scalability in healthcare

  1. Contextual variability. Every healthcare setting is unique in terms of needs, infrastructure, and patient populations.
  2. Data localization. Models must be fine-tuned to reflect local realities, not just generalized benchmarks.
  3. Performance assurance. At scale, AI must remain accurate, explainable, and effective across all points of care.

For product leaders like Mayani, scale and monetization are the twin pressures of modern AI deployment. And in healthcare, the cost of getting it wrong is too high to ignore.

GHX’s resiliency center: A scalable AI solution in action

To illustrate how agentic AI can be successfully scaled in healthcare, Archie Mayani introduces one of GHX’s flagship products: Resiliency Center. This tool exemplifies how AI can predict and respond to supply chain disruptions at scale, offering evidence-based solutions in real time.

Resiliency Center is designed to:

  • Accurately categorize and predict potential disruptions in the healthcare supply chain
  • Recommend clinical product alternatives during those disruptions
  • Integrate seamlessly across dozens of ERP systems, even with catalog mismatches
  • Provide evidence-backed substitute products, such as alternatives to specific gloves or catheters likely to be back-ordered

These “near-neighborhood” product recommendations are not only clinically valid, but context-aware. This ensures that providers always have access to the right product, at the right time, at the right cost, a guiding principle for GHX.

“The definition of ‘right’ is really rooted in quality outcomes for the patient and providing access to affordable care, everywhere.”

This operational model is a clear example of scaling with purpose. It reflects Mayani’s earlier point: you can’t scale effectively without training on the right datasets and incorporating robust feedback loops to detect and resolve model inaccuracies.

Making sense of healthcare data

As the conversation shifts to the nature of healthcare data, Anthony raises a key issue: data fragmentation. In healthcare, data often exists in disconnected silos, across hospitals, systems, devices, and patient records, making it notoriously difficult to use at scale.

Mayani affirms that overcoming this fragmentation is essential for responsible and effective AI. The foundation of scalable, bias-free, and high-performance AI models lies in two critical pillars:

  1. Data diversity. AI systems must be trained on varied and inclusive datasets that reflect different patient populations, healthcare contexts, and operational environments.
  2. Data governance. There must be strict protocols in place to manage, verify, and ethically handle healthcare data. This includes everything from ensuring data integrity to setting up feedback mechanisms that refine models over time.

“All of that, scaling, performance, bias mitigation, it ultimately comes down to the diversity and governance of the data.”

This framing offers a critical insight for healthcare leaders and AI practitioners alike: data is the bedrock of trustworthy AI systems in medicine.

Crafting ethical AI: Addressing bias and challenges
We set out to understand how businesses address these ethical AI issues by surveying practitioners and end users.
How Agentic AI is transforming healthcare delivery

Why local context and diverse data matter in healthcare AI

One of the most illustrative examples of data diversity’s value came when GHX’s models flagged a surgical glove shortage in small rural hospitals, a disruption that wasn’t immediately visible in larger healthcare systems. Why?

  • Rural hospitals often have different reorder thresholds.
  • They typically lack buffer stock and have fewer supplier relationships compared to large Integrated Delivery Networks (IDNs).

This nuanced insight could only emerge from a truly diverse dataset. As Archie Mayani explains, if GHX had only trained its models using data from California, it might have overlooked entirely seasonal and regional challenges, like hurricanes in the Southeast or snowstorms in Minnesota, that affect supply chains differently.

“Healthcare isn’t a monolith. It’s a mosaic.”

That mosaic requires regionally relevant, context-sensitive data inputs to train agentic AI systems capable of functioning across a broad landscape of clinical settings.

Trust and data credibility: The often overlooked ingredient

Diversity in data is only part of the solution. Trust in data sources is equally critical. Archie points out a fundamental truth: not all datasets are equally valid. Some may be outdated, siloed, or disconnected from today’s realities. And when AI systems train on these flawed sources, their predictions suffer.

This is where GHX’s role as a trusted intermediary becomes essential. For over 25 years, GHX has served as a neutral and credible bridge between providers and suppliers, earning the trust required to curate, unify, and validate critical healthcare data.

“You need a trusted entity… not only for diverse datasets, but the most accurate, most reliable, most trusted datasets in the world.”

GHX facilitates cooperation across the entire healthcare data ecosystem, including:

  • Hospitals and providers
  • Medical suppliers and manufacturers
  • Electronic Medical Record (EMR) systems
  • Enterprise Resource Planning (ERP) platforms

This integrated ecosystem approach ensures the veracity of data and enables more accurate, bias-aware AI models.

Diversity and veracity: A dual mandate for scalable AI

Anthony aptly summarizes this insight as a two-pronged strategy: it’s not enough to have diverse datasets; you also need high-veracity data that’s trusted, updated, and contextually relevant. Mayani agrees, adding that agentic AI cannot function in isolation; it depends on a unified and collaborative network of stakeholders.

“It’s beyond a network. It’s an ecosystem.”

By connecting with EMRs, ERPs, and every link in the healthcare chain, GHX ensures its AI models are both informed by real-world variability and grounded in validated data sources.

From classical AI to agentic AI: A new era in healthcare

Archie Mayani makes an important distinction between classical AI and agentic AI in healthcare. For decades, classical AI and machine learning have supported clinical decision-making, especially in diagnostics and risk stratification. These systems helped:

  • Identify patients with complex comorbidities
  • Prioritize care for those most at risk
  • Power early diagnostic tools such as mammography screenings

“We’ve always leveraged classical AI in healthcare… but agentic AI is different.”

Unlike classical models that deliver discrete outputs, agentic AI focuses on workflows. It has the potential to abstract, automate, and optimize full processes, making it uniquely suited to address the growing pressures in modern healthcare.

Solving systemic challenges with agentic AI

Mayani highlights the crisis of capacity in today’s healthcare systems, particularly in the U.S.:

  • Staff shortages across both clinical and back-office roles
  • Rising operational costs
  • Fewer trained physicians are available on the floor

In this context, agentic AI emerges as a co-pilot. It supports overburdened staff by automating routine tasks, connecting data points, and offering intelligent recommendations that extend beyond the exam room.

One of the most compelling examples Mayani shares involves a patient with recurring asthma arriving at the emergency department. Traditionally, treatment would focus on the immediate clinical issue. But agentic AI can see the bigger picture:

  • It identifies that the patient lives near a pollution site
  • Notes missed follow-ups due to lack of transportation
  • Recognizes socioeconomic factors contributing to the recurring condition

With this information, the healthcare team can address the root cause, not just the symptom. This turns reactive treatment into proactive, preventative care, reducing waste and improving outcomes.

“Now you’re not treating a condition. You’re addressing a root cause.”

This approach is rooted in the Whole Person Care model, which Mayani recalls from his earlier career. While that model once relied on community health workers stitching together fragmented records, today’s agentic AI can do the same work; faster, more reliably, and at scale.

Agentic AI as a member of the care team

Ultimately, Mayani envisions agentic AI as a full-fledged member of the care team, one capable of:

  • Intervening earlier in a patient’s health journey
  • Coordinating care across departments and disciplines
  • Understanding and integrating the social determinants of health
  • Delivering on the promise of Whole Person Care

This marks a paradigm shift, from episodic, condition-focused care to integrated, data-driven, human-centered healing.

One of the most transformative promises of agentic AI in healthcare is its ability to identify root causes faster, significantly reducing both costs and systemic waste. As Anthony notes, the delay in getting to a solution often drives up costs unnecessarily, and Mayani agrees.

“Prevention is better than cure… and right now, as we are fighting costs and waste, it hasn’t been truer than any other time before.”

Agentic AI enables care teams to move from reactive service delivery to proactive problem-solving, aligning healthcare with long-promised, but rarely achieved, goals like holistic and whole-person care. The way Mayani describes it, this is now a practical, scalable reality.

COVID-19: A catalyst for AI innovation in supply chain resilience

Looking back at the COVID-19 pandemic, Mayani reflects on one of the biggest shocks to modern healthcare: supply chain collapse. It wasn’t due to a lack of data; healthcare generates 4x more data than most industries. The failure was one of foresight and preparedness.

“The supply chain broke not because we didn’t have the data, but because we didn’t have the foresight.”

This crisis has become a compelling event that has accelerated innovation. GHX’s own AI-driven Resiliency Center now includes early versions of systems that can:

  • Detect high-risk items like ventilator filters at risk of shortage
  • Recommend five clinically approved alternatives, sorted by cost, delivery time, and supplier reliability
  • Provide real-time, evidence-based recommendations across a multi-stakeholder ecosystem

Mayani likens this transformation to going from a smoke detector to a sprinkler system; not just identifying the problem, but acting swiftly to stop it before it spreads.

How AI is redefining cyber attack and defense strategies
As AI reshapes every aspect of digital infrastructure, cybersecurity is the most critical battleground where AI serves as both weapon and shield.
How Agentic AI is transforming healthcare delivery

Learning from crisis: Building a proactive future

COVID-19 may have been an unprecedented tragedy, but it forced healthcare organizations to centralize data, embrace cloud infrastructure, and accelerate digital transformation.

Before 2020, many health systems were still debating whether mission-critical platforms should move to the cloud. Post-crisis, the conversation shifted from adoption to acceleration, opening the door to advanced technologies like AI and GenAI.

“Necessity leads to innovation,” as Anthony puts it, and Mayani agrees.

The result is a more resilient, more responsive healthcare system, better equipped to navigate future challenges, from pandemics to geopolitical shifts to tariff policy changes. GHX now plays a pivotal role in helping suppliers and providers understand and act on these evolving variables through data visibility and decision-making intelligence.

AI hallucinations in healthcare

While agentic AI offers powerful capabilities, hallucinations remain a significant risk, particularly in healthcare, where errors can have devastating consequences. Archie Mayani openly acknowledges this challenge: even with high-quality, diverse, and rigorously governed datasets, hallucinations can still occur.

Drawing from his early work with diagnostic models for lung nodules and breast cancer detection, Mayani explains that hallucinations often stem from data density issues or incomplete contextual awareness. These can lead to outcomes like:

  • Recommending a nonexistent or back-ordered medical supply erodes trust
  • Incorrectly suggesting a serious diagnosis, such as early breast cancer, to a healthy individual

Both are catastrophic in their own way, and both highlight the need for fail-safes and human oversight.

GHX’s guardrails against AI errors

To mitigate these risks, GHX employs a multi-layered approach:

  • Validated data only. Models pull from active, verified medical supply catalogs.
  • Human-in-the-loop systems. AI makes predictions, but a human still approves the final decision.
  • Shadow mode training. New models run in parallel with human-led processes until they reach high reliability.
  • “Residency training” analogy. Mayani likens early AI models to junior doctors under supervision; they’re not allowed to operate independently until they’ve proven their accuracy.

This framework ensures that AI earns trust through performance, reliability, and responsibility, not just promises.

The future of agentic AI

When asked to predict the future of agentic AI in healthcare, Mayani presents a powerful vision: a world where AI becomes invisible.

“When AI disappears… that’s when we’ve truly won.”

He envisions a future where AI agents across systems, such as GHX’s Resiliency AI and hospital EMRs, communicate autonomously. A nurse, for instance, receives necessary supplies without ever placing an order, because the agents already anticipated the need based on scheduled procedures and clinical preferences.

Indicators of AI maturity

  • Care is coordinated automatically without juggling apps or administrative steps.
  • Agentic AI enables seamless, behind-the-scenes action, optimizing outcomes while removing friction.
  • Patients receive timely, affordable, personalized care, not because someone made a phone call, but because the system understood and acted.

This is the true potential of agentic AI: not to dazzle us with flashy features, but to blend so naturally into the work that it disappears.

The normalization of AI

As AI becomes more embedded in daily life, public perception is shifting from fear to discovery, and now, toward normalization. As Mayani and Anthony discuss, many people already use AI daily (in smartphones, reminders, and apps) without even realizing it.

The goal is for agentic AI to follow the same path: to support people, not replace them; to augment creativity, not suppress it; and to enable higher-order problem-solving by removing repetitive, predictable tasks.

“It’s never about the agents taking over the world. They are here so that we can do the higher-order bits.”

The path forward for Agentic AI in healthcare

The future of healthcare lies not in whether AI will be used but how. And leaders like Archie Mayani at GHX are laying the foundation for AI that is ethical, explainable, resilient, and invisible.

From predicting disruptions and recommending evidence-based alternatives to coordinating care and addressing root causes, agentic AI is already reshaping how we deliver and experience healthcare.

The next chapter is about when it quietly steps into the background, empowering humans to do what they do best: care.

How AI is redefining cyber attack and defense strategiesHow AI is redefining cyber attack and defense strategies

How AI is redefining  cyber attack and defense strategies

As AI reshapes every aspect of digital infrastructure, cybersecurity has emerged as the most critical battleground where AI serves as both weapon and shield.

The cybersecurity landscape in 2025 represents an unprecedented escalation in technological warfare, where the same AI capabilities that enhance organizational defenses are simultaneously being weaponized by malicious actors to create more sophisticated, automated, and evasive attacks.

The stakes have never been higher. Recent data from the CFO reveals that 87% of global organizations faced AI-powered cyberattacks in the past year, while the AI cybersecurity market is projected to reach $82.56 billion by 2029, growing at a compound annual growth rate of 28% .

This explosive growth reflects not just market opportunity, but an urgent response to threats that are evolving faster than traditional security measures can adapt.

Part 1: Adversaries in the age of AI

Cyber adversaries have found a powerful new weapon in AI, and they’re using it to rewrite the offensive playbook. The game has changed, with attacks now defined by automated deception, hyper-realistic social engineering, and intelligent malware that thinks for itself.

The industrialization of deception

The old security advice – “spot the typo, spot the scam” – is officially dead. Generative AI now crafts flawless, hyper-personalized phishing emails, texts, and voice messages that are devastatingly effective.

The numbers tell a chilling story: AI-generated phishing emails boast a 54% click-through rate, dwarfing the 12% from human-written messages. Meanwhile, an estimated 80% of voice phishing (vishing) attacks now use AI to clone voices, making it nearly impossible to trust your own ears.

Transforming industries and everyday life with AI applications
This article will cover a couple of highlights in which despite its risks and challenges, AI is transforming society.
How AI is redefining  cyber attack and defense strategies

This danger is not theoretical. Consider the Hong Kong finance employee who, in 2024, was tricked into transferring $25 million after a video conference where every single participant, including the company’s CFO, was an AI-generated deepfake.

In another cunning campaign, a threat group dubbed UNC6032 built fake websites mimicking popular AI video generators, luring creators into downloading malware instead of trying a new tool. The result is the democratization of sophisticated attacks. Tools once reserved for nation-states are now in the hands of common cybercriminals, who can launch convincing, scalable campaigns with minimal effort.

Malware that thinks for itself

The threat extends beyond tricking humans to the malicious code itself. Attackers are unleashing polymorphic and metamorphic malware that uses AI to constantly change its own structure, making it a moving target for traditional signature-based defenses.

The BlackMatter ransomware, for example, uses AI to perform live analysis of a victim’s security tools and then adapts its encryption strategy on the fly to bypass them.

On the horizon, things look even more concerning. Researchers have already designed a conceptual AI-powered worm, “Morris II,” that can spread autonomously from one AI system to another by hiding malicious instructions in the data they process.

At the same time, AI is automating the grunt work of hacking. AI agents, trained with Deep Reinforcement Learning (DRL), can now autonomously probe networks, find vulnerabilities, and launch exploits, effectively replacing the need for a skilled human hacker.

Part 2: Fighting fire with fire: AI on cyber defense

But the defense is not standing still. A counter-revolution is underway, with security teams turning AI into a powerful force multiplier. The strategy is shifting from reacting to breaches to proactively predicting and neutralizing threats at machine speed.

Seeing attacks before they happen

The core advantage of defensive AI is its ability to process data at a scale and speed no human team can match. Instead of just looking for known threats, AI-powered systems create a baseline of normal behavior across a network and then hunt for tiny deviations that signal a hidden compromise.

This is how modern defenses catch novel, zero-day attacks. The most advanced systems are even moving from detection to prediction. By analyzing everything from global attack trends to dark web chatter, and new vulnerabilities, AI models can forecast where the next attack wave will hit, allowing organizations to patch vulnerabilities before they’re ever targeted.

Transforming cybersecurity with AI
Discover how AI is transforming cybersecurity from both a defensive and adversarial perspective, featuring Palo Alto Networks’ CPO Lee Klarich.
How AI is redefining  cyber attack and defense strategies

Your newest teammate is an AI

The traditional Security Operations Center (SOC) – a room full of analysts drowning in a sea of alerts is becoming obsolete. In its place, the AI-driven SOC is rising, where AI automates the noise so humans can focus on what matters.

AI now handles alert triage, enriches incident data, and filters out the false positives that cause analyst burnout. We’re now seeing AI “agents” and “copilots” from vendors like Microsoft, CrowdStrike, and SentinelOne that act as true partners to security teams.

These AI assistants can autonomously investigate a phishing email, test its attachments in a sandbox, and quarantine every copy from the enterprise in seconds, all while keeping a human in the loop for the final say. This is more than an efficiency gain; it’s a strategic answer to the massive global shortage of cybersecurity talent.

Making zero trust a reality

AI is also the key to making the “never trust, always verify” principle of the Zero Trust security model a practical reality. Instead of static rules, AI enables dynamic, context-aware access controls.

It makes real-time decisions based on user behavior, device health, and data sensitivity, granting only the minimum privilege needed for the task at hand. This is especially vital for containing the new risks from the powerful but fundamentally naive AI agents that are beginning to roam corporate networks.

Part 3: The unseen battlefield: Securing the AI itself

For all the talk about using AI for security, we’re overlooking a more fundamental front in this war: securing the AI systems themselves. For the AIAI community – the architects of this technology – understanding these novel risks is not an option, it’s an operational imperative.

How AI can be corrupted

Machine learning models have an Achilles’ heel. Adversarial attacks exploit it by making tiny, often human-imperceptible changes to input data that cause a model to make a catastrophic error.

Think of a sticker that makes a self-driving car’s vision system misread a stop sign, or a slight tweak to a malware file that renders it invisible to an AI-powered antivirus. Data poisoning is even more sinister, as it involves corrupting a model’s training data to embed backdoors or simply degrade its performance.

A tool called “Nightshade” already allows artists to “poison” their online images, causing the AI models that scrape them for training to malfunction in bizarre ways.

Mastering data clustering: Your guide to K-means & K-means++
K-means clustering is an unsupervised machine learning algorithm used for clustering or grouping similar data points together in a dataset.
How AI is redefining  cyber attack and defense strategies

The danger of autonomous agents

With agentic AI, autonomous systems that can reason, remember, and use tools – the stakes get much higher. An AI agent is the perfect “overprivileged and naive” insider.

It’s handed the keys to the kingdom – credentials, API access, permissions – but has no common sense, loyalty, or understanding of malicious intent. An attacker who can influence this agent has effectively recruited a powerful insider. This opens the door to new threats like:

  • Memory poisoning: Subtly feeding an agent bad information over time to corrupt its future decisions.
  • Tool misuse: Tricking an agent into using its legitimate tools for malicious ends, like making an API call to steal customer data.
  • Privilege compromise: Hijacking an agent to exploit its permissions and move deeper into a network.

The need for AI red teams

Because AI vulnerabilities are so unpredictable, traditional testing methods fall short. The only way to find these flaws before an attacker does is through AI red teaming: the practice of simulating adversarial attacks to stress-test a system.

This is not a standard penetration test; it’s a specialized hunt for AI-specific weaknesses like prompt injections, data poisoning, and model theft. It’s a continuous process, essential for discovering the unknown unknowns in these complex, non-deterministic systems.

What’s next?

The AI revolution in cybersecurity is both the best thing that’s happened to security teams and the scariest development we’ve seen in decades.

With 73% of enterprises experiencing AI-related security incidents averaging $4.8 million per breach, and deepfake incidents surging 19% just in the first quarter of this year, the urgency couldn’t be clearer. This isn’t a future problem – it’s happening right now.

The organizations that will survive and thrive are those that can master the balance. They’re using AI to enhance their defenses while simultaneously protecting themselves from AI-powered attacks. They’re investing in both technology and governance, automation and human expertise.

The algorithmic arms race is here. Victory will not go to the side with the most algorithms, but to the one that wields them with superior strategy, foresight, and a deep understanding of the human element at the center of it all.

AI and the future of international student outreachAI and the future of international student outreach

AI and the future of international student outreach

My daily work in the EdTech industry consists of constant back-and-forth comparison between the United Kingdom’s admissions machine and the digital experiences offered at higher education systems elsewhere.

While UK universities debate workflow changes, universities in competing nations are plugging mass-scale AI systems directly into recruitment and immigration processes.

Unless we get similar equipment to work for us, ethically and at sector scale, the recent drop in foreign applications might be the beginning of a longer fall.

Global competition is accelerating

In March 2024, Reuters wrote that Microsoft and OpenAI are mulling a $100 billion U.S. super-computer project called Stargate to train the next generation of language models. Faster models translate to richer, more personalized student-facing services: anything from adaptive test preparation to multilingual visa counseling.

On the demand side, UNESCO’s 2024/5 Global Education Monitoring Report suggests that cross-border tertiary enrolments will rise by some two million seats by 2030, driven by South Asia and Sub-Saharan Africa most of all.

Potential students in these nations do much of their research online and respond quickly to chat-based counsel. It’s the perfect setting for today’s language models.

UK Government prioritizes AI for economic growth and services
The UK places AI at the center of its strategy for economic growth and improved public services, led by Science Secretary Peter Kyle.
AI and the future of international student outreach

Home-grown headwinds

The UK, on the other hand, has done the reverse. From 1 January 2024, the majority of international students will have the right to bring dependents with them, as a Home Office press release confirmed. And on 12 May 2025, the immigration white paper of the government laid out to cut Graduate-Route work rights to 18 months from two years (White Paper). Early figures show the impact is immediate: Universities UK reports a 44 percent decrease in January 2024 postgraduate-taught enrollments compared to the previous year.

The applicant’s maze

Policy changes contribute to an already fragmented process: multiple document portals, disproportionate English-language regulations, and inconsistent turnaround times.

When there are delays in official replies, students congregate in WhatsApp or Telegram groups, where misinformation and downright fraud spread rapidly. I’ve spoken with families who paid unlicensed agents just to upload PDFs that the university should have accepted for free.

Every lost or intimidated candidate blemishes the UK’s reputation for educational transparency.

Why AI now makes practical sense

Big language models at last give the missing layer of always-present, policy-aware advice. A model trained on the UKVI rule-set, UCAS codes, CAS logic, and institutional rules can answer subtle questions in dozens of languages, urge applicants if a bank letter or tuberculosis certificate is missing, pre-screen document quality before human officers waste time, and even emit rule changes the moment the Home Office updates its guidance.

All of that can happen in seconds, at any hour, without building yet another portal.

Lessons from building an admissions companion

We tested out an AI companion as an internal pilot during the past two years with my colleagues. We grounded the model on nothing but formal reports: Home-Office sponsor advice, university policy PDFs, anonymized email templates.

There are three things that we found most interesting:

  1. Context before horsepower. Precision was enhanced more by selecting clean source material than by changing to the latest model.
  2. Transparency fosters trust. Showing the precise paragraph a response is from, with a live link, cuts down on follow-up questions more than any tone adjustment.
    Staff want relief, not replacement. When admissions staff created lists of work they would offload first, “file-name formatting” and “duplicate-document emails” were first; no one feared for the loss of the human touch of their work.

Designing for interoperability and privacy

Any production-ready solution must read and write to installed CRMs and student-record systems using secure APIs, adhering to GDPR and local data-classification guidelines.

That integration approach avoids the creation of a standalone “AI portal” and maintains human agency: the model reports uncertainty, triages hard cases, and trains continuously from staff edits.

UK’s AI supercomputer to transform drug development
See how the UK’s £225m Isambard-AI supercomputer is set to revolutionize drug and vaccine development using cutting-edge AI technology.
AI and the future of international student outreach

Expected impact

Increased industry uptake will not remake immigration policy, but it can close the service gap currently sending applicants to competing destinations.

Small efficiency boosts translate into days shaved off offer release, percentage-point decreases in visa refusals, fewer out-of-hours queries: total millions of pounds of saved fee revenue and, more importantly, into goodwill among students perceiving the UK as responsive, not bureaucratic.

A collaborative path forward

Government, sector groups, and suppliers would collectively pay for a shared knowledge base for visa regulations, credential-evaluation standards, and regulatory reporting.

Local layers would then be adapted by individual universities without replicating the regulatory core. This kind of collaboration would mirror the infrastructure-first approach behind the U.S. Stargate initiative and enable the UK to continue to be a destination of choice for global talent.

Conclusion

Global recruitment is no longer a battle of shiny pamphlets but one of latency, language coverage, and policy correctness. AI will not undo discriminatory visa policies, but it will ensure that when opportunities are available, deserving students are not lost in bureaucratic fog.

For a sector generating more than £40 billion a year in exports, the deployment of ethical, interoperable AI is less an experiment and more a prudent maintenance of the UK’s competitive edge.

How large language models are transforming pediatric healthcareHow large language models are transforming pediatric healthcare

How large language models are transforming pediatric healthcare

What if artificial intelligence could help us solve some of the most complex challenges in pediatric healthcare, especially when it comes to rare diseases? 

At Great Ormond Street Hospital (GOSH), we face these challenges daily, treating children with some of the most difficult and rare conditions imaginable. But as powerful as human expertise is, we often find ourselves dealing with an overwhelming amount of data, from patient histories to diagnostic reports, making it hard to extract the insights we need quickly and efficiently.

This is where artificial intelligence and machine learning come in. These technologies have the potential to revolutionize the way we process and utilize healthcare data. At GOSH, we’re leveraging AI, particularly large language models (LLMs), to tackle the complexity of this data and improve patient outcomes.

In this article, I’ll share insights from our journey of integrating AI into pediatric healthcare at GOSH and how AI is helping us improve care, streamline operations, and make healthcare more accessible for children with rare diseases.

Let’s dive in.

The role of GOSH’s DRIVE unit

In 2018, we established the DRIVE unit, which stands for Data, Research, Innovation, and Virtual Environment. Our goal? To harness data and technology to improve outcomes for children, families, and our healthcare staff. 

We want to make GOSH the global go-to center for pediatric innovation, and we aim to do this by utilizing AI and data to drive breakthroughs in treatment, diagnosis, and patient care.

Our mission goes beyond merely innovating for the sake of it; we want to use AI to make an impact not just locally but globally. The data we collect is especially valuable for research, particularly in the realm of rare diseases – conditions that often don’t receive enough attention due to their rarity. 

But how do we make sense of this relatively small dataset, and how can we share this knowledge globally? That’s one of the questions we’ve been working to answer with the help of AI and ML (machine learning).

Transforming Healthcare with AI: Interview series
Our interview series is here to deliver you digestible intelligence from the organizations and innovators leading the world of AI in healthcare – through expert and in-depth interviews.
How large language models are transforming pediatric healthcare

Harnessing data for research and operational efficiencies

In terms of data management, GOSH has undergone a massive overhaul in the past few years. Before 2019, we were using over 400 different systems for collecting patient data. As you can imagine, this was both inefficient and hard to maintain.

That’s when we made the strategic decision to replace our outdated systems with a single platform – EPIC. This transition has allowed us to integrate all patient data into a unified electronic health record system.

How large language models are transforming pediatric healthcare

Humans in the loop: How leading companies are building practical, trustworthy AIHumans in the loop: How leading companies are building practical, trustworthy AI

Humans in the loop:  How leading companies are building practical, trustworthy AI

At the NYC Generative AI Summit, experts from Wayfair, Morgan & Morgan, and Prolific came together to explore one of AI’s most pressing questions: how do we balance the power of automation with the necessity of human judgment? 

From enhancing customer service at scale to navigating the complexity of legal workflows and optimizing human data pipelines, the panelists shared real-world insights into deploying AI responsibly. In a field moving at breakneck speed, this discussion was an opportunity to examine how we can build AI systems that are effective, ethical, and enduring.

From support to infrastructure: Evolving with generative AI

Generative AI is reshaping industries at a pace few could have predicted. And at Wayfair, that pace is playing out in real time. Vaidya Chandrasekhar, who leads pricing, competitive intelligence, and catalog ML algorithms at the company, shared how their approach to generative AI has grown from practical customer support tools to foundational infrastructure transformation.

Early experiments started with agent assistance, particularly in customer service. These included summarizing issue histories and providing real-time support to customer-facing teams, the kind of use cases many companies have used as a generative AI entry point.

From there, Wayfair moved into more technical territory. One significant area has been technology transformation: shifting from traditional SQL stored procedures toward more dynamic systems. 

“We’ve been asking questions like, ‘if you’re selecting specific data points and trying to understand your data model’s ontology, what would that look like as a GraphQL query?’” Vaidya explained. While not all scenarios fit the model, roughly 60-70% of use cases have proven viable.

Perhaps the most transformative application is in catalog enrichment, which is at the core of Wayfair’s operations. Generative AI is being used to enhance and accelerate how product data is organized and surfaced. And in such a fast-moving environment, agility is key. 

“Just this morning, we were speaking with our CEO. What the plan was two months ago is already shifting,” Vaidya noted. “We’re constantly adapting to keep pace with what’s possible.”

The company is firmly positioned at the edge of change, continuously testing how emerging tools can bring efficiency, clarity, and value to both internal workflows and customer experiences.

Building AI inside the nation’s largest injury law firm

When most people think of personal injury law firms, they don’t picture teams of software engineers writing AI tools. But that’s exactly what’s happening at Morgan & Morgan, the largest injury law firm in the United States. 

Paras Chaudhary, Software Engineering Lead at Morgan & Morgan, often gets surprised reactions when he explains what he does. “They wonder what engineers are even doing there,” he said.

The answer? Quite a lot, and increasingly, that work involves generative AI.

Law firms, by nature, have traditionally been slow to adopt new technologies. The legal profession values precedent, structure, and methodical processes, qualities that don’t always pair easily with the fast-evolving world of AI. 

But Morgan & Morgan is taking a different approach. With the resources to invest in an internal engineering team, they’re working to lead the charge in legal AI adoption.

The focus isn’t on replacing lawyers, but empowering them. “I hate the narrative that AI will replace people,” Paras emphasized. “What we’re doing is building tools that make attorneys’ lives easier: tools that help them do more, and do it better.”

Of course, introducing new technology into a non-technical culture comes with its own challenges. Getting attorneys, many of whom have been doing things the same way for a decade or more, to adopt unfamiliar tools isn’t always easy. 

“It’s been an uphill battle,” he admitted. “Engineering in a non-tech firm is hard enough. When your users are lawyers who love their ways of doing things, it’s even tougher.”

Despite the resistance, the team has had measurable success in deploying generative AI internally. And equally important, they’ve learned from their failures. The journey has been anything but flashy, but it’s quietly reshaping how legal work can be done at scale.

Why agentic AI pilots fail and how to scale safely
Why agentic AI pilots stall, what causes failure, and how enterprises can scale AI safely with strong governance and access controls.
Humans in the loop:  How leading companies are building practical, trustworthy AI

Human data’s evolving role in AI: From volume to precision

While much of the conversation around generative AI focuses on model architecture and compute power, Sara Saab, VP of Product at Prolific, brought a vital perspective to the panel: the role of human data in shaping AI systems. Prolific positions itself as a human data platform, providing human-in-the-loop workflows at various stages of model development, from training to post-deployment.

“This topic is really close to my heart,” Sara shared, reflecting on how drastically the human data landscape has shifted over the last few years.

Back in the early days of ChatGPT, large datasets were the core currency. “There was an arms race,” she explained, “where value was all about having access to massive amounts of training data.” 

But in today’s AI development pipeline, that’s no longer the case. Many of those large datasets have been distilled down, commoditized, or replaced by open-source alternatives used for benchmarking.

The industry’s focus has since shifted. In 2023 and into 2024, efforts moved toward fine-tuning, both supervised and unsupervised, and the rise of retrieval-augmented generation (RAG) approaches. Human feedback became central through techniques like RLHF (reinforcement learning from human feedback), though even those methods have begun to evolve.

“AI is very much a white-paper-driven industry,” Sara noted. “Every time a new paper drops, everyone starts doing everything differently.” Innovations like reinforcement learning via rejection sampling or variational reward shaping (RLVR) began to reduce the need for heavy fine-tuning, at least on the surface. But peel back the layers, she argued, and humans are still deeply embedded in the loop.

Today, the emphasis is increasingly on precise, expert-curated datasets, the kind that underpin synthetic data generators, oracle solvers, and other sophisticated human-machine orchestrations. These systems are emerging as critical to the next generation of model training and evaluation.

At the same time, foundational concerns around alignment, trust, and safety are rising to the surface. Who defines the benchmarks on which models are evaluated? Who assures their quality?

“We look at leaderboards with a lot of interest,” Sara said. “But we also ask: who’s behind those benchmarks, and what are we actually optimizing for?”

It’s a timely reminder that while the tooling and terminology may shift rapidly, the human element, in all its philosophical, ethical, and practical complexity, remains central to the future of AI.

LLM economics: How to avoid costly pitfalls
Avoid costly LLM pitfalls: Learn how token pricing, scaling costs, and strategic prompt engineering impact AI expenses—and how to save.
Humans in the loop:  How leading companies are building practical, trustworthy AI

Human oversight in AI: The power of boring integration

As large language models (LLMs) become increasingly capable, the question of human oversight becomes more complex. How do you keep humans meaningfully in the loop when models are doing more of the heavy lifting, and doing it well? 

For Paras, the answer isn’t flashy tools or complex interfaces. It’s simplicity, even if that means embracing the boring.

“Our workflows aren’t fancy because they didn’t need to be,” he explained. “Most of the human-in-the-loop flow at the firm is based on approval mechanisms. When the model extracts ultra-critical information, a human reviews it to confirm whether it makes sense or not.”

To drive adoption among lawyers, notoriously resistant to change, Paras applied what he calls the “radio sandwiching” approach. “Radio stations introduce new songs by sandwiching them between two tracks you already like. That way, the new stuff feels familiar and your alarms don’t go off,” he said. “That’s what we had to do. We disguised the cool AI stuff as the boring workflows people already knew.”

At Morgan & Morgan, that meant integrating AI into the firm’s existing Salesforce infrastructure, not building new tools or expecting users to learn new platforms. “All our attorney workflows are based in Salesforce,” Paras explained. “So we piped our AI outputs right into Salesforce, whether it was case data or something else. That was the only way to get meaningful adoption.”

When asked if this made Salesforce an annotation platform, Paras didn’t hesitate. “Exactly. It works. Do what you have to do. Don’t get stuck on whether it looks sexy. That’s not the point.”

Vaidya Chandrasekhar, who leads pricing and ML at Wayfair, echoed the sentiment. “I agree with a lot of what Paras said,” he noted. “I’d frame it slightly differently; it’s about understanding where machine intelligence kicks in, and where human judgment still matters. You’re always negotiating that balance. But yes, integrating into existing, familiar workflows is essential.”

As AI systems evolve, the methods for keeping humans involved might not always be elegant. But as this panel made clear, pragmatism often beats perfection when it comes to real-world deployment.

Orchestrating intelligence: How humans and AI learn to work together

As the conversation turned to orchestration, the complex collaboration between humans and machines, Paras offered a grounded view shaped by hard-earned experience.

“I did walk right into that,” he joked as the question was directed his way. But his answer made it clear he’s thought deeply about this dynamic.

For Paras, orchestration isn’t about building futuristic autonomy. It’s about defining roles and designing practical workflows. “There are definitely some tasks machines can handle on their own,” he said. “But the majority of the work we do involves figuring out which parts to automate and where humans still need to make decisions.”

He emphasized that the key is not treating the system as a black box, but instead fostering a loop in which humans improve the AI by correcting, contextualizing, and even retraining it over time. “The job of humans is to continue evolving these machines,” he said. “They don’t get better on their own.”

Paras also highlighted the importance of being able to pause and escalate AI systems when needed, especially when the model encounters something novel or ambiguous. He gave the example of defining a new item like “angular stemless glass.”

“You don’t want the model to just make it up and run with it,” he said. “You want it to scrape the internet, make its best guess, and then ask a human – Is this right?

That ability for the system to admit uncertainty is central to how Paras thinks about orchestration. “It’s like hiring someone new,” he said. “The smartest people still need to know when to raise their hand and say, I’m not sure about this. That’s the critical skill we need to train into our AI.”

Human + AI: Rethinking the roles and skills of knowledge workers
AI is reshaping knowledge work: changing roles, redefining skills, and putting human judgment at the heart of an automated future.
Humans in the loop:  How leading companies are building practical, trustworthy AI

Why determinism matters in AI orchestration

As the panel discussion on orchestration continued, Paras offered a grounded counterpoint to the rising excitement around agentic systems and autonomous AI decision-making.

“If I didn’t believe in a hybrid world between humans and machines, I wouldn’t be sitting here,” he said. “But let me be clear: at our firm, we have no interest in dabbling with agent tech.”

While many startups and venture-backed companies are chasing autonomous agents that can reason, plan, and act independently, Paras argued that this kind of complexity introduces too much uncertainty. “Agent tech creates too many steps, too many potential points of failure. And when the probability of failure multiplies across those steps, the overall chance of success drops.”

Instead, Paras advocated for an orchestration model grounded in determinism, where workflows are tightly scoped, predictable, and easily governed by clear logic.

“I love it when orchestration is deterministic,” he said. “That could mean a simple if/else statement. It could mean a human approver. What matters is that the system behaves in a way that’s traceable, testable, and reliable.”

At Morgan & Morgan, where the stakes are high and the work is bound by legal and procedural constraints, this type of orchestration isn’t just a preference, it’s a necessity. “We’re not in a startup trying to sell a dream,” Paras pointed out. “We’re in a firm where outcomes matter, and we need to know the system will work as expected.”

That pragmatic approach may not sound flashy, but it’s exactly what’s enabling his team to make real, measurable progress. By prioritizing reliability over autonomy, they’re proving that impactful AI doesn’t always need to be cutting-edge; it needs to be dependable.

The broader conversation circled back to how the most valuable AI systems are the ones that know when they don’t know and are built to ask for help.

Human limits and AI benchmarks

As the discussion shifted toward AI orchestration, Sara paused to reflect on a subtler but essential thread: the limits of human understanding and how they shape the systems we build.

“Chain-of-thought reasoning and explainability in AI are fascinating,” she said. “But what’s just as fascinating is that humans aren’t always that explainable either. We often don’t know why we know something.”

That tension between human intuition and machine logic quickly leads to deeper questions. “Whenever I talk about these topics, we’re always two questions away from a philosophy lecture,” Sara joked. “What are the limits of human intelligence? Who quality-assures our own thinking? Are some of the world’s most unsolvable math problems even well formulated?”

These aren’t just abstract musings. In the context of large language models (LLMs), they expose a critical challenge: Can we ever be sure that models are doing what we expect them to do?

This thread naturally led Sara into a critique of how the industry measures performance. “Right now, we’re living in a leaderboard-driven moment,” she said. “Top-of-the-leaderboard has become a kind of default OKR, a stand-in for state-of-the-art.”

But that raises deeper concerns about accountability and meaning. What does it really mean for a model to be aligned, safe, or trustworthy? And perhaps more importantly, who decides?

“I’m always curious and a bit skeptical about who’s defining and scoring these benchmarks,” Sara added. “Who’s grounding the definitions of concepts like ‘verbosity’ or ‘alignment’? What counts as success, and who gets to say?”

These questions aren’t just philosophical; they’re foundational. As AI systems become more central to how decisions are made, the frameworks we use to evaluate them will increasingly shape what we build, what we trust, and what we ignore.

Sara’s insight served as a quiet but powerful reminder: in the rush toward smarter models and more automation, human judgment, with all its limits, still defines the boundaries of AI progress.

The moving target of AI benchmarks and human judgment

As the panel delved deeper into the topic of alignment and accountability, one question emerged front and center: Who gets to define the benchmarks that guide AI development? And perhaps more importantly, are those benchmarks grounded in human understanding, or just technical performance?

The challenge, according to Paras, lies in the fact that alignment is not static.

“It depends,” he said. “Alignment is always evolving.” From his perspective, the most important factor is recognizing where and how human input should be embedded in the process.

Paras pointed to nuanced judgment as a key domain where humans remain indispensable. “You might have alignment today, but taste changes. Priorities shift. What was acceptable last month might feel outdated next quarter,” he explained. “LLMs are like snapshots; they reflect a frozen point in time. Humans bring the real-time context that models simply can’t.”

He also emphasized the limits of what AI models can process. “You can’t pass in everything to an LLM,” he noted. “Some of the most valuable context, institutional knowledge, soft cues, and ethical boundaries live outside the prompt window. That’s where human judgment steps in.”

This makes benchmark-setting especially tricky. As use cases become more complex and cultural expectations continue to evolve, the metrics we use to measure alignment, safety, or usefulness must evolve too. And that evolution, Paras argued, has to be guided by humans, not just product teams or model architects, but people with a deep understanding of the problem domain.

“It’s not a perfect science,” he admitted. “But as long as we keep humans close to the loop, especially where the stakes are high, we can keep grounding those benchmarks in reality.”

In short, defining success in AI is a constant process of recalibration, driven by human judgment, values, and the ever-shifting landscape of what we expect machines to do.

How to optimize LLM performance and output quality: A practical guide
Discover how to boost LLM performance and output quality with exclusive tips from Capital One’s Divisional Architect.
Humans in the loop:  How leading companies are building practical, trustworthy AI

The unsolved challenge of human representation in AI

As the panel explored the complexities of benchmarking and alignment, Sara turned the spotlight onto a fundamental and unresolved challenge: human representation in AI systems.

“Humans aren’t consistent with each other,” she began. “At Prolific, we care deeply about sourcing data from representative populations, but that creates tension. The more diverse your data sources are, the more disagreement you get on the ground truth. And that’s a really hard problem, I don’t think anyone has solved it yet.”

Most human-in-the-loop pipelines today rely on contributors from technologically advanced regions, creating a skewed perspective in what AI systems learn and reinforce. While it may be more convenient and accessible, the trade-off is systems that reflect a narrow slice of humanity and fail to generalize across cultures, languages, or values.

Paras expanded on that point by reminding the group of what LLMs really are at their core: “stochastic parrots.”

“They learn by mimicking human language,” he said. “So, if humans are biased, and we are, models will be biased too, often in the same ways or worse.” He drew a parallel to broader democratic ideals. “We all believe in democracy, but how many people actually feel represented by the people they vote for? If we haven’t figured out representation for humans, how can we expect to figure it out for language models?”

That philosophical thread, the limits of objectivity, the challenge of consensus, keeps resurfacing in AI conversations, and with good reason. As Paras put it, “Almost every problem in AI eventually becomes a philosophical question.”

Vaidya added a practical layer to the discussion, drawing on his experience with AI-generated content. Even when a model produces something that’s technically accurate or politically correct, it doesn’t mean it fits the intended use. “You have to ask: is this aligned with the tone, context, and audience we’re targeting?” he said.

Vaidya emphasized the value of multi-perspective prompting, asking the model to generate outputs as if different personas were viewing the same content. “What would this look like to a middle-aged person? What would a kid want to see? If the answers are wildly different, it’s a signal to bring in a human reviewer.”

In short, representation in AI is about surfacing variability, noticing it, and knowing when to intervene. And as all three panelists acknowledged, this challenge is still very much in progress.

Will humans always be in the loop?

As the panel drew to a close, the moderator posed a final, essential question: Will humans always be part of the AI loop? And if not, where might they be phased out?

It’s a question that sits at the heart of current debates around automation, accountability, and the future of work, and one that Paras didn’t shy away from.

“I hope we’re a part of the process and that this doesn’t turn into a Terminator situation anytime soon,” he joked. But humor aside, Paras emphasized that we’re still searching for an equilibrium between human judgment and machine autonomy. “We’re not there yet,” he said, “but we’re getting closer. As we build more of these systems, we’ll naturally find that balance.”

Paras pointed to a few specific use cases where agentic AI systems that can act autonomously without human intervention have started to show real promise. 

“Research and code generation are the two strongest examples so far,” he noted. “If you pull out the human for a while, those agents still manage to perform reasonably well.”

But beyond those narrow domains, full autonomy still raises red flags.

“The truth is, even if AI can technically handle something, we still need a human in the loop, not because we can do it better, but because we need accountability,” Paras explained. “We need someone to point the finger at when things go wrong.”

This is why, despite years of development in AI and machine learning, fields like law and medicine have remained cautious adopters. “It’s not that the technology isn’t there,” Paras said. “It’s that when things go south, someone has to be responsible.”

And that need for traceability, interpretability, and yes, someone to blame, is unlikely to disappear anytime soon.

In a world that increasingly leans on AI to make decisions, keeping humans in the loop may be less about capability and more about ethics, governance, and trust. And for now, that role remains irreplaceable.

Final thoughts: Accountability and skill loss in an AI-driven future

As the conversation on human oversight neared its conclusion, Vaidya added a final and urgent perspective: we may be underestimating what we lose when we over-automate.

“One thing to keep in mind,” he said, “is that when we talk about AI performance today, we’re often comparing the entry-level output of a model with the peak performance of a human.”

That’s a flawed baseline, he argued, because while the best human output has a known ceiling, AI capability is continuing to grow rapidly. “What models can do today compared to just six months ago is mind-boggling,” he added. “And it’s only accelerating.”

But Vaidya’s deeper concern wasn’t just about the rate of improvement; it was about the risk of atrophy.

“I was just chatting with someone outside,” he shared. “They said people are going to forget how to write. And that stuck with me.”

The fear is that humans will lose foundational skills before we’ve built the safeguards to do those tasks well through automation. “That’s the danger,” Vaidya said. “We’re handing over control while our own capabilities fade, and without proper checks, we won’t notice until it’s too late.”

To avoid that future, Vaidya made a call to action for leaders and organizations: to treat human skill preservation and accountability mechanisms as part of responsible AI adoption.

“It’s on us to design for that,” he said. “We need systems, formal or informal, that ensure we retain critical human capabilities even as we scale what machines can do.”

As AI continues to evolve, this perspective added a final layer of nuance to the panel’s core message: progress doesn’t just mean doing more, it means knowing what to protect along the way.

LLMOps in action: Streamlining the path from prototype to productionLLMOps in action: Streamlining the path from prototype to production

LLMOps in action: Streamlining the path from prototype to production

AIAInow is your chance to stream exclusive talks and presentations from our previous events, hosted by AI experts and industry leaders.

It’s a unique opportunity to watch the most sought-after AI content – ordinarily reserved for AIAI Pro members. Each stream delves deep into a key AI topic, industry trend, or case study. Simply sign up to watch any of our upcoming live sessions. 

🎥 Access exclusive talks and presentations
✅ Develop your understanding of key topics and trends
🗣 Hear from experienced AI leaders
👨‍💻 Enjoy regular in-depth sessions


LLMOps is emerging as a critical enabler for organizations deploying large language models at scale – bringing data scientists, engineers, and end-users into tighter, more effective collaboration.

Join us for a deep dive into the operational backbone of successful LLM deployments. From model design to monitoring in production, you’ll learn how to unlock the full potential of LLMs with streamlined processes, smarter tooling, and cross-functional alignment.


In this session, you’ll explore:

🧠 What LLMOps is – and why it’s essential for scalable AI success
🔄 The full LLMOps lifecycle: from experimentation to deployment and iteration
🤝 How to accelerate collaboration between data teams, engineers, and business users
🧰 Practical frameworks and tools for building robust LLM pipelines
📊 Real-world case studies showcasing high-impact LLM applications
⚠️ Common challenges in LLMOps – and how to overcome them

Whether you’re an AI practitioner, developer, or team leader, this session will equip you with the insights and strategies to operationalize LLMs with confidence.


Meet the speaker:

Samin Alnajafi, AI Solutions Engineer, Weights & Biases

Samin Alnajafi is an accomplished Pre-Sales AI Solutions Engineer at Weights & Biases, specializing in AI-powered solutions for enterprise clients across EMEA. With experience at tech leaders such as Snowflake and DataRobot, he excels in guiding organizations through the technical intricacies of machine learning and data-driven innovation. His expertise spans large language model operations, sales engineering, and AI solutions, making him a valued advisor in deploying transformative technologies for a range of industries.

LLMOps in action: Streamlining the path from prototype to production

How to optimize LLM performance and output quality: A practical guideHow to optimize LLM performance and output quality: A practical guide

How to optimize LLM performance and output quality: A practical guide

Have you ever asked generative AI the same question twice – only to get two very different answers?

That inconsistency can be frustrating, especially when you’re building systems meant to serve real users in high-stakes industries like finance, healthcare, or law. It’s a reminder that while foundation models are incredibly powerful, they’re far from perfect.

The truth is, large language models (LLMs) are fundamentally probabilistic. That means even slight variations in inputs – or sometimes, no variation at all – can result in unpredictable outputs. 

Combine that with the risk of hallucinations, limited domain knowledge, and changing data environments, and it becomes clear: to deliver high-quality, reliable AI experiences, we must go beyond the out-of-the-box setup.

So in this article, I’ll walk you through practical strategies I’ve seen work in the field to optimize LLM performance and output quality. From prompt engineering to retrieval-augmented generation, fine-tuning, and even building models from scratch, I’ll share real-world insights and analogies to help you choose the right approach for your use case.

Whether you’re deploying LLMs to enhance customer experiences, automate workflows, or improve internal tools, optimization is key to transforming potential into performance.

Let’s get started.

The problem with LLMs: Power, but with limitations

LLMs offer immense potential – but they’re far from perfect. One of the biggest pain points is the variability in output. As I mentioned, because these models are probabilistic, not deterministic, even the same input can lead to wildly different outputs. If you’ve ever had something work perfectly in development and then fall apart in a live demo, you know exactly what I mean.

Another well-known issue? Hallucinations. LLMs can be confidently wrong, presenting misinformation in a way that sounds convincing. This happens due to the noise and inconsistency in the training data. When models are trained on massive, general-purpose datasets, they lack the depth of understanding required for domain-specific tasks.

And that’s a key point – most foundation models have limited knowledge in specialized fields. 

Let me give you a simple analogy to ground this. Think of a foundation model like a general practitioner. They’re great at handling a wide range of common issues – colds, the flu, basic checkups. But if you need brain surgery, you’re going to see a specialist. In our world, that specialist is a fine-tuned model trained on domain-specific data.

With the right optimization strategies, we can transform these generalists into specialists – or at least arm them with the right tools, prompts, and context to deliver better results.

LLMOps in action: How we move GenAI from prototype to production
Struggling to get your GenAI prototype into production? Discover how LLMOps helps streamline deployment – fast, scalable, and reliable.
How to optimize LLM performance and output quality: A practical guide

Four paths to performance and quality

When it comes to improving LLM performance and output quality, I group the approaches into four key categories:

  1. Prompt engineering and in-context learning
  2. Retrieval-augmented generation (RAG)
  3. Fine-tuning foundation models
  4. Building your own model from scratch

Let’s look at each one.

1. Prompt engineering and in-context learning

Prompt engineering is all about crafting specific, structured instructions to guide a model’s output. It includes zero-shot, one-shot, and few-shot prompting, as well as advanced techniques like chain-of-thought and tree-of-thought prompting.

Sticking with our healthcare analogy, think of it like giving a detailed surgical plan to a neurosurgeon. You’re not changing the surgeon’s training, but you’re making sure they know exactly what to expect in this specific operation. You might even provide examples of previous similar surgeries – what went well, what didn’t. That’s the essence of in-context learning.

This approach is often the simplest and fastest way to improve output. It doesn’t require any changes to the underlying model. And honestly, you’d be surprised how much of a difference good prompting alone can make.

2. Retrieval-augmented generation (RAG)

RAG brings in two components: a retriever (essentially a search engine) that fetches relevant context, and a generator that combines that context with your prompt to produce the output.

Let’s go back to our surgeon. Would you want them to operate without access to your medical history, recent scans, or current health trends? Of course not. RAG is about giving your model that same kind of contextual awareness – it’s pulling in the right data at the right time.

This is especially useful when the knowledge base changes frequently, such as with news, regulations, or dynamic product data. Rather than retraining your model every time something changes, you let RAG pull in the latest info.

Human + AI: Rethinking the roles and skills of knowledge workersHuman + AI: Rethinking the roles and skills of knowledge workers

Human + AI:  Rethinking the roles and skills of knowledge workers

Artificial intelligence is not just another gadget; it’s already shaking up how white-collar jobs work.

McKinsey calls this shift an arrival at superagency, a space where machines think alongside people and the two groups spark new bursts of creativity and speed. Suddenly, the click-by-click chores-plowing through code, crunching spreadsheets, scrubbing datasets-are handled by bots, letting human brains leap to bigger questions.

Software developers, for instance, now spend more energy sketching big-picture road maps than wrestling syntax errors. Data scientists swap grinding model tweaks for debating which human questions an AI model really answers. In every corner of knowledge work, the quiet obey-yesterday-tasks face is evaporating.

The revamped role of the knowledge worker is equal parts translator, coach, and ethical guardian. Successful pros read the business landscape, nudge AI tools in the right direction, and steer output so it stays inside value lines. Human judgment steps in when computers run out of context, making it the real superpower of the partnership.

Some experts have taken to calling us AI strategists, a pivot away from the older task executor label. We use machines as sturdy scaffolding, letting us build fresher ideas faster while keeping accountability firmly in hand.

Skills for human-AI collaboration 

Living in a world that teams up humans and machines is no longer a sci-fi plot; it’s the daily grind for millions. A recent World Economic Forum report warns that nearly 39% of the skills we brag about on our resumes will be different by 2030, and tech is doing the heavy lifting. This figure represents great disruption but is down from 44% in 2023. 

Right now, big-ticket items like AI, Big Data, and cybersecurity sit at the head of the table, with cloud know-how and solid digital literacy close behind. Wages for professionals in those fields already show it. But numbers alone aren’t enough. Hiring managers keep shouting out for soft skills, too. Creative thinking, bounce-back strength, and plain old curiosity keep sneaking onto every shortlist we see.

The classic bedrock talents-leadership, talent management, and sharp-eyed analysis aren’t going anywhere either. Recruiters still want people who can steer teams and sway an audience while keeping facts straight. Long story short, the winning mix for tomorrow’s worker is hi-tech fluency slapped together with high-touch judgment.

Human + AI:  Rethinking the roles and skills of knowledge workers

Key skill areas include:

AI and data literacy

Understanding how to work with AI systems is imperative, from preparing effective prompts to interpreting a model’s output.

Workers must learn to gauge an AI suggestion and realise its value or shortcomings – whether based on accuracy, bias, or security concerns – and blend those insights into the final decision. Data and statistics will always be important.

Critical and strategic thinking

When routine tasks are automated, the human side of problem framing, strategy, and design can shine through.

This means developing, along with domain expertise, long-term thinking: choosing the right technology tools, architecting resilient systems, and carving out innovative ways to do things. The ability to envision strategic applications of AI for processes, rather than simply applying it to a single task, will set leaders apart.

Creativity and innovation

The human realm will generate fresh ideas, brainstorm new physical or digital products or services, and think outside the algorithmic box. According to the WEF data, roles that require these abilities, such as engineering new fintech solutions, envisioning new educational curricula with AI, or designing novel avenues for public service, are growing rapidly.

Emotional intelligence and ethics

The human right of empathy and social judgment cannot (yet) be emulated by AI. When many things are automated, skills like communication, collaboration, negotiation, and emotional nuance increase in desirability.

For instance, knowledge workers must manage the human side of operations to interpret and present results to various stakeholder groups, ensuring that the application of AI is within an ethical framework.

For example, UNDP’s “AI for Government” program trains government officials to deal with the legal, social, and bias aspects of AI issues, emphasising that AI deployment needs to be regulated and humanized by public servants. 

UK Government prioritizes AI for economic growth and services
The UK places AI at the center of its strategy for economic growth and improved public services, led by Science Secretary Peter Kyle.
Human + AI:  Rethinking the roles and skills of knowledge workers

Adaptability and lifelong learning

Technology continues to change rapidly, making learning for each role an absolute necessity. While adaptability, curiosity, and a growth mindset are focal points for experts, this is all about workers updating their skill sets again and again with the best practices of a new awareness of AI capabilities.

Organizations should promote a culture of continuous learning because WEF has noted that investing in upskilling programs is already a crucial guarantee of future-readiness.

In summary, we can state that skills are understood as the ability to coordinate across the human-AI frontier. Applied technical knowledge entails using the data, AI platform, and software tool, whereas higher-order thinking concerns analysis, strategy, ethics, and soft skills, such as communication and leadership.

The demand will be for those who can straddle these domains; hence, a finance analyst with a working knowledge of machine learning or a government officer familiar with data policy and stakeholder engagement is a rare find.

Organizational strategies for reskilling and transformation

Bridging the gap between people and machines requires more than shiny new software; it calls for a deliberate shift in how teams operate. Leaders who tinker with job titles but stop there risk missing the moment, so they must rethink workflows, back training with dollars, and keep learning in plain view instead of hiding it in quarterly targets. Several approaches are starting to catch on: 

Step back and redesign the whole operating model before you even think about flipping the automation switch. Slapping code onto a clunky process only glues the bad parts together. Grab a whiteboard, outline the steps again, and look for a cleaner route.

Process-mining software can trace every click and keystroke, exposing the stalled choke points that slow everyone down. With that map in hand, you can chop unnecessary work, slot in AI where it crunches numbers faster than a person would, and set humans loose on the judgment-heavy tasks only they can handle.

Take the story of IBM’s HR crew: they stripped the quarterly promotions grind of manual busywork by letting a custom Watsonx Orchestrate choreograph the data fetch, freeing the team to focus on tough calls about talent rather than hunting spreadsheets.

Invest as boldly in your people as you do in code, readying the workforce for the tremors AI and other waves of tech will send through the usual order.

Right now, HR is poised at a turning point, and the folks in those seats need to sketch how humans and machines will pull the organization’s heavy wagon forward.

Someone has to spot the high-value corners of the business, carve out the keystone positions, and map which skills, certainly not all of them, are going to matter most tomorrow.

That handiwork means trimming away repetitive errands that a bot can swallow, sometimes joining two titles into one, sometimes enlarging a job so it drags an AI dashboard into the daylight, and all the while cooking up quick-hit training that lets real people handle the meatier tasks.

Make skills the heart of your workforce plan, both for the challenges employees face now and for those still on the horizon. Leaders should worry less about flashy projects and focus on steadily lifting everyone’s tech know-how, because that solid base is what lets people branch out and try fresher things.

Many roles will not ever require serious coding, yet most team members will inevitably play with new-generation AI tools, so a little exposure goes a long way. When staff grasp the basics of artificial intelligence, they think critically, use the software sensibly, and even push back when something feels off.

Asking what data trained a model, how it arrived at a given output, or whether hidden bias lurks in the results stops being an academic debate and starts sounding like standard procedure.

Technology itself can play a role in personal growth. Point-and-click roadmaps that update as the market shifts show each person precisely what steps and what skills prepare them for the next rung on the ladder. Delta Airlines leaned on IBM Consulting to spin up just such a skills-first talent hub, and the IT crew there ramped up quickly on the hottest technologies.

Beyond today, every firm is staring at a yawning AI skills gap that won’t fix itself; filling that chasm demands deliberate hiring, strategic learning budgets, and a bit of patience while new talent rises through the ranks.

Can AI widen customer and employee engagement gap
AI has the potential to both widen and narrow the gap between employees and customers and ultimately cause brand impact, depending on how it is implemented and utilized within an organization.
Human + AI:  Rethinking the roles and skills of knowledge workers

Let employees steer their own work, and suddenly jobs stop feeling like drudgery. When teams get to pick the tasks they hate, the routine pain melts away.

Generative software picks up the monotonous load and hands people back the hours they used to waste repeating the same clicks. New openings pop up organically, since folks now have breathing room to try odd experiments that might just turn into career paths.

Open channels matter, so project lists, quick polls, even a spare Slack room where anybody can shout, “Hey, this job could use a robot, keep the ideas flowing.” A steady stream of feedback like that also acts as a low-key boot camp for future leaders because they get to practice owning change right on the frontline.

Encourage managers, interns, pretty much anyone, to mash up tech with wild ideas in their day-to-day and watch the ownership spread. 

We stand at the crossroads, holding a rare moment where policy can tip the balance toward people or toward code. The choice of landing squarely in human hands still looks daunting, but nobody gets dragged through this blindly. Rethink talent models so skill, spirit, and technology line up instead of running in separate lanes.

If those pieces fit together, the productivity spike follows, and so does the business value everyone keeps talking about. Skip that realignment, and the same tools that promise freedom end up sharpening the very collars we said were gone.

Career and management implications 

The workplace is changing in ways that ripple well beyond the latest technology demo. Managers now need to rethink what authority even means when AIs pull as much weight as people do.

Old command-and-control hierarchies simply don’t fit. Collaboration, trial-and-error, and plain visibility in how algorithms make decisions matter far more. McKinsey puts it bluntly: bold AI targets must drive new structures, fresh incentives, and tougher accountability rules. Product, ops, and data leaders often end up elbow-deep together, swapping insights on the fly until a working prototype surfaces. That blend feels messy, but it works. 

Careers are reshaping themselves right alongside management practices. Few professionals will climb the same straight ladder their parents did. Instead, a T-shaped profile, deep chops in finance, and wide comfort with AI tools become the norm.

New titles like AI product owner land beside more familiar ones on org charts, and folks are expected to slide from one box to another without fuss. Learning plans now stack competencies; a marketer who takes an AI analytics boot camp, then masters model auditing, suddenly qualifies for a much bigger role.

Oracle insists virtually every job will soon add the phrase using generative AI and supervision thereof to its description, and the company is probably correct.

Talent management is headed toward a sharper, skills-first focus. Where once longevity or pedigree ruled performance reviews, nimbleness, a learn-on-the-go mindset, and the knack for working in messy teams will start to tip the scales.

The World Economic Forum is already calling this shift skills intelligence, a phrase that keeps popping up in boardrooms. Some firms are trying out real-time peer checks and milestone pay jumps: show the muscle, move up. A handful of trailblazers have even hooked up AI engines that nudge people toward fresh roles or courses based on what they have just mastered. 

The workplace of tomorrow, powered by ever-smarter tools, is anything but static. Most futurists agree machines won’t erase jobs so much as carve them into new shapes. To keep the workforce from feeling whipsawed, leaders must step in early and steer the transition.

That means backing learning routes, whether it’s funding an ML cert or bringing in coaches, and lavishing praise on the uniquely human spark that tech can’t mimic. One industry sage puts it bluntly: the people who win will be knowledge workers who wield AI deftly but never lose sight of crafting solutions that are durable, valuable, and, above all, humane.

What’s next?

The workplace of tomorrow will blend people and artificial intelligence in ways that feel ordinary before long. 

Analysts, designers, coaches-everyone who trades in knowledge-will spend less time pushing pixels or filling sheets and more on insight, judgment, and plain old human connection. 

Companies that show real leadership will retrain staff, re-architect roles, and rethink how managers ask questions and give credit. For those that pull it off, productivity will inch upward and, just maybe, the teams doing the work will feel a bit more alive in the process.

Turning structured data into ROI with genAITurning structured data into ROI with genAI

Turning structured data  into ROI with genAI

At GigaSpaces, we’ve been in the data management game for over twenty years. We specialize in mission-critical, real-time software solutions, and over the past two decades, we’ve seen just how essential structured data is, whether it resides in a traditional database, an Excel sheet, or a humble CSV file.

Every company, regardless of its size or industry, relies on structured data. Maybe it’s the bulk of their operations, maybe just a slice, but either way, the need for fast, reliable access to that data is universal. 

Of course, what “real-time” means varies depending on the business. For some, it’s milliseconds; for others, hours might do. However, the expectation remains the same: access must be seamless, fast, and dependable.

The reality of enterprise data management

Let’s talk about the real challenge: enterprise data is hard to work with.

Even when structured, it’s often fragmented across systems, stored in outdated databases, or locked behind poorly configured infrastructure. Many organizations are still running on databases built twenty or thirty years ago. And as anyone who’s tried knows, fixing those systems is a monumental task, often one attempted only once and never repeated. Once bitten, twice shy.

So, how do we give business users the access they need without overhauling everything?

That’s where things get complicated. Enterprises have layered on workaround after workaround: ETL pipelines, data warehouses, operational data stores, data lakes, caching layers, you name it. Each is a patch or workaround designed to move, manipulate, and surface data for reporting or analysis.

But every added layer introduces more complexity, more latency, and more chances for something to go wrong.

Why traditional BI is no longer enough

For years, Business Intelligence (BI) has been the go-to solution for helping users visualize and interpret data. Everyone here is familiar with it; you probably have a BI tool running right now.

But BI isn’t enough anymore.

While it serves a purpose, traditional BI platforms only show a limited slice of the full data picture. They’re constrained by what’s been extracted, transformed, and loaded into the data warehouse. If it doesn’t make it into the warehouse, it won’t appear in the dashboard. That means critical context and nuance often get lost.

Analysts today need more than just static reports. They want to slice and dice data, follow up with deeper questions, drill down into specifics, and do all of this without filing a ticket or waiting days for a response. The modern business user expects the ability to interact with data in real time, in the flow of work.

So, the question is: can we actually enable that?

The evolution toward smarter data access

We’re in the middle of a major shift. While BI isn’t going away, traditional reports still serve their purpose; we’re clearly moving into the next phase of data interaction.

Natural language processing (NLP), AI copilots, and more dynamic querying interfaces are emerging. The goal? To simplify access. Imagine this: connect directly to your database, ask a business question in plain English, and get an instant answer.

That’s the vision.

And to a surprising extent, we’re starting to see it come to life. Consider the rise of Retrieval-Augmented Generation (RAG). How many of your companies are already experimenting with RAG? From what we’ve seen, that’s about 60–70%.

RAG is an exciting technique, especially when dealing with unstructured or semi-structured data. But let’s park that for now. We’ll return to it shortly.

AI-powered NLP enhances legal aid at Justice Connect
Justice Connect uses NLP technology to improve efficiency and provide faster legal aid to disadvantaged individuals in Australia.
Turning structured data  into ROI with genAI

Just ask: Making data truly accessible through NLQ

At GigaSpaces, our motto is simple: just ask.

We believe business users, whether they’re technical, semi-technical, or purely business-oriented, should be able to ask a question and get an answer instantly. If a CEO is heading into a board meeting and needs data on performance, risk, or opportunity, they should be able to ask for it directly.

Natural language querying (NLQ) makes this possible.

Imagine asking: What are my high-risk portfolios? Or: Show me client investment distribution. Or: How are we performing on compliance monitoring? No SQL, no dashboards: just a question, and an answer.

Interestingly, one of our recent prospects was from procurement. They weren’t the obvious audience for a data tool, but once they saw what NLQ could do, they wanted in. Why? Because they needed to compare vendor pricing, pulling internal data and matching it against public sources. It turns out, everyone in the organization wants fast, intelligent access to data.

Technology is great, but business value comes first

Let’s start with something even more important than the technology: business value.

As technologists, it’s easy to get swept up in the excitement of new tools. We play, we experiment, we test with R&D. But at the end of the day, what really matters is this: does it deliver value to the business?

If 80% of the organization adopts a tool, that’s great, but only if that adoption translates into measurable outcomes. Are we saving time? Reducing costs? Increasing decision velocity?

Too many tools are “nice to have.” They make your day 1% easier, but that’s not enough to justify the investment. With NLQ and technologies like RAG, we’re not just adding convenience. We’re flipping the paradigm.

With eRAG we’re turning everyday users into power users by letting them interact with data directly. That’s a big deal, especially when most organizations are still stuck in the mindset of “we’ve got a few reports, it is what it is.”

RAG and similar techniques are changing that. They’re making data feel accessible again. But here’s the catch: most RAG implementations are built on unstructured or semi-structured data, and the results aren’t real-time. You vectorize data, you query it, but you’re essentially querying yesterday’s data.

That’s fine for some use cases. But for healthcare, asset management, or retail? Yesterday’s data isn’t good enough. In those domains, a delay of even an hour can be too late.

So, how do we bridge that gap?

Applications of deep learning in healthcare
Jonathan Rubin, Senior Scientist at Philips Research, outlines various applications of ML & DL in healthcare, emphasizing their unique benefits.
Turning structured data  into ROI with genAI

Beyond RAG: Table-augmented generation and metadata intelligence

There is a better way.

One emerging approach is what some are calling Table-Augmented Generation (TAG). Think of it as applying the principles of RAG, but over structured metadata. We’re talking about vectorizing metadata, using graph RAG to identify relationships and connections, even between tables that aren’t explicitly linked.

It’s not just clever; it’s practical. Behind the scenes, we’re layering in traditional and semantic caching, schema linking, and building a semantic layer that stretches across multiple databases. Users can connect to two, three, or even fifty databases and build a unified semantic map without accessing the raw data.

And no, we’re not building a catalog or implementing MDM. If you’ve ever tried that, you know it’s a nightmare. This isn’t about solving the entire organization’s data taxonomy. It’s about solving for each business unit individually, allowing them to work in their own language, with their own vocabulary and semantics.

This flexibility is key, and yes, AI governance and security are baked in. That’s a whole topic on its own, but worth noting here: it’s not an afterthought.

The product behind all this is something we call Enterprise RAG, or eRAG. It exposes an API that users can integrate directly or call via REST. It’s model-agnostic, cloud-agnostic, and it just works. Check it out in more detail here.

Implementing a semantic layer that learns from users

Here’s the kicker: the solution is SaaS. Whether your data resides on-premises or in the cloud, we connect, extract the metadata, and build a semantic layer using five to seven behind-the-scenes techniques to optimize for comprehension and usability.

From the user’s point of view? All they have to do is ask a question.

Even better, those questions help train the system. When users respond with feedback, positive or negative, it fine-tunes the semantic layer. If something’s off, they can simply say so, in natural language, and the platform adapts.

This isn’t a developer tool. It’s not a Python library. It’s a human interface to structured data, and that’s where the magic is. Accuracy and simplicity, combined.

Whether you choose to build this kind of system yourself or opt for a ready-to-go solution, usability is key.

Final thoughts

As enterprises wrestle with fragmented data and rising expectations for speed and accessibility, the future of data management is clear: it’s about empowering every user to get answers in real time, without layers of complexity in the way. 

Technologies like NLQ, TAG, and Enterprise RAG are shifting the focus from infrastructure to impact, turning data from a bottleneck into a true business enabler. The path forward isn’t just about adopting smarter tools; it’s about reimagining how people and data interact, so that insight is always just a question away.

Ready to turn your data into answers? Discover how eRAG and NLQ can unlock real-time insight for your team. Reach out to learn more or see it in action.

How TigerEye is redefining AI-powered business intelligenceHow TigerEye is redefining AI-powered business intelligence

How TigerEye is redefining  AI-powered business intelligence

At the Generative AI Summit in Silicon Valley, Ralph Gootee, Co-founder of TigerEye, joined Tim Mitchell, Business Line Lead, Technology at the AI Accelerator Institute, to discuss how AI is transforming business intelligence for go-to-market teams.

In this interview, Ralph shares lessons learned from building two companies and explores how TigerEye is rethinking business intelligence from the ground up with AI, helping organizations unlock reliable, actionable insights without wasting resources on bespoke analytics.

Tim Mitchell: Ralph, it’s a pleasure to have you here. We’re on day two of the Generative AI Summit, part of AI Silicon Valley. You’re a huge part of the industry in Silicon Valley, so it’s amazing to have you join us. TigerEye is here as part of the event. Maybe for folks that aren’t familiar with the brand, you can just give a quick rundown of who you are and what you’re doing.

Ralph: I’m the co-founder of TigerEye – my second company. It’s exciting to be solving some of the problems we had with our first company, PlanGrid, in this one. We sold PlanGrid to Autodesk. I had a really good time building it. But when you’re building a company, you end up having many internal metrics to track, and a lot of things that happen with sales. So, we built a data team.

With TigerEye, we’re using AI to help build that data team for other companies, so they can learn from our past mistakes. We’re helping them build business intelligence that’s meant for go-to-market, so sales, marketing, and finance all together in one package.

Lessons learned from PlanGrid

Tim: What were some of those mistakes that you’re now helping others avoid?

Ralph: The biggest one was using highly skilled resources to build internal analytics, time that could’ve gone into building customer-facing features. We had talented data engineers figuring out sales metrics instead of enhancing our product. That’s a key learning we bring to TigerEye.

What makes TigerEye unique

Tim: If I can describe TigerEye in short as an AI analyst for business intelligence, what’s unique about TigerEye in that space?

Ralph: One of the things that’s unique is we were built from the ground up for AI. Where a lot of other companies are trying to tack on or figure out how they’re going to work with AI, TigerEye was built in generative AI as a world. Rather than relying on text or trying to gather up metrics that could cause hallucination, we actually write SQL from the bottom up. Our platform is built on SQL, so we can give answers that show your math. You can see why the win rate is that, and it will decrease over time.

Why Generative AI Summit matters

Tim: And what’s interesting about this conference for you?

Ralph: The conference brings together both big companies and startups. It’s really nice to have conversations with companies that have more mature data issues, versus startups that are just figuring out how their sales motions work.

The challenges of roadmapping in AI

Tim: You’re the co-founder, but as CTO, in what kind of capacity does the roadmapping cause you headaches? What does that process look like for a solution like this?

Ralph: In the AI world, roadmapping is challenging because it keeps getting so much better so quickly. The only thing you know for sure is you’re going to have a new model drop that really moves things forward. Thankfully for us, we solve what we see as the hardest part of AI, giving 100% accurate answers. We still haven’t seen foundational models do that on their own, but they get much better at writing code.

So the way we’ve taught to write SQL, and how we work with foundational models, both go into the roadmap. Another part is what foundational models we support. Right now, we work with OpenAI, Gemini, and Anthropic. Every time there’s a new model drop, we evaluate it and think about whether we want to bring that in.

Evaluating and choosing models

Tim: How do you choose which model to use?

Ralph: There are two major things. One, we have a full evaluation framework. Since we specialize in sales questions, we’ve seen thousands of sales questions, and we know what the answer should be and how to write the code for them. We run new models through that and see how they look.

The other is speed. Latency really matters; people want instant responses. Sometimes, even within the same vendor, the speed will vary model by model, but that latency is important.

The future of AI-powered business intelligence

Tim: What’s next for you guys? Any AI-powered revelations we can expect?

Ralph: We think AI is going to be solved first in business intelligence in deep vertical sections. It’s hard to imagine AI solving a Shopify company’s challenge and also a supply chain challenge for an enterprise. We’re going deep into verticals to see what new features AI has to understand.

For example, in sales, territory management is a big challenge: splitting up accounts, segmenting business. We’re teaching AI how to optimize territory distribution and have those conversations with our customers. That’s where a lot of our roadmap is right now.

Who’s adopting AI business intelligence?

Tim: With these new products, who are you seeing the biggest wins with?

Ralph: Startups and mid-market have a good risk tolerance for AI products. Enterprises, we can have deep conversations, but it’s a slower process. They’re forming their strategic AI teams but not getting deep into it yet. Startups and mid-market, especially AI companies themselves, are going full-bore.

Tim: And what are the risks or doubts that enterprises might have?

Ralph: Most enterprises have multiple AI teams, and they don’t even know it. It happened out of nowhere. Then they realize they need an AI visionary to lead those teams. The AI visionary is figuring out their job, and the enterprise is going through that process.

The best enterprises focus on delivering more value to their customers with fewer resources. We’re seeing that trend – how do I get my margins up and lower my costs?

Final thoughts

As AI continues to reshape business intelligence, it’s clear that success will come to those who focus on practical, reliable solutions that serve real go-to-market needs. 

TigerEye’s approach, combining AI’s power with transparent, verifiable analytics, offers a glimpse into the future of business intelligence: one where teams spend less time wrestling with data and more time acting on insights. 

As the technology evolves, the companies that go deep into vertical challenges and stay laser-focused on customer value will be the ones leading the charge.

Why agentic AI pilots fail and how to scale safelyWhy agentic AI pilots fail and how to scale safely

Why agentic AI pilots fail and how to scale safely

At the AI Accelerator Institute Summit in New York, Oren Michels, Co-founder and CEO of Barndoor AI, joined a one-on-one discussion with Alexander Puutio, Professor and Author, to explore a question facing every enterprise experimenting with AI: Why do so many AI pilots stall, and what will it take to unlock real value?

Barndoor AI launched in May 2025. Its mission addresses a gap Oren has seen over decades working in data access and security: how to secure and manage AI agents so they can deliver on their promise in enterprise settings.

“What you’re really here for is the discussion about AI access,” he told the audience. “There’s a real need to secure AI agents, and frankly, the approaches I’d seen so far didn’t make much sense to me.”

AI pilots are being built, but Oren was quick to point out that deployment is where the real challenges begin.

As Alexander noted:

“If you’ve been around AI, as I know everyone here has, you’ve seen it. There are pilots everywhere…”

Why AI pilots fail

Oren didn’t sugarcoat the current state of enterprise AI pilots:

“There are lots of them. And many are wrapping up now without much to show for it.”

Alexander echoed that hard truth with a personal story. In a Forbes column, he’d featured a CEO who was bullish on AI, front-loading pilots to automate calendars and streamline doctor communications. But just three months later, the same CEO emailed him privately:

“Alex, I need to talk to you about the pilot.”

The reality?

“The whole thing went off the rails. Nothing worked, and the vendor pulled out.”

Why is this happening? According to Oren, it starts with a misconception about how AI fits into real work:

“When we talk about AI today, people often think of large language models, like ChatGPT. And that means a chat interface.”

But this assumption is flawed.

“That interface presumes that people do their jobs by chatting with a smart PhD about what to do. That’s just not how most people work.”

Oren explained that most employees engage with specific tools and data. They apply their training, gather information, and produce work products. That’s where current AI deployments miss the mark, except in coding:

“Coding is one of those rare jobs where you do hand over your work to a smart expert and say, ‘Here’s my code, it’s broken, help me fix it.’ LLMs are great at that. But for most functions, we need AI that engages with tools the way people do, so it can do useful, interesting work.”

Why agentic AI pilots fail and how to scale safely

The promise of agents and the real bottleneck

Alexander pointed to early agentic AI experiments, like Devin, touted as the first AI software engineer:

“When you actually looked at what the agent did, it didn’t really do that much, right?”

Oren agreed. The issue wasn’t the technology; it was the disconnect between what people expect agents to do and how they actually work:

“There’s this promise that someone like Joe in finance will know how to tell an agent to do something useful. Joe’s probably a fantastic finance professional, but he’s not part of that subset who knows how to instruct computers effectively.”

He pointed to Zapier as proof: a no-code tool that didn’t replace coders.

“The real challenge isn’t just knowing how to code. It’s seeing these powerful tools, understanding the business problems, and figuring out how to connect the two. That’s where value comes from.”

And too often, Oren noted, companies think money alone will solve it. CEOs invest heavily and end up with nothing to show because:

“Maybe the human process, or how people actually use these tools, just isn’t working.”

This brings us to what Oren called the real bottleneck: access, not just to AI, but what AI can access.

“We give humans access based on who they are, what they’re doing, and how much we trust them. But AI hasn’t followed that same path. Just having AI log in like a human and click around isn’t that interesting; that’s just scaled-up robotic process automation.”

Instead, enterprises need to define:

  • What they trust an agent to do
  • The rights of the human behind it
  • The rules of the system it’s interacting with
  • And the specific task at hand

These intersect to form what Oren called a multi-dimensional access problem:

“Without granular controls, you end up either dialing agents back so much they’re less useful than humans, or you risk over-permissioning. The goal is to make them more useful than humans.”

Why specialized agents are the future (and how to manage the “mess”)

As the conversation shifted to access, Alexander posed a question many AI leaders grapple with: When we think about role- and permission-based access, are we really debating the edges of agentic AI?

“Should agents be able to touch everything, like deleting Salesforce records, or are we heading toward hyper-niche agents?”

Oren was clear on where he stands:

“I’d be one of those people making the case for niche agents. It’s the same as how we hire humans. You don’t hire one person to do everything. There’s not going to be a single AI that rules them all, no matter how good it is.”

Instead, as companies evolve, they’ll seek out specialized tools, just like they hire specialized people.

“You wouldn’t hire a bunch of generalists and hope the company runs smoothly. The same will happen with agents.”

But with specialization comes complexity. Alexander put it bluntly:

“How do we manage the mess? Because, let’s face it, there’s going to be a mess.”

Oren welcomed that reality:

“The mess is actually a good thing. We already have it with software. But you don’t manage it agent by agent, there will be way too many.”

The key is centralized management:

  • A single place to manage all agents
  • Controls based on what agents are trying to do, and the role of the human behind them
  • System-specific safeguards, because admins (like your Salesforce or HR lead) need to manage what’s happening in their domain

“If each agent or its builder had its own way of handling security, that wouldn’t be sustainable. And you don’t want agents or their creators deciding their own security protocols – that’s probably not a great idea.”

Why agentic AI pilots fail and how to scale safely

Why AI agents need guardrails and onboarding

The question of accountability loomed large. When humans manage fleets of AI agents, where does responsibility sit?

Oren was clear:

“There’s human accountability. But we have to remember: humans don’t always know what the agents are going to do, or how they’re going to do it. If we’ve learned anything about AI so far, it’s that it can have a bit of a mind of its own.”

He likened agents to enthusiastic interns – eager to prove themselves, sometimes overstepping in their zeal:

“They’ll do everything they can to impress. And that’s where guardrails come in. But it’s hard to build those guardrails inside the agent. They’re crafty. They’ll often find ways around internal limits.”

The smarter approach? Start small:

  • Give agents a limited scope.
  • Watch their behavior.
  • Extend trust gradually, just as you would with a human intern who earns more responsibility over time.

This led to the next logical step: onboarding. Alexander asked whether bringing in AI agents is like an HR function.

Oren agreed and shared a great metaphor from Nvidia’s Jensen Huang:

“You have your biological workforce, managed by HR, and your agent workforce, managed by IT.”

Just as companies use HR systems to manage people, they’ll need systems to manage, deploy, and train AI agents so they’re efficient and, as Alexander added, safe.

How to manage AI’s intent

Speed is one of AI’s greatest strengths and risks. As Oren put it:

“Agents are, at their core, computers, and they can do things very, very fast. One CISO I know described it perfectly: she wants to limit the blast radius of the agents when they come in.”

That idea resonated. Alexander shared a similar reflection from a security company CEO:

“AI can sometimes be absolutely benevolent, no problem at all, but you still want to track who’s doing what and who’s accessing what. It could be malicious. Or it could be well-intentioned but doing the wrong thing.”

Real-world examples abound from models like Anthropic’s Claude “snitching” on users, to AI trying to protect its own code base in unintended ways.

So, how do we manage the intent of AI agents?

Oren drew a striking contrast to traditional computing:

“Historically, computers did exactly what you told them; whether that’s what you wanted or not. But that’s not entirely true anymore. With AI, sometimes they won’t do exactly what you tell them to.”

That makes managing them a mix of art and science. And, as Oren pointed out, this isn’t something you can expect every employee to master:

“It’s not going to be Joe in finance spinning up an agent to do their job. These tools are too powerful, too complex. Deploying them effectively takes expertise.”

Why pilots stall and how innovation spreads

If agents could truly do it all, Oren quipped:

“They wouldn’t need us here, they’d just handle it all on their own.”

But the reality is different. When Alexander asked about governance failures, Oren pointed to a subtle but powerful cause of failure. Not reckless deployments, but inertia:

“The failure I see isn’t poor governance in action, it’s what’s not happening. Companies are reluctant to really turn these agents loose because they don’t have the visibility or control they need.”

The result? Pilot projects that go nowhere.

“It’s like hiring incredibly talented people but not giving them access to the tools they need to do their jobs and then being disappointed with the results.”

In contrast, successful AI deployments come from open organizations that grant broader access and trust. But Oren acknowledged the catch:

“The larger you get as a company, the harder it is to pull off. You can’t run a large enterprise that way.”

So, where does innovation come from?

“It’s bottom-up, but also outside-in. You’ll see visionary teams build something cool, showcase it, and suddenly everyone wants it. That’s how adoption spreads, just like in the API world.”

And to bring that innovation into safe, scalable practice:

  • Start with governance and security so people feel safe experimenting.
  • Engage both internal teams and outside experts.
  • Focus on solving real business problems, not just deploying tech for its own sake.

Oren put it bluntly:

“CISOs and CTOs, they don’t really have an AI problem. But the people creating products, selling them, managing finance – they need AI to stay competitive.”

Why agentic AI pilots fail and how to scale safely

Trusting AI from an exoskeleton to an independent agent

The conversation circled back to a critical theme: trust.

Alexander shared a reflection that resonated deeply:

“Before ChatGPT, the human experience with computers was like Excel: one plus one is always two. If something went wrong, you assumed it was your mistake. The computer was always right.”

But now, AI behaves in ways that can feel unpredictable, even untrustworthy. What does that mean for how we work with it?

Oren saw this shift as a feature, not a flaw:

“If AI were completely linear, you’d just be programming, and that’s not what AI is meant to be. These models are trained on the entirety of human knowledge. You want them to go off and find interesting, different ways of looking at problems.”

The power of AI, he argued, comes not from treating it like Google, but from engaging it in a process:

“My son works in science at a biotech startup in Denmark. He uses AI not to get the answer, but to have a conversation about how to find the answer. That’s the mindset that leads to success with AI.”

And that mindset extends to gradual trust:

“Start by assigning low-risk tasks. Keep a human in the loop. As the AI delivers better results over time, you can reduce that oversight. Eventually, for certain tasks, you can take the human out of the loop.”

Oren summed it up with a powerful metaphor:

“You start with AI as an exoskeleton; it makes you bigger, stronger, faster. And over time, it can become more like the robot that does the work itself.”

The spectrum of agentic AI and why access controls are key

Alexander tied the conversation to a helpful analogy from a JP Morgan CTO: agentic AI isn’t binary.

“There’s no clear 0 or 1 where something is agentic or isn’t. At one end, you have a fully trusted system of agents. On the other hand, maybe it’s just a one-shot prompt or classic RPA with a bit of machine learning on top.”

Oren agreed:

“You’ve described the two ends of the spectrum perfectly. And with all automation, the key is deciding where on that spectrum we’re comfortable operating.”

He compared it to self-driving cars:

“Level 1 is cruise control; Level 5 is full autonomy. We’re comfortable somewhere in the middle right now. It’ll be the same with agents. As they get better, and as we get better at guiding them, we’ll move further along that spectrum.”

And how do you navigate that safely? Oren returned to the importance of access controls:

“When you control access outside the agent layer, you don’t have to worry as much about what’s happening inside. The agent can’t see or write to anything it isn’t allowed to.”

That approach offers two critical safeguards:

  • It prevents unintended actions.
  • It provides visibility into attempts, showing when an agent tries to do something it shouldn’t, so teams can adjust the instructions before harm is done.

“That lets you figure out what you’re telling it that’s prompting that behavior, without letting it break anything.”

The business imperative and the myth of the chat interface

At the enterprise level, Oren emphasized that the rise of the Chief AI Officer reflects a deeper truth:

“Someone in the company recognized that we need to figure this out to compete. Either you solve this before your competitors and gain an advantage, or you fall behind.”

And that, Oren stressed, is why this is not just a technology problem, it’s a business problem:

“You’re using technology, but you’re solving business challenges. You need to engage the people who have the problems, and the folks solving them, and figure out how AI can make that more efficient.”

When Alexander asked about the biggest myth in AI enterprise adoption, Oren didn’t hesitate:

“That the chat interface will win.”

While coders love chat interfaces because they can feed in code and get help most employees don’t work that way:

“Most people don’t do their jobs through chat-like interaction. And most don’t know how to use a chat interface effectively. They see a box, like Google search, and that doesn’t work well with AI.”

He predicted that within five years, chat interfaces will be niche. The real value?

“Agents doing useful things behind the scenes.”

How to scale AI safely

Finally, in response to a closing question from Alexander, Oren offered practical advice for enterprises looking to scale AI safely:

“Visibility is key. We don’t fully understand what happens inside these models; no one really does. Any tool that claims it can guarantee behavior inside the model? I’m skeptical.”

Instead, Oren urged companies to focus on where they can act:

“Manage what goes into the tools, and what comes out. Don’t believe you can control what happens within them.”

Final thoughts

As enterprises navigate the complex realities of AI adoption, one thing is clear: success won’t come from chasing hype or hoping a chat interface will magically solve business challenges. 

It will come from building thoughtful guardrails, designing specialized agents, and aligning AI initiatives with real-world workflows and risks. The future belongs to companies that strike the right balance; trusting AI enough to unlock its potential, but governing it wisely to protect their business. 

The path forward isn’t about replacing people; it’s about empowering them with AI that truly works with them, not just beside them.

CAP theorem in ML: Consistency vs. availabilityCAP theorem in ML: Consistency vs. availability

CAP theorem in ML:  Consistency vs. availability

The CAP theorem has long been the unavoidable reality check for distributed database architects. However, as machine learning (ML) evolves from isolated model training to complex, distributed pipelines operating in real-time, ML engineers are discovering that these same fundamental constraints also apply to their systems. What was once considered primarily a database concern has become increasingly relevant in the AI engineering landscape.

Modern ML systems span multiple nodes, process terabytes of data, and increasingly need to make predictions with sub-second latency. In this distributed reality, the trade-offs between consistency, availability, and partition tolerance aren’t academic — they’re engineering decisions that directly impact model performance, user experience, and business outcomes.

This article explores how the CAP theorem manifests in AI/ML pipelines, examining specific components where these trade-offs become critical decision points. By understanding these constraints, ML engineers can make better architectural choices that align with their specific requirements rather than fighting against fundamental distributed systems limitations.

Quick recap: What is the CAP theorem?

The CAP theorem, formulated by Eric Brewer in 2000, states that in a distributed data system, you can guarantee at most two of these three properties simultaneously:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a non-error response (though not necessarily the most recent data)
  • Partition tolerance: The system continues to operate despite network failures between nodes

Traditional database examples illustrate these trade-offs clearly:

  • CA systems: Traditional relational databases like PostgreSQL prioritize consistency and availability but struggle when network partitions occur.
  • CP systems: Databases like HBase or MongoDB (in certain configurations) prioritize consistency over availability when partitions happen.
  • AP systems: Cassandra and DynamoDB favor availability and partition tolerance, adopting eventual consistency models.

What’s interesting is that these same trade-offs don’t just apply to databases — they’re increasingly critical considerations in distributed ML systems, from data pipelines to model serving infrastructure.

The great web rebuild: Infrastructure for the AI agent era
AI agents require rethinking trust, authentication, and security—see how Agent Passports and new protocols will redefine online interactions.
CAP theorem in ML:  Consistency vs. availability

Where the CAP theorem shows up in ML pipelines

Data ingestion and processing

The first stage where CAP trade-offs appear is in data collection and processing pipelines:

Stream processing (AP bias): Real-time data pipelines using Kafka, Kinesis, or Pulsar prioritize availability and partition tolerance. They’ll continue accepting events during network issues, but may process them out of order or duplicate them, creating consistency challenges for downstream ML systems.

Batch processing (CP bias): Traditional ETL jobs using Spark, Airflow, or similar tools prioritize consistency — each batch represents a coherent snapshot of data at processing time. However, they sacrifice availability by processing data in discrete windows rather than continuously.

This fundamental tension explains why Lambda and Kappa architectures emerged — they’re attempts to balance these CAP trade-offs by combining stream and batch approaches.

Feature Stores

Feature stores sit at the heart of modern ML systems, and they face particularly acute CAP theorem challenges.

Training-serving skew: One of the core features of feature stores is ensuring consistency between training and serving environments. However, achieving this while maintaining high availability during network partitions is extraordinarily difficult.

Consider a global feature store serving multiple regions: Do you prioritize consistency by ensuring all features are identical across regions (risking unavailability during network issues)? Or do you favor availability by allowing regions to diverge temporarily (risking inconsistent predictions)?

Model training

Distributed training introduces another domain where CAP trade-offs become evident:

Synchronous SGD (CP bias): Frameworks like distributed TensorFlow with synchronous updates prioritize consistency of parameters across workers, but can become unavailable if some workers slow down or disconnect.

Asynchronous SGD (AP bias): Allows training to continue even when some workers are unavailable but sacrifices parameter consistency, potentially affecting convergence.

Federated learning: Perhaps the clearest example of CAP in training — heavily favors partition tolerance (devices come and go) and availability (training continues regardless) at the expense of global model consistency.

Model serving

When deploying models to production, CAP trade-offs directly impact user experience:

Hot deployments vs. consistency: Rolling updates to models can lead to inconsistent predictions during deployment windows — some requests hit the old model, some the new one.

A/B testing: How do you ensure users consistently see the same model variant? This becomes a classic consistency challenge in distributed serving.

Model versioning: Immediate rollbacks vs. ensuring all servers have the exact same model version is a clear availability-consistency tension.

Superintelligent language models: A new era of artificial cognition
The rise of large language models (LLMs) is pushing the boundaries of AI, sparking new debates on the future and ethics of artificial general intelligence.
CAP theorem in ML:  Consistency vs. availability

Case studies: CAP trade-offs in production ML systems

Real-time recommendation systems (AP bias)

E-commerce and content platforms typically favor availability and partition tolerance in their recommendation systems. If the recommendation service is momentarily unable to access the latest user interaction data due to network issues, most businesses would rather serve slightly outdated recommendations than no recommendations at all.

Netflix, for example, has explicitly designed its recommendation architecture to degrade gracefully, falling back to increasingly generic recommendations rather than failing if personalization data is unavailable.

Healthcare diagnostic systems (CP bias)

In contrast, ML systems for healthcare diagnostics typically prioritize consistency over availability. Medical diagnostic systems can’t afford to make predictions based on potentially outdated information.

A healthcare ML system might refuse to generate predictions rather than risk inconsistent results when some data sources are unavailable — a clear CP choice prioritizing safety over availability.

Edge ML for IoT devices (AP bias)

IoT deployments with on-device inference must handle frequent network partitions as devices move in and out of connectivity. These systems typically adopt AP strategies:

  • Locally cached models that operate independently
  • Asynchronous model updates when connectivity is available
  • Local data collection with eventual consistency when syncing to the cloud

Google’s Live Transcribe for hearing impairment uses this approach — the speech recognition model runs entirely on-device, prioritizing availability even when disconnected, with model updates happening eventually when connectivity is restored.

Strategies to balance CAP in ML systems

Given these constraints, how can ML engineers build systems that best navigate CAP trade-offs?

Graceful degradation

Design ML systems that can operate at varying levels of capability depending on data freshness and availability:

  • Fall back to simpler models when real-time features are unavailable
  • Use confidence scores to adjust prediction behavior based on data completeness
  • Implement tiered timeout policies for feature lookups

DoorDash’s ML platform, for example, incorporates multiple fallback layers for their delivery time prediction models — from a fully-featured real-time model to progressively simpler models based on what data is available within strict latency budgets.

Hybrid architectures

Combine approaches that make different CAP trade-offs:

  • Lambda architecture: Use batch processing (CP) for correctness and stream processing (AP) for recency
  • Feature store tiering: Store consistency-critical features differently from availability-critical ones
  • Materialized views: Pre-compute and cache certain feature combinations to improve availability without sacrificing consistency

Uber’s Michelangelo platform exemplifies this approach, maintaining both real-time and batch paths for feature generation and model serving.

Consistency-aware training

Build consistency challenges directly into the training process:

  • Train with artificially delayed or missing features to make models robust to these conditions
  • Use data augmentation to simulate feature inconsistency scenarios
  • Incorporate timestamp information as explicit model inputs

Facebook’s recommendation systems are trained with awareness of feature staleness, allowing the models to adjust predictions based on the freshness of available signals.

Intelligent caching with TTLs

Implement caching policies that explicitly acknowledge the consistency-availability trade-off:

  • Use time-to-live (TTL) values based on feature volatility
  • Implement semantic caching that understands which features can tolerate staleness
  • Adjust cache policies dynamically based on system conditions
How to build autonomous AI agent with Google A2A protocol
How to build autonomous AI agent with Google A2A protocol, Google Agent Development Kit (ADK), Llama Prompt Guard 2, Gemma 3, and Gemini 2.0 Flash.
CAP theorem in ML:  Consistency vs. availability

Design principles for CAP-aware ML systems

Understand your critical path

Not all parts of your ML system have the same CAP requirements:

  1. Map your ML pipeline components and identify where consistency matters most vs. where availability is crucial
  2. Distinguish between features that genuinely impact predictions and those that are marginal
  3. Quantify the impact of staleness or unavailability for different data sources

Align with business requirements

The right CAP trade-offs depend entirely on your specific use case:

  • Revenue impact of unavailability: If ML system downtime directly impacts revenue (e.g., payment fraud detection), you might prioritize availability
  • Cost of inconsistency: If inconsistent predictions could cause safety issues or compliance violations, consistency might take precedence
  • User expectations: Some applications (like social media) can tolerate inconsistency better than others (like banking)

Monitor and observe

Build observability that helps you understand CAP trade-offs in production:

  • Track feature freshness and availability as explicit metrics
  • Measure prediction consistency across system components
  • Monitor how often fallbacks are triggered and their impact

Wondering where we’re headed next?

Our in-person event calendar is packed with opportunities to connect, learn, and collaborate with peers and industry leaders. Check out where we’ll be and join us on the road.

AI Accelerator Institute | Summit calendar
Unite with applied AI’s builders & execs. Join Generative AI Summit, Agentic AI Summit, LLMOps Summit & Chief AI Officer Summit in a city near you.
CAP theorem in ML:  Consistency vs. availability

How to build autonomous AI agent with Google A2A protocolHow to build autonomous AI agent with Google A2A protocol

Why do we need autonomous AI agents?

How to build autonomous AI agent with Google A2A protocol

Picture this: it’s 3 a.m., and a customer on the other side of the globe urgently needs help with their account. A traditional chatbot would wake up your support team with an escalation. But what if your AI agent could handle the request autonomously, safely, and correctly? That’s the dream, right?

The reality is that most AI agents today are like teenagers with learner’s permits; they need constant supervision. They might accidentally promise a customer a large refund (oops!) or fall for a clever prompt injection that makes them spill company secrets or customers’ sensitive data. Not ideal.

This is where Double Validation comes in. Think of it as giving your AI agent both a security guard at the entrance (input validation) and a quality control inspector at the exit (output validation). With these safeguards at a minimum in place, your agent can operate autonomously without causing PR nightmares.

How did I come up with the Double Validation idea?

These days, we hear a lot of talk about AI agents. I asked myself, “What is the biggest challenge preventing the widespread adoption of AI agents?” I concluded that the answer is trustworthy autonomy. When AI agents can be trusted, they can be scaled and adopted more readily. Conversely, if an agent’s autonomy is limited, it requires increased human involvement, which is costly and inhibits adoption.

Next, I considered the minimal requirements for an AI agent to be autonomous. I concluded that an autonomous AI agent needs, at minimum, two components:

  1. Input validation – to sanitize input, protect against jailbreaks, data poisoning, and harmful content.
  2. Output validation – to sanitize output, ensure brand alignment, and mitigate hallucinations.

I call this system Double Validation.

Given these insights, I built a proof-of-concept project to research the Double Validation concept.

In this article, we’ll explore how to implement Double Validation by building a multiagent system with the Google A2A protocol, the Google Agent Development Kit (ADK), Llama Prompt Guard 2, Gemma 3, and Gemini 2.0 Flash, and how to optimize it for production, specifically, deploying it on Google Vertex AI.

For input validation, I chose Llama Prompt Guard 2 just as an article about it reached me at the perfect time. I selected this model because it is specifically designed to guard against prompt injections and jailbreaks. It is also very small; the largest variant, Llama Prompt Guard 2 86M, has only 86 million parameters, so it can be downloaded and included in a Docker image for cloud deployment, improving latency. That is exactly what I did, as you’ll see later in this article.

The complete code for this project is available at github.com/alexey-tyurin/a2a-double-validation

How to build it?

The architecture uses four specialized agents that communicate through the Google A2A protocol, each with a specific role:

How to build autonomous AI agent with Google A2A protocol
Image generated by author

Here’s how each agent contributes to the system:

  1. Manager Agent: The orchestra conductor, coordinating the flow between agents
  2. Safeguard Agent: The bouncer, checking for prompt injections using Llama Prompt Guard 2
  3. Processor Agent: The worker bee, processing legitimate queries with Gemma 3
  4. Critic Agent: The editor, evaluating responses for completeness and validity using Gemini 2.0 Flash

I chose Gemma 3 for the Processor Agent because it is small, fast, and can be fine-tuned with your data if needed — an ideal candidate for production. Google currently supports nine (!) different frameworks or methods for finetuning Gemma; see Google’s documentation for details.

I chose Gemini 2.0 Flash for the Critic Agent because it is intelligent enough to act as a critic, yet significantly faster and cheaper than the larger Gemini 2.5 Pro Preview model. Model choice depends on your requirements; in my tests, Gemini 2.0 Flash performed well.

I deliberately used different models for the Processor and Critic Agents to avoid bias — an LLM may judge its own output differently from another model’s.

Let me show you the key implementation of the Safeguard Agent:

How to build autonomous AI agent with Google A2A protocol

Plan for actions

The workflow follows a clear, production-ready pattern:

  1. User sends query → The Manager Agent receives it.
  2. Safety check → The Manager forwards the query to the Safeguard Agent.
  3. Vulnerability assessment → Llama Prompt Guard 2 analyzes the input.
  4. Processing → If the input is safe, the Processor Agent handles the query with Gemma 3.
  5. Quality control → The Critic Agent evaluates the response.
  6. Delivery → The Manager Agent returns the validated response to the user.

Below is the Manager Agent’s coordination logic:

How to build autonomous AI agent with Google A2A protocol

Time to build it

Ready to roll up your sleeves? Here’s your production-ready roadmap:

Local deployment

1. Environment setup 

How to build autonomous AI agent with Google A2A protocol

2. Configure API keys 

How to build autonomous AI agent with Google A2A protocol

3. Download Llama Prompt Guard 2 

This is the clever part – we download the model once when we start Agent Critic for the first time and package it in our Docker image for cloud deployment:

How to build autonomous AI agent with Google A2A protocol

Important Note about Llama Prompt Guard 2: To use the Llama Prompt Guard 2 model, you must:

  1. Fill out the “LLAMA 4 COMMUNITY LICENSE AGREEMENT” at https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M
  2. Get your request to access this repository approved by Meta
  3. Only after approval will you be able to download and use this model

4. Local testing 

How to build autonomous AI agent with Google A2A protocol

Screenshot for running main.py

 

How to build autonomous AI agent with Google A2A protocol
Image generated by author

Screenshot for running client

 

How to build autonomous AI agent with Google A2A protocol
Image generated by author

Screenshot for running tests

 

How to build autonomous AI agent with Google A2A protocol
Image generated by author

Production Deployment 

Here’s where it gets interesting. We optimize for production by including the Llama model in the Docker image:

How to build autonomous AI agent with Google A2A protocol

1. Setup Cloud Project in Cloud Shell Terminal

  1. Access Google Cloud Console: Go to https://console.cloud.google.com
  2. Open Cloud Shell: Click the Cloud Shell icon (terminal icon) in the top right corner of the Google Cloud Console
  3. Authenticate with Google Cloud:
How to build autonomous AI agent with Google A2A protocol
  1. Create or select a project:
How to build autonomous AI agent with Google A2A protocol
  1. Enable required APIs:
How to build autonomous AI agent with Google A2A protocol

3. Setup Vertex AI Permissions

Grant your account the necessary permissions for Vertex AI and related services:

How to build autonomous AI agent with Google A2A protocol

3. Create and Setup VM Instance

Cloud Shell will not work for this project as Cloud Shell is limited to 5GB of disk space. This project needs more than 30GB of disk space to build Docker images, get all dependencies, and download the Llama Prompt Guard 2 model locally. So, you need to use a dedicated VM instead of Cloud Shell.

How to build autonomous AI agent with Google A2A protocol

4. Connect to VM

How to build autonomous AI agent with Google A2A protocol

Screenshot for VM 

How to build autonomous AI agent with Google A2A protocol
Image generated by author

5. Clone Repository

How to build autonomous AI agent with Google A2A protocol

6. Deployment Steps

How to build autonomous AI agent with Google A2A protocol

Screenshot for agents in cloud 

How to build autonomous AI agent with Google A2A protocol
Image generated by author

7. Testing 

How to build autonomous AI agent with Google A2A protocol

Screenshot for running client in Google Vertex AI

How to build autonomous AI agent with Google A2A protocol
Image generated by author
How to build autonomous AI agent with Google A2A protocol

Screenshot for running tests in Google Vertex AI

How to build autonomous AI agent with Google A2A protocol
Image generated by author

Alternatives to Solution

Let’s be honest – there are other ways to skin this cat:

  1. Single Model Approach: Use a large LLM like GPT-4 with careful system prompts
    • Simpler but less specialized
    • Higher risk of prompt injection
    • Risk of LLM bias in using the same LLM for answer generation and its criticism
  2. Monolith approach: Use all flows in just one agent
    • Latency is better
    • Cannot scale and evolve input validation and output validation independently
    • More complex code, as it is all bundled together
  3. Rule-Based Filtering: Traditional regex and keyword filtering
    • Faster but less intelligent
    • High false positive rate
  4. Commercial Solutions: Services like Azure Content Moderator or Google Model Armor
    • Easier to implement but less customizable
    • On contrary, Llama Prompt Guard 2 model can be fine-tuned with the customer’s data
    • Ongoing subscription costs
  5. Open-Source Alternatives: Guardrails AI or NeMo Guardrails
    • Good frameworks, but require more setup
    • Less specialized for prompt injection

Lessons Learned

1. Llama Prompt Guard 2 86M has blind spots. During testing, certain jailbreak prompts, such as:

How to build autonomous AI agent with Google A2A protocol

And

How to build autonomous AI agent with Google A2A protocol

were not flagged as malicious. Consider fine-tuning the model with domain-specific examples to increase its recall for the attack patterns that matter to you.

2. Gemini Flash model selection matters. My Critic Agent originally used gemini1.5flash, which frequently rated perfectly correct answers 4 / 5. For example:

How to build autonomous AI agent with Google A2A protocol

After switching to gemini2.0flash, the same answers were consistently rated 5 / 5:

How to build autonomous AI agent with Google A2A protocol

3. Cloud Shell storage is a bottleneck. Google Cloud Shell provides only 5 GB of disk space — far too little to build the Docker images required for this project, get all dependencies, and download the Llama Prompt Guard 2 model locally to deploy the Docker image with it to Google Vertex AI. Provision a dedicated VM with at least 30 GB instead.

Conclusion

Autonomous agents aren’t built by simply throwing the largest LLM at every problem. They require a system that can run safely without human babysitting. Double Validation — wrapping a task-oriented Processor Agent with dedicated input and output validators — delivers a balanced blend of safety, performance, and cost. 

Pairing a lightweight guard such as Llama Prompt Guard 2 with production friendly models like Gemma 3 and Gemini Flash keeps latency and budget under control while still meeting stringent security and quality requirements.

Join the conversation. What’s the biggest obstacle you encounter when moving autonomous agents into production — technical limits, regulatory hurdles, or user trust? How would you extend the Double Validation concept to high-risk domains like finance or healthcare?

Connect on LinkedIn: https://www.linkedin.com/in/alexey-tyurin-36893287/  

The complete code for this project is available at github.com/alexey-tyurin/a2a-double-validation

References

[1] Llama Prompt Guard 2 86M, https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M

[2] Google A2A protocol, https://github.com/google-a2a/A2A 

[3] Google Agent Development Kit (ADK), https://google.github.io/adk-docs/ 

Building & securing AI agents: A tech leader crash courseBuilding & securing AI agents: A tech leader crash course

Building & securing AI agents: A tech leader crash course

The AI revolution is racing beyond chatbots to autonomous agents that act, decide, and interface with internal systems.

Unlike traditional software, AI agents can be manipulated through language, making them vulnerable to attacks like prompt injection and they also introduce new security risks like excessive agency.

Join us for an exclusive deep dive with Sourabh Satish, CTO and co-founder at Pangea, as we explore the evolving landscape of AI agents and best practices for securing them.

This session covers:

  • Demos of MCP configuration and vulnerabilities to highlight how different architectures affect the agent’s attack surface.
  • An overview of existing security guardrails—from open source projects and cloud service provider offerings to commercial tools and DIY approaches.
  • A comparison of pros and cons across various guardrail solutions to help you choose the right approach for your use case.
  • Actionable best practices for implementing guardrails that secure your AI agents without slowing innovation.

This webinar is a must-attend for engineering leaders, AI engineers, and security leaders who want to understand and mitigate the risks of agentic software in an increasingly adversarial landscape.