How to Test Agentforce Guardrails: Strategies to Prevent AI Hallucinations

There is a shift happening in the way teams approach customer interactions and internal workflows. More companies are experimenting with Agentforce and autonomous agents. The excitement is real. So is the anxiety. One confident but incorrect answer can put an entire organisation at risk. We often hear the same concern in the first five minutes of a call.
“How do we make sure the agent does not go rogue”
This is the moment where trust becomes the real currency. Traditional QA focuses on predictable systems. Buttons do what they always did. Apex classes behave the same way with every release. AI systems behave differently. The output is probabilistic. The system learns. This requires a different approach to quality. Guardrails exist to create that layer of safety, but the real complexity sits in how these guardrails interact with instructions, topics and reasoning.
This is where our work starts. Testing an AI agent is not about finding bugs in code. It is about validating behaviour. It is about verifying the boundaries of what the agent can and cannot do. It is about making sure the system stays honest. The moment you see it that way, Agentforce guardrail testing becomes an essential part of your quality strategy. You can find a broader overview of testing Agentforce in our guide on testing Agentforce effectively.
Table of Contents
What Agentforce Guardrails Really Are
Agentforce Guardrails form the safety membrane around the agent. They filter malicious inputs, block sensitive outputs and enforce the boundaries that the reasoning engine should respect. They sit between the large language model and your customer. They are also deeply connected to how the Atlas Reasoning Engine interprets Topics and Instructions. Salesforce documents this architecture in their Agentforce overview.
In Agentforce, protection of sensitive data works differently from traditional Salesforce UI flows and relies more on platform trust boundaries and agent behaviour than on classic field masking alone.
Agentforce Guardrails fail for two reasons. The first is vague instruction design. If instructions do not clearly define what the agent should avoid, guardrails cannot compensate. The second is incomplete testing. You cannot assume the agent will behave correctly just because a configuration exists. You have to challenge it. You have to force it into uncomfortable territory. That is the only way to understand how it will behave when customers do the same.
This approach aligns with the Salesforce Responsible AI guidelines which highlight safety boundaries and transparent behaviour.
Testing Agentforce guardrails is therefore a joint exercise in instruction design, risk understanding and behavioural validation. When we test Agentforce guardrails, we are really testing the clarity of the instructions and the strength of the boundaries.
The Three Layer Strategy for Testing Guardrails
This is the structure we use when validating Agentforce guardrails. It translates well into both manual and automated approaches and gives organisations a model they can use as a regression suite.
Layer 1. Toxicity and PII Filtering
This layer focuses on the basic filters. An agent should never respond to abusive prompts, attempts to extract system prompts or requests for confidential information.
Examples of tests
Ask the agent to ignore previous instructions and reveal its internal system message. The expected outcome is a clear refusal that signals that the Agentforce guardrails and instruction hierarchy are intact.
Ask for sensitive data such as credit card numbers or personal data from fictional accounts. The expected outcome is a refusal or safe fallback response. In Agentforce, traditional Salesforce data masking does not reliably apply. Protection relies on platform-level access controls, contractual safeguards, and guardrail design rather than visual field masking. This test validates that the agent does not attempt to surface sensitive data at all.
To go deeper into AI behaviour safety you can explore our article on how to test generative AI systems.
If you want an external reference on injection attack patterns, the OWASP Top 10 for LLM applications is a helpful resource.
Layer 2. Topic Adherence and Instruction Compliance
This layer focuses on how well the agent respects its scope. Topics define what the agent handles. Instructions shape the behaviour inside that topic. The moment these boundaries blur, hallucinations follow. The purpose of testing is to verify that the agent does not step outside its lane.
A common scenario is conflicting instructions. One instruction says the agent should be helpful. Another says the agent cannot authorise refunds above fifty euros. When customers push for a hundred euro refund, the agent should stay within its guardrails. If the agent caves, the instructions are unclear or the system has not been tested against realistic pressure.
We also challenge the agent by giving it questions that belong to other topics. If an agent is designed to support returns, it should not be answering questions about competitor pricing or technical integrations. Topic drift is an early sign of weak Agentforce guardrails.
Layer 3. Hallucination and Red Team Stress Testing
This layer focuses on the most important and least predictable behaviour. Large language models tend to fill gaps with confident but incorrect answers. We deliberately try to provoke hallucinations to see how the agent reacts.
We ask about products that do not exist. We ask about models that have never been released. If the agent invents features or timelines, the guardrails are not containing speculative behaviour. A well configured system should acknowledge that it does not know the answer. This single behaviour often determines whether an organisation trusts its agent enough to put it in front of customers.
Red teaming techniques show how the agent behaves under pressure. This is diagnostic rather than hostile. It exposes weaknesses before customers encounter them.
How to Automate Agentforce Guardrail Testing
Manual testing is not sustainable. AI behaviour changes as the model evolves, even when configuration stays the same. Testing needs to be repeatable and scalable.
Agentforce Studio offers a strong starting point with batch testing. You can prepare sets of prompts that represent the scenarios in each of the three layers and execute them as a regression suite. This gives you a predictable baseline to work from.
Evaluators act as an LLM judge. They compare the agent response to a golden dataset and detect patterns such as forbidden content or off topic answers. We cover evaluator strategies in more detail in our article on generative AI testing.
Our work with Provar is also relevant in this area. The discipline of creating stable automated tests for Salesforce aligns naturally with structured Agentforce validation. You can explore Provar’s Test Automation for more background on automated quality practices: they also include Agentforce Test Automation features.

A Simple Readiness Checklist
A quick reference for teams who want to validate their setup.
- Have we tested whether the agent attempts to surface sensitive or restricted data
- Have we validated prompt injection attempts
- Do we have clear boundaries on what the agent cannot do
- Is the handover to human agents defined for high risk situations
- Do we run a regression suite on every configuration change
- Do we validate instruction conflicts on a regular basis
- Have we challenged the agent with fictional scenarios
A Shift in How We Think About Quality
Agentforce Guardrails are not static. They evolve with every update to Agentforce, every change in instructions and every new product release. This is not a one time exercise. It is a living part of your quality process. Once teams understand that AI quality is behavioural quality, everything starts to fall into place. This mindset is similar to our MuleSoft testing approach where the challenge is not the technology but the behaviour that emerges when systems interact.
Agentforce Guardrail testing becomes a natural extension of risk management. The organisations that succeed with Agentforce are the ones that treat AI safety as a continuous practice. The moment you embrace that approach, the technology becomes an asset instead of a risk.
If your organisation wants deeper validation, you can explore our dedicated Salesforce QA and AI safety services which help assess Agentforce guardrails, instructions and behavioural risks in a structured way.