Chatbot Testing Questions

What is chatbot testing?

Chatbot testing validates that an AI chatbot produces the expected output for a series of test prompts. The testing effort ensures that the chatbot delivers the promised user experience in a performant and secure manner. Without automation, this is generally very difficult and extremely time consuming. For example, you might want to test that a customer-facing healthcare chatbot can accurately answer questions about how to alleviate symptoms of the common cold, but also stays in line with its responsibilities and won’t try to diagnose your users with diseases. Testing is performed to:

  • Ensure that a chatbot generates relevant information that is correct and useful to the user.
  • Verify that the chatbot’s performance is acceptable when a bot is initially released.
  • Validate that a new chatbot release is not providing sub-standard response times compared to performance baselines established in prior releases.
  • Certify that the chatbot is secure and that the necessary guardrails are in place to prevent responses that contain harmful or confidential information.

How do I test a chatbot?

For effective chatbot testing, it is important to evaluate the chatbot using well-designed test cases and a repeatable process. In order to achieve this goal:

  • Test cases must be thoughtfully defined and organized into test suites.
  • A testing baseline must be established for both expected functionality and chatbot performance. The bot response for each round of testing is evaluated against this established baseline.
  • The full suite of test cases must be executed after every change made to the chatbot to ensure that there is no negative impact on user experience.
  • A separate chatbot test environment must be created and maintained so that full regression test cycles can be applied to the chatbot after a change without impacting end users.
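As a minimal sketch, the repeatable process above might look like the following, where `ask_chatbot` and `is_equivalent` are hypothetical stand-ins for your bot client and your baseline-comparison logic:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    prompt: str
    baseline: str  # expected response established in a prior round of testing

@dataclass
class TestResult:
    case: TestCase
    response: str
    passed: bool

def run_suite(suite, ask_chatbot, is_equivalent):
    """Execute every test case in the suite against the chatbot under test
    and evaluate each response against the established baseline."""
    results = []
    for case in suite:
        response = ask_chatbot(case.prompt)
        results.append(TestResult(case, response, is_equivalent(response, case.baseline)))
    return results
```

Running the full regression suite after every chatbot change then reduces to a single call to `run_suite` in the dedicated test environment.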

What are the best practices for chatbot testing?

A system that successfully tests AI chatbots must:

  • Ensure that a chatbot generates relevant information that is correct and useful to the user. This represents the functional testing component of chatbot validation. A test case is built for this type of testing for each expected class of user input.
  • Verify that the chatbot’s performance is acceptable when a bot is initially released.
  • Validate that a new chatbot release is not providing sub-standard response times compared to performance baselines established in prior releases.
  • Perform security testing to verify that the chatbot is secure and that the necessary guardrails are in place to prevent responses that contain harmful or confidential information.

How should chatbot test cases be defined?

AI chatbot test cases require a different approach from traditional UI testing because the LLMs that are used to power these types of bots typically do not provide identical responses to a prompt within a chatbot conversation. In order to build effective chatbot test cases:

  • A large language model (LLM) should be used to evaluate chatbot conversations to determine if responses to user questions are effectively equivalent to the baseline, even when there are small variances in the response.
  • The testing system must allow for multiple baselines to be established to handle scenarios where there are multiple valid answers to a question.
  • The system must provide a mechanism for a human reviewer to provide feedback to the error management system. For example, a user may want to flag a response that was initially evaluated to be an error and convert it into a valid test baseline.
  • Each test case must support multiple variants since there are often many ways to form the same question using language. For example, a customer support system should provide essentially the same answers to the questions “How do I use product X” and “I don’t understand how product X should be used”.

Why is automation essential to software testing for chatbots?

Optimization and security verification are essential for any production-grade AI chatbot solution. For a complex enterprise deployment, this effort is impossible to sustain without automation.

  • Every small change in the chatbot prompting instructions requires a full regression test of all chatbot functionality and a re-validation of security guardrails. This continuous testing effort cannot be supported at scale with only manual testers.
  • Conversational AI chatbots in the enterprise are typically only one component of an automation solution. For example, a virtual assistant deployed in an enterprise may rely on models based on in-house training data and leverage APIs to an internal e-commerce system or other machine learning models. When any changes occur to connected systems, a complete end-to-end regression is required to validate that the chatbot satisfies all critical use cases. This type of extensive testing is only possible through automation.
  • The requirement for supporting multiple baselines to evaluate the variable nature of chatbot responses correctly is difficult to handle without automation, as each tester must be familiarized with a broad spectrum of responses that are considered valid.
  • Testing for security guardrail enforcement involves carefully crafted test cases based on knowledge of the vulnerabilities of state-of-the-art language models. These skills are beyond the scope of the testing services provided in most organizations.

How do traditional automation testing tools fall short when testing AI chatbots?

Traditional automation testing tools or automation providers leverage a set of techniques and technologies that fall short of the requirements for the testing of chatbots. These shortcomings include:

  • Selenium or similar technologies record and replay tests for automated test execution. While this approach works well for testing traditional user interfaces, it falls short when facing the varied formats of chatbot responses in a conversational flow.
  • Each test scenario is defined based on input steps and a specific expected output. Since LLM-based chatbots produce variability in their responses, testing chatbots requires a valid semantic comparison of results against multiple acceptable benchmarks.
  • Chatbot development necessitates a highly iterative process where LLM prompts are continuously modified until an optimal overall solution is achieved. Traditional test automation tools are incapable of supporting the extensive A/B testing that is required to streamline this effort.

How is chatbot testing different from traditional application testing or traditional automation testing?

AI chatbots, powered by ChatGPT or other natural language models, provide a user-friendly experience for people seeking information or initiating actions. However, they pose several unique testing challenges:

  • Unlike traditional applications where all logic is defined with explicit code, the conversational artificial intelligence used in chatbots uses NLP (natural language processing) instructions that are intrinsically vague (think of how you might try to measure conversational flow or conversational tone).
  • Even small changes in the prompt for the LLM can cause dramatic changes in behavior. It is challenging to validate that a prompt change intended to solve a specific issue is not causing unintended problems with other responses.
  • While traditional web UIs provide a finite set of possibilities for user interactions, AI chatbots are based on generative natural language processing and allow for infinite variations in conversation flow.
  • Unlike traditional UI applications where user intent is clear based on the action performed in the UI, chatbots must correctly infer user intent from the conversation.

What tests should be included in an artificial intelligence chatbot test suite?

Three primary areas of testing should be covered in an effective chatbot test suite in addition to the standard developer-maintained validation unit tests that are executed as part of the DevOps deployment pipeline for the bot:

  • Functionality Testing: Verify the chatbot's ability to perform all intended actions to maintain the expected level of functionality and usability. Test for correct responses to various inputs, including the support of multi-pass conversational flow required for a good customer experience. Check the chatbot's handling of invalid or unexpected user inputs.
  • Security Testing: Evaluate adherence to critical data security and privacy standards. Test for vulnerabilities to hacking or unauthorized access initiated through responses to user queries. Verify the handling of sensitive user information to ensure compatibility with internal data governance standards.
  • Performance Testing: Evaluate the performance of the entire test suite compared to established SLAs and prior production performance benchmarks. Identify test cases that perform significantly slower than in prior chatbot versions. Perform load testing to ensure acceptable performance is sustained while the chatbot is under the expected levels of peak concurrency.
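The load-testing portion of the performance area can be sketched as follows, again assuming a hypothetical `ask_chatbot` client, with a thread pool simulating the expected peak concurrency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(ask_chatbot, prompts, concurrency=10):
    """Send prompts to the chatbot with `concurrency` simultaneous callers
    and return per-request response times in seconds."""
    def timed(prompt):
        start = time.perf_counter()
        ask_chatbot(prompt)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, prompts))
```

The resulting latency samples can then be compared against the established SLAs and prior production performance benchmarks.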

What are the metrics that measure the success of the chatbot testing process?

Key metrics that should be measured in automated chatbot testing include:

  • Success rate of individual test cases as measured through a semantic evaluation against all valid test baselines. The total number of failed test cases should also be reported.
  • Overall test suite success rate compared to test suite results collected against the current production system.
  • Response time distribution for cases in a test suite.
  • Number and percentage of test cases whose response time fails to meet current production SLAs or a production performance baseline.
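As a minimal sketch, these metrics could be computed from a completed test run as follows, where `results` is a list of per-case pass/fail booleans and `response_times` holds the latency samples in seconds:

```python
import statistics

def suite_metrics(results, response_times, sla_seconds):
    """Summarize a chatbot test run: pass/fail counts, response time
    distribution, and the share of cases breaching the SLA."""
    passed = sum(1 for r in results if r)
    over_sla = sum(1 for t in response_times if t > sla_seconds)
    return {
        "passed": passed,
        "failed": len(results) - passed,
        "success_rate": passed / len(results),
        "median_latency": statistics.median(response_times),
        "p95_latency": statistics.quantiles(response_times, n=20)[-1],
        "pct_over_sla": over_sla / len(response_times),
    }
```

Comparing `success_rate` from a candidate release against the same figure from the current production system gives the overall suite comparison described above.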