The US Department of Defense’s (DoD) Chief Digital and Artificial Intelligence Office (CDAO) has concluded a pilot program under its Crowdsourced AI Red-Teaming (CAIRT) Assurance initiative. The pilot tested large language model (LLM) chatbots for military medical applications, with the aim of identifying vulnerabilities and ensuring responsible AI use within the DoD.
Conducted in collaboration with the Defense Health Agency (DHA) and the Program Executive Office, Defense Healthcare Management Systems (PEO DHMS), the pilot was led by Humane Intelligence, a technology firm specializing in algorithmic evaluations. The exercise explored two prospective use cases: clinical note summarization and a medical advisory chatbot.
More than 200 participants, including clinicians and healthcare analysts from the DHA, Uniformed Services University of the Health Sciences, and other military services, took part in the exercise. They uncovered more than 800 potential vulnerabilities and biases across three popular LLMs. These findings will inform benchmark datasets for evaluating future AI vendors against performance and security expectations.
“This program acts as an essential pathfinder for generating a mass of testing data, surfacing areas for consideration, and validating mitigation options,” said Dr. Matthew Johnson, the CDAO lead for the initiative. The findings are expected to shape DoD policies and best practices for employing Generative AI (GenAI) in military medicine, enhancing mission effectiveness while meeting the risk management requirements of Office of Management and Budget (OMB) memorandum M-24-10.
The CAIRT program employs crowdsourcing and red-teaming methodologies to stress-test AI systems for vulnerabilities. Red-teaming uses adversarial techniques to identify system weaknesses and has previously been applied in other CDAO initiatives, including a spring 2024 financial bounty exercise.