“But can’t we just use Copilot instead?” an underwriter asked during a demo of our new gen AI-powered Insurance Underwriting platform. The application includes, among other features, a questionnaire covering topics relevant to assessing risk exposure for our specialty insurance clients.
Indeed, you could ask Copilot these questions directly—but that’s not the real value of our Insurance Underwriting platform. Our proposal goes beyond basic Q&A. The questionnaire is embedded in a broader analytical framework that combines quantitative and qualitative components. It begins with industry-average loss estimations (modeled using Exceedance Probability Curves, a standard in catastrophe modeling) and then drills down into the nuances and specificities of the company under analysis through our gen AI-enabled Scorecard.
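For readers unfamiliar with catastrophe modeling, an exceedance probability curve simply answers: what is the probability that annual losses meet or exceed a given threshold? The sketch below illustrates the idea on hypothetical simulated losses (the lognormal distribution and all parameters are illustrative assumptions, not our production model, which is event-based):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical simulated annual losses; lognormal is a common illustrative
# choice, while real catastrophe models simulate losses event by event.
losses = rng.lognormal(mean=13.0, sigma=1.2, size=10_000)

def exceedance_probability(losses, threshold):
    """Empirical probability that the annual loss meets or exceeds `threshold`."""
    losses = np.asarray(losses)
    return float((losses >= threshold).mean())

# The reciprocal of the exceedance probability is the "return period":
# roughly, how many years pass on average between losses of that size.
for threshold in [1e5, 1e6, 1e7]:
    ep = exceedance_probability(losses, threshold)
    if ep > 0:
        print(f"P(annual loss >= {threshold:>12,.0f}) = {ep:.4f}"
              f"  (return period ~{1 / ep:,.0f} yrs)")
```

Plotting this probability across all thresholds yields the EP curve that anchors the industry-average loss estimate before the Scorecard drills into company specifics.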
This combination delivers a thorough, in-depth review of key risk factors, producing in minutes what would normally take days of combing through exhaustive documentation.
“How Reliable is AI for This?”
another underwriter asked—a common and very valid concern. Mainstream AI models applied to insurance underwriting exhibit hallucination rates of 5% to 15%, which is particularly worrisome in an industry that demands certainty.
To better understand this risk in our context, we asked underwriters to complete the questionnaires using their expertise and judgment—the human element. We then compared their answers with the AI’s output. How often did they align? Were the AI’s responses reasonable?
After the first round, we saw room for improvement and devised a plan:
- Restrict sources: Our insurance underwriting LLM does not fetch answers from the open web. Underwriters must submit the documentation they consider relevant. This avoids unreliable sources like Quora or Reddit and ensures only whitelisted documents are used.
- Explicit fallback: Our master prompt instructs the LLM to respond with “Not available in the documentation provided” if the answer isn’t found. This simple measure significantly reduced hallucinations.
- Question variants: Each scorecard item has two to three alternative phrasings that trigger if the initial version fails. This improves response rates, especially given cultural and linguistic differences across countries and reports.
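The three measures above compose naturally into a single retrieval-and-answer loop. The sketch below is a minimal illustration of that pattern, not our actual implementation; the `ask_llm` helper and all names are hypothetical stand-ins for whatever LLM client is in use:

```python
# Hypothetical sketch of the three guardrails: whitelisted sources only,
# an explicit fallback phrase, and alternative question phrasings.

FALLBACK = "Not available in the documentation provided"

MASTER_PROMPT = (
    "Answer strictly from the documents below. If the answer is not "
    f"present, reply exactly: \"{FALLBACK}\".\n\n"
    "{documents}\n\nQuestion: {question}"
)

def answer_scorecard_item(ask_llm, whitelisted_docs, question_variants):
    """Try each phrasing in turn; only submitted, whitelisted docs reach the model."""
    context = "\n\n".join(whitelisted_docs)            # restrict sources
    for question in question_variants:                 # question variants
        prompt = MASTER_PROMPT.format(documents=context, question=question)
        reply = ask_llm(prompt).strip()
        if reply != FALLBACK:                          # explicit fallback
            return reply
    return FALLBACK
```

If every phrasing comes back with the fallback phrase, the item is surfaced to the underwriter rather than guessed at, which is the behavior that drove the reduction in hallucinations.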
After applying these measures and repeating the comparison exercise, accuracy (human/AI agreement) and response rates increased sharply, with minimal hallucinations.
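The two metrics from the comparison exercise are straightforward to compute. A small sketch on made-up data (the answer pairs below are illustrative, not our evaluation set):

```python
# Hypothetical evaluation: compare underwriter answers with AI output.
FALLBACK = "Not available in the documentation provided"

def evaluate(pairs):
    """pairs: list of (human_answer, ai_answer) tuples for the same questions."""
    answered = [p for p in pairs if p[1] != FALLBACK]
    response_rate = len(answered) / len(pairs)          # AI gave some answer
    agreement = (sum(h == a for h, a in answered) / len(answered)
                 if answered else 0.0)                  # human/AI alignment
    return response_rate, agreement

pairs = [("Yes", "Yes"), ("No", "Yes"), ("Yes", FALLBACK)]
rate, agree = evaluate(pairs)
print(f"response rate {rate:.0%}, agreement {agree:.0%}")
```

Tracking both numbers separately matters: the fallback phrase deliberately trades response rate for accuracy, so a rising agreement rate alongside a healthy response rate is the signal that the guardrails are working.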
“And if I disagree with the AI output?”
came the final question. This is central to our design philosophy. At all times, underwriters can override and modify AI outputs, adding comments, reasoning, and justification. Even final scores can be adjusted, ensuring the ultimate assessment remains theirs. They keep their hands firmly on the steering wheel—and that’s what I’m proudest of in our implementation.
We view AI as a tool, an enabler. But it is the human who makes the decision and has the final say. The underwriters left the Insurance Underwriting platform demo session excited, and I look forward to reconnecting with them in a couple of weeks to hear their feedback.
Read more about how Evalueserve utilized gen AI to boost a client's cyber underwriting
Talk to One of Our Experts
Get in touch today to find out how Evalueserve can help you improve your processes, making you better, faster, and more efficient.