Large language models (LLMs) are increasingly being used to evaluate clinical documentation against utilization management (UM) criteria. Through real-world examples, we will explore how naïve prompting of general-purpose foundation models often produces incorrect or inconsistent determinations. By applying physician-developed, context-informed prompts grounded in a curated knowledge base, AI output becomes more accurate and better aligned with clinical guidelines, offering a path toward safe and trustworthy integration into UM workflows.
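
The pattern described above, pairing physician-authored prompt structure with a curated knowledge base, might be sketched roughly as follows. Everything here is a hypothetical illustration: the criteria, identifiers, service names, and prompt wording are invented for the example and are not actual clinical guidelines or the authors' system.

```python
# A minimal, hypothetical sketch of "context-informed" prompting for UM review.
# All criteria, names, and the prompt template are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Criterion:
    id: str
    text: str


# Toy curated knowledge base: guideline criteria keyed by requested service.
KNOWLEDGE_BASE = {
    "lumbar_mri": [
        Criterion("LBP-1", "Low back pain persisting >= 6 weeks despite conservative therapy"),
        Criterion("LBP-2", "Progressive neurologic deficit documented on exam"),
    ],
}


def build_um_prompt(service: str, clinical_note: str) -> str:
    """Assemble a prompt that pins the model to explicit guideline criteria
    instead of inviting an open-ended medical-necessity judgment."""
    criteria = KNOWLEDGE_BASE.get(service, [])
    criteria_block = "\n".join(f"- [{c.id}] {c.text}" for c in criteria)
    return (
        "You are assisting a utilization management reviewer.\n"
        "Evaluate the clinical note ONLY against the criteria below.\n"
        "For each criterion, answer MET / NOT MET / INSUFFICIENT DOCUMENTATION,\n"
        "citing the exact sentence from the note that supports your answer.\n"
        "Do not render a final coverage decision; a physician makes that call.\n\n"
        f"Criteria:\n{criteria_block}\n\n"
        f"Clinical note:\n{clinical_note}\n"
    )


prompt = build_um_prompt("lumbar_mri", "8 weeks of low back pain; PT completed without relief.")
print(prompt)
```

The key design choice this sketch illustrates is constraining the model to enumerate evidence per criterion rather than asking for a single approve/deny verdict, which is one way naïve prompting can be tightened before any output reaches a workflow.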