Over the past few months, I attended more than 10 AI conferences, including PAKcon, the AIAI Summit, the AI & Data Summit, the Trustworthy AI Forum, and AICon. Engaging with AI engineers from large and medium-sized companies, as well as consultants specializing in generative AI (Gen AI), provided valuable insights into the current challenges of moving Large Language Model (LLM) applications from proof of concept (PoC) to production. This post synthesizes those insights and introduces a framework for understanding the key obstacles organizations face.
While every conference approached AI from different perspectives, a common theme emerged: many promising Gen AI projects fail to make it beyond the PoC stage. Three major hurdles surfaced repeatedly in discussions—AI governance, AI regulation, and AI evaluation. These three dimensions form what I call the Production Trifecta: the interconnected elements that determine whether a Gen AI application can scale successfully.
Many AI practitioners expressed frustration over the absence of clear governance structures within their organizations. Without well-defined processes and requirements, ensuring responsible development and scalable deployment becomes difficult. Key governance concerns include:
Organizations that fail to establish robust governance frameworks risk developing Gen AI applications that cannot be properly assessed, refined, or trusted.
The regulatory environment for AI is evolving rapidly, creating uncertainty for companies attempting to scale their applications internationally. The OECD now tracks more than 1,000 AI-related policy initiatives worldwide, and frameworks like the EU AI Act introduce significant compliance challenges:
Missteps in regulatory compliance can prevent an AI system from being legally deployed, even if it is technically sound.
The most pressing technical challenge discussed at the conferences was the evaluation of LLM applications. Unlike traditional software, Gen AI applications do not follow deterministic rules, making standard testing approaches inadequate. Evaluation is further complicated by:
To address these issues, evaluation frameworks must evolve beyond traditional software testing methodologies.
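To make that concrete, here is a minimal sketch of what moving beyond exact-match testing can look like: instead of asserting that an output equals one expected string, the output is scored against several acceptable references with a similarity threshold. The lexical similarity used below is only a dependency-free stand-in; production setups typically use embedding similarity or an LLM-as-judge, and all names and values here are illustrative.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap lexical similarity stand-in; real pipelines usually rely on
    embedding distance or an LLM-as-judge score instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def passes(output: str, acceptable_answers: list[str], threshold: float = 0.8) -> bool:
    """An output passes if it is close enough to any acceptable reference,
    rather than exactly equal to a single expected string."""
    return any(similarity(output, ref) >= threshold for ref in acceptable_answers)

# Hypothetical example: several valid phrasings of the same correct answer.
references = [
    "Delivery usually takes three to five business days.",
    "Your order will arrive within 3 to 5 business days.",
]
print(passes("Delivery usually takes three to five business days!", references))  # True
```

The point of the sketch is the shape of the check, not the scoring function: whatever scorer is used, the test asserts behavior within a tolerance rather than a single deterministic string.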
While governance and regulation are necessary conditions for production-readiness, effective evaluation is what ensures that Gen AI applications are robust, reliable, and compliant.
Core Test Dimensions
As a quick fix, many organizations rely on a "golden data set" for evaluation during the PoC stage, testing on a few hundred curated cases. However, this approach is insufficient given the variability of Gen AI outputs. A more comprehensive evaluation strategy is required.
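As a hedged illustration of why a static golden set underestimates this variability, the sketch below expands each curated case with a few mechanical input variants and measures the pass rate across all of them. `generate` stands in for whatever function calls the deployed LLM application, and the cases, checks, and perturbations are all hypothetical.

```python
# Minimal sketch (not a specific framework): stress-test a static "golden" set
# by expanding each curated case with simple input variants, since a model that
# passes the original phrasing may fail a paraphrase of it.
golden_set = [
    {"input": "How do I reset my password?",
     "must_contain": "reset link"},  # hypothetical curated case
]

def variants(text: str) -> list[str]:
    """Cheap, deterministic perturbations; real test sets would also add
    human-written paraphrases, typos, and adversarial rewordings."""
    return [
        text,
        text.lower(),
        text.replace("?", "??"),
        "Hi there! " + text,
    ]

def evaluate(generate, cases) -> float:
    """`generate` is a placeholder for whatever calls the LLM application."""
    results = []
    for case in cases:
        for prompt in variants(case["input"]):
            output = generate(prompt)
            results.append(case["must_contain"].lower() in output.lower())
    return sum(results) / len(results)  # pass rate across all variants
```

A golden set that scores well on its original phrasings but drops sharply under even these trivial perturbations is a strong signal that the evaluation, not just the model, needs more work.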
One recurring theme across conferences was the lack of high-quality test and evaluation sets for Gen AI applications. Organizations often struggle to define what constitutes a meaningful test case. Effective evaluation requires a dynamic, collaborative, and context-specific approach to building test sets.
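One way to make that collaboration concrete, purely as an illustration, is to give every test case an explicit, shared structure that both engineers and domain experts can fill in. The schema and field names below are assumptions made for this sketch, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """Illustrative schema for a Gen AI test case; fields are examples only."""
    case_id: str
    user_input: str
    expected_behavior: str                      # described in plain language by a domain expert
    unacceptable_behaviors: list[str] = field(default_factory=list)
    category: str = "routine"                   # e.g. "routine", "edge", "adversarial"
    domain: str = "general"                     # e.g. "finance", "healthcare", "insurance"
    author: str = "unknown"                     # who contributed the case

example = TestCase(
    case_id="ins-017",
    user_input="Can you guarantee my claim will be approved?",
    expected_behavior="Explains it cannot guarantee outcomes and points to the claims process.",
    unacceptable_behaviors=["promises approval", "quotes a payout amount"],
    category="edge",
    domain="insurance",
    author="claims-domain-expert",
)
```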
One of the major challenges in testing Gen AI systems is ensuring that test sets cover a broad spectrum of input scenarios. These should go beyond what is expected from typical users and account for edge cases, adversarial inputs, and atypical behaviors. This can require collaboration between AI engineers and domain experts from fields such as finance, healthcare, and insurance, who can help identify the nuances of real-world applications.
For example, domain experts bring insights into how AI systems should behave in their respective industries, helping design tests that go beyond conventional user interactions. These collaborations ensure that the test cases are not only diverse but also representative of complex, real-world scenarios that AI systems may face. As a result, test sets can cover everything from routine use cases to rare edge cases that could significantly impact performance.
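A simple way to check whether a test set actually spans those scenario types is to audit how many cases fall into each category and flag the gaps. The sketch below assumes each case carries a `category` tag as in the schema above; the minimum counts are illustrative.

```python
from collections import Counter

# Minimal coverage audit over a test set; thresholds are illustrative, not prescriptive.
REQUIRED_CATEGORIES = {"routine": 50, "edge": 20, "adversarial": 20}

def coverage_gaps(test_set: list[dict]) -> dict[str, int]:
    """Return how many cases are still missing per scenario category."""
    counts = Counter(case.get("category", "routine") for case in test_set)
    return {
        cat: minimum - counts.get(cat, 0)
        for cat, minimum in REQUIRED_CATEGORIES.items()
        if counts.get(cat, 0) < minimum
    }

test_set = [{"category": "routine"}] * 60 + [{"category": "edge"}] * 5
print(coverage_gaps(test_set))  # {'edge': 15, 'adversarial': 20}
```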
Test sets for Gen AI applications need to be tailored to the specific industry and use case at hand. Unlike traditional software, where generic tests may suffice, the unique nature of Gen AI systems requires test cases that reflect the context in which the system will be deployed.
In industries like healthcare, where AI-driven diagnostics might be used, tests must ensure that the system adheres to strict ethical and legal standards while maintaining high accuracy. In financial services, tests may need to evaluate transparency and fairness in decision-making, particularly in areas like lending or insurance underwriting. This level of specificity in test cases demands close collaboration between AI engineers and domain experts who understand the sector’s regulatory requirements, customer expectations, and operational risks. By incorporating this domain knowledge into the test set creation process, organizations can ensure that their Gen AI applications are not only technically sound but also ethically and legally compliant.
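One possible way to encode that domain knowledge, sketched here purely as an illustration, is a per-domain rubric of criteria that every test case in that sector is scored against. The criteria below are examples of the kinds of checks experts might require, not an authoritative compliance list.

```python
# Illustrative per-domain rubrics; how each criterion is scored (human review,
# rules, or an LLM-as-judge) is a separate design choice.
DOMAIN_RUBRICS = {
    "healthcare": [
        "does not present output as a medical diagnosis",
        "recommends consulting a clinician for treatment decisions",
        "cites the guideline or source it relied on",
    ],
    "finance": [
        "explains the main factors behind a lending or underwriting decision",
        "avoids using protected attributes in its reasoning",
        "includes the disclaimers the business requires",
    ],
}

def criteria_for(domain: str) -> list[str]:
    """Look up the rubric a test case in this domain must be scored against."""
    return DOMAIN_RUBRICS.get(domain, [])

for criterion in criteria_for("healthcare"):
    print("-", criterion)
```

Keeping the rubric separate from the test cases means domain experts can revise the criteria as regulations or internal policies change, without rewriting the test set itself.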
For AI applications to successfully transition from PoC to production, organizations must simultaneously address governance, regulation, and evaluation. Establishing governance frameworks ensures responsible development, regulatory awareness prevents compliance roadblocks, and comprehensive evaluation methodologies validate Gen AI behaviors.
By proactively tackling these challenges, companies can unlock the full potential of Gen AI while ensuring their applications are reliable, trustworthy, and legally compliant. The road to production may be complex, but with the right strategies, it is entirely navigable.