Over the past few months, I attended more than 10 AI conferences, including PAKcon, the AIAI Summit, the AI & Data Summit, the Trustworthy AI Forum, and AICon. Engaging with AI engineers from large and medium-sized companies, as well as consultants specializing in generative AI (Gen AI), provided valuable insights into the current challenges of moving Large Language Model (LLM) applications from proof of concept (PoC) to production. This post synthesizes those insights and introduces a framework for understanding the key obstacles organizations face.
While every conference approached AI from different perspectives, a common theme emerged: many promising Gen AI projects fail to make it beyond the PoC stage. Three major hurdles surfaced repeatedly in discussions—AI governance, AI regulation, and AI evaluation. These three dimensions form what I call the Production Trifecta: the interconnected elements that determine whether a Gen AI application can scale successfully.
Many AI practitioners expressed frustration over the absence of clear governance structures within their organizations. Without well-defined processes and requirements, ensuring responsible development and scalable deployment becomes difficult. Key governance concerns include:
Organizations that fail to establish robust governance frameworks risk developing Gen AI applications that cannot be properly assessed, refined, or trusted.
The regulatory environment for AI is evolving rapidly, creating uncertainty for companies attempting to scale their applications internationally. The OECD now tracks more than 1,000 AI-related policy initiatives worldwide, and frameworks like the EU AI Act introduce significant compliance challenges:
Missteps in regulatory compliance can prevent an AI system from being legally deployed, even if it is technically sound.
The most pressing technical challenge discussed at the conferences was the evaluation of LLM applications. Unlike traditional software, Gen AI applications do not follow deterministic rules, making standard testing approaches inadequate. Evaluation is further complicated by:
To address these issues, evaluation frameworks must evolve beyond traditional software testing methodologies.
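To make that concrete, here is a minimal sketch of what moving beyond exact-match testing can look like: instead of asserting that an output equals one expected string, the output is scored against several acceptable references with a similarity threshold. The lexical similarity used below is only a dependency-free stand-in; production setups typically use embedding similarity or an LLM-as-judge, and all names and values here are illustrative.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap lexical similarity stand-in; real pipelines usually rely on
    embedding distance or an LLM-as-judge score instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def passes(output: str, acceptable_answers: list[str], threshold: float = 0.8) -> bool:
    """An output passes if it is close enough to any acceptable reference,
    rather than exactly equal to a single expected string."""
    return any(similarity(output, ref) >= threshold for ref in acceptable_answers)

# Hypothetical example: several valid phrasings of the same correct answer.
references = [
    "Delivery usually takes three to five business days.",
    "Your order will arrive within 3 to 5 business days.",
]
print(passes("Delivery usually takes three to five business days!", references))  # True
```

The point of the sketch is the shape of the check, not the scoring function: whatever scorer is used, the test asserts behavior within a tolerance rather than a single deterministic string.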
While governance and regulation are necessary conditions for production-readiness, effective evaluation is what ensures that Gen AI applications are robust, reliable, and compliant.
Core Test Dimensions
As a quick fix, many organizations rely on a "golden data set" for evaluation during the PoC stage, testing on a few hundred curated cases. However, this approach is insufficient given the variability of Gen AI outputs. A more comprehensive evaluation strategy is required.
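As a hedged illustration of why a static golden set underestimates this variability, the sketch below expands each curated case with a few mechanical input variants and measures the pass rate across all of them. `generate` stands in for whatever function calls the deployed LLM application, and the cases, checks, and perturbations are all hypothetical.

```python
# Minimal sketch (not a specific framework): stress-test a static "golden" set
# by expanding each curated case with simple input variants, since a model that
# passes the original phrasing may fail a paraphrase of it.
golden_set = [
    {"input": "How do I reset my password?",
     "must_contain": "reset link"},  # hypothetical curated case
]

def variants(text: str) -> list[str]:
    """Cheap, deterministic perturbations; real test sets would also add
    human-written paraphrases, typos, and adversarial rewordings."""
    return [
        text,
        text.lower(),
        text.replace("?", "??"),
        "Hi there! " + text,
    ]

def evaluate(generate, cases) -> float:
    """`generate` is a placeholder for whatever calls the LLM application."""
    results = []
    for case in cases:
        for prompt in variants(case["input"]):
            output = generate(prompt)
            results.append(case["must_contain"].lower() in output.lower())
    return sum(results) / len(results)  # pass rate across all variants
```

A golden set that scores well on its original phrasings but drops sharply under even these trivial perturbations is a strong signal that the evaluation, not just the model, needs more work.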
One recurring theme across conferences was the lack of high-quality test and evaluation sets for Gen AI applications. Organizations often struggle to define what constitutes a meaningful test case. Effective evaluation requires a dynamic, collaborative, and context-specific approach to building test sets.
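One way to make that collaboration concrete, purely as an illustration, is to give every test case an explicit, shared structure that both engineers and domain experts can fill in. The schema and field names below are assumptions made for this sketch, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """Illustrative schema for a Gen AI test case; fields are examples only."""
    case_id: str
    user_input: str
    expected_behavior: str                      # described in plain language by a domain expert
    unacceptable_behaviors: list[str] = field(default_factory=list)
    category: str = "routine"                   # e.g. "routine", "edge", "adversarial"
    domain: str = "general"                     # e.g. "finance", "healthcare", "insurance"
    author: str = "unknown"                     # who contributed the case

example = TestCase(
    case_id="ins-017",
    user_input="Can you guarantee my claim will be approved?",
    expected_behavior="Explains it cannot guarantee outcomes and points to the claims process.",
    unacceptable_behaviors=["promises approval", "quotes a payout amount"],
    category="edge",
    domain="insurance",
    author="claims-domain-expert",
)
```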
One of the major challenges in testing Gen AI systems is ensuring that test sets cover a broad spectrum of input scenarios. These should go beyond what is expected from typical users and account for edge cases, adversarial inputs, and atypical behaviors. This can require collaboration between AI engineers and domain experts from fields such as finance, healthcare, and insurance, who can help identify the nuances of real-world applications.
For example, domain experts bring insights into how AI systems should behave in their respective industries, helping design tests that go beyond conventional user interactions. These collaborations ensure that the test cases are not only diverse but also representative of complex, real-world scenarios that AI systems may face. As a result, test sets can cover everything from routine use cases to rare edge cases that could significantly impact performance.
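A simple way to check whether a test set actually spans those scenario types is to audit how many cases fall into each category and flag the gaps. The sketch below assumes each case carries a `category` tag as in the schema above; the minimum counts are illustrative.

```python
from collections import Counter

# Minimal coverage audit over a test set; thresholds are illustrative, not prescriptive.
REQUIRED_CATEGORIES = {"routine": 50, "edge": 20, "adversarial": 20}

def coverage_gaps(test_set: list[dict]) -> dict[str, int]:
    """Return how many cases are still missing per scenario category."""
    counts = Counter(case.get("category", "routine") for case in test_set)
    return {
        cat: minimum - counts.get(cat, 0)
        for cat, minimum in REQUIRED_CATEGORIES.items()
        if counts.get(cat, 0) < minimum
    }

test_set = [{"category": "routine"}] * 60 + [{"category": "edge"}] * 5
print(coverage_gaps(test_set))  # {'edge': 15, 'adversarial': 20}
```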
Test sets for Gen AI applications need to be tailored to the specific industry and use case at hand. Unlike traditional software, where generic tests may suffice, the unique nature of Gen AI systems requires test cases that reflect the context in which the system will be deployed.
In industries like healthcare, where AI-driven diagnostics might be used, tests must ensure that the system adheres to strict ethical and legal standards while maintaining high accuracy. In financial services, tests may need to evaluate transparency and fairness in decision-making, particularly in areas like lending or insurance underwriting. This level of specificity in test cases demands close collaboration between AI engineers and domain experts who understand the sector’s regulatory requirements, customer expectations, and operational risks. By incorporating this domain knowledge into the test set creation process, organizations can ensure that their Gen AI applications are not only technically sound but also ethically and legally compliant.
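One possible way to encode that domain knowledge, sketched here purely as an illustration, is a per-domain rubric of criteria that every test case in that sector is scored against. The criteria below are examples of the kinds of checks experts might require, not an authoritative compliance list.

```python
# Illustrative per-domain rubrics; how each criterion is scored (human review,
# rules, or an LLM-as-judge) is a separate design choice.
DOMAIN_RUBRICS = {
    "healthcare": [
        "does not present output as a medical diagnosis",
        "recommends consulting a clinician for treatment decisions",
        "cites the guideline or source it relied on",
    ],
    "finance": [
        "explains the main factors behind a lending or underwriting decision",
        "avoids using protected attributes in its reasoning",
        "includes the disclaimers the business requires",
    ],
}

def criteria_for(domain: str) -> list[str]:
    """Look up the rubric a test case in this domain must be scored against."""
    return DOMAIN_RUBRICS.get(domain, [])

for criterion in criteria_for("healthcare"):
    print("-", criterion)
```

Keeping the rubric separate from the test cases means domain experts can revise the criteria as regulations or internal policies change, without rewriting the test set itself.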
For AI applications to successfully transition from PoC to production, organizations must simultaneously address governance, regulation, and evaluation. Establishing governance frameworks ensures responsible development, regulatory awareness prevents compliance roadblocks, and comprehensive evaluation methodologies validate Gen AI behaviors.
By proactively tackling these challenges, companies can unlock the full potential of Gen AI while ensuring their applications are reliable, trustworthy, and legally compliant. The road to production may be complex, but with the right strategies, it is entirely navigable.