Info-Hub

What Really Makes GenAI Projects Work – and Why Many Don’t

Reflections from Building Real-World AI Solutions

Author: Kalaivani K.G., Chief Innovation Officer, exf Financial Data Solutions

In most GenAI PoCs, “success” is often defined as getting a prototype to stand up and produce something that looks roughly right. The demo works, everyone’s excited, and that’s usually enough to move it into a full project. But that’s also where things start to break down, and the project may struggle or fail to gain adoption.

Usefulness and adoption don’t come from an AI application simply generating something plausible. They depend on how accurate and reliable it is for that specific business context, not just in generic benchmarks. They also come from understanding the system’s limitations and designing the UI and workflows to work around those limitations. These aspects are what PoCs often overlook.

Teams pour energy into getting the tech stack, data security, pipelines, and guardrails right, and they usually do. The engineering part tends to get figured out, and more importantly, it’s testable. We have frameworks for everything: guardrails, evaluation, unstructured-data processing, prompt orchestration. Yet so many AI projects stall after the demo stage. Why?

Let’s use the Five Whys to dig deeper.

1. Why do AI projects fail?

They often fail at the point of user adoption — that’s where things break down.

2. Why don’t users adopt them?

Because they don’t find them useful.

3. Why aren’t they useful?

They don’t make everyday work easier, especially the tedious, painful, or error-prone parts. Or sometimes, users stop trusting the results.

4. Why don’t they fulfil user needs?

Because the tech teams end up designing and evaluating PoCs and projects on their own.

5. Why is that a problem?

There are several reasons:

a. The PoC is treated like a software application.
We celebrate getting it to “stand up and run”. Architecture, data security, and latency are all important. But the success criteria are often fuzzy or cursory: a tech-led evaluation, a quick PoC, a “looks good” eyeball review.

b. Heavy reliance on cloud “black-box” services.
We use pre-built services such as copilots or NL-to-SQL agents from cloud providers. They help us build PoCs fast but hide many of the knobs and levers. Each use case has its own nuance. In document preprocessing, for example, chunking strategy and context engineering play a major part in accuracy. Prompt engineering can only take you so far, and relying too heavily on it tends to make the application brittle.
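
To make “knobs and levers” concrete, here is a minimal sketch of a chunker where the settings that drive retrieval accuracy stay visible and tunable. It is purely illustrative: the function name, defaults, and simple character-window approach are assumptions for the sketch, not a recommended implementation.

```python
# Illustrative sketch only: a tiny chunker whose knobs (chunk size, overlap)
# are explicit parameters instead of being hidden inside a managed service.
# The function name and default values are assumptions for this sketch.

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows; both settings typically
    need tuning per document type and use case."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Dense legal documents might need smaller, more overlapping chunks than
# narrative reports; the point is that the lever stays visible and tunable.
legal_chunks = chunk_text("full document text here...", chunk_size=400, overlap=80)
```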

c. Business users weren’t central to the UI or design review.
It is unrealistic to expect a business user to review SQL generated by the AI, or to inspect hundreds of fields extracted from thousands of documents day in and day out. If the “human in the loop” isn’t designed for the user’s reality, the system fails the user.

d. The boundaries and limitations aren’t clearly defined.
Every AI system has blind spots. If we don’t spell out where it will work and where it won’t, then “90% accurate” is almost meaningless, because we still don’t know which 10% will misbehave. When we identify and acknowledge the limitations up front, we can design workflows that handle them rather than pretending they don’t exist.

e. Ignoring the semantic layer — the bridge between business and data.
In many AI systems, the real success factor lies in how clearly we define the semantic layer, i.e., the shared language that connects business meaning to data. Too often, this is treated as an afterthought while we focus on engineering, architecture, or UI design.

When that layer is weak or missing, the system may technically work but conceptually fail. Users see results that don’t match their mental model of the business. 

For instance, in a buy-side asset-management context, say the term “exposure” is defined differently across portfolios — one team using gross exposure, another net, and a third looking through derivatives. An AI system built without that nuanced business understanding will produce inconsistent or misleading analytics. Involving business users early to co-define this layer ensures that models, queries, and dashboards reflect a consistent view of the business and builds trust.
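
As a rough illustration of what making that layer explicit can look like, here is a small Python sketch. The term, portfolio groups, and formulas are assumptions for the example; the point is that the mapping is co-defined with business users and lives in one agreed place.

```python
# Minimal sketch of an explicit semantic layer: business terms mapped to the
# agreed definition for each context. All names and formulas are illustrative.

SEMANTIC_LAYER = {
    "exposure": {
        "equity_long_short": {
            "meaning": "gross exposure",
            "formula": "sum(abs(position_market_value))",
        },
        "multi_asset": {
            "meaning": "net exposure",
            "formula": "sum(position_market_value)",
        },
        "derivatives_overlay": {
            "meaning": "look-through exposure",
            "formula": "sum(delta_adjusted_notional)",
        },
    },
}


def resolve_term(term: str, context: str) -> dict:
    """Return the agreed definition of a business term for a given context,
    so prompts, generated queries, and dashboards all share one meaning."""
    try:
        return SEMANTIC_LAYER[term][context]
    except KeyError:
        # Refuse to guess: an undefined term goes back to the business owners.
        raise KeyError(f"No agreed definition of {term!r} for {context!r}")
```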

What can teams do differently?

  1. Define PoC success criteria upfront.
    Before building anything, clarify what success means, both technically and from a business perspective. Who signs off, and based on what? This gives direction and prevents subjective “it looks good” validations later.
  2. Involve business users early.
    Engage them right from problem definition. Understand their daily pain points, what “value” means to them, and how they will judge success. Their context should drive both design and evaluation.
  3. Define accuracy in context.
    Move beyond generic accuracy or benchmark metrics. A PoC that produces something “close enough” isn’t necessarily useful if it’s wrong in the places that matter most.
    Each use case needs its own definition of correctness, such as which fields, conditions, or user actions must be right for it to build trust. Collaborate with business users to define what “acceptable accuracy” means for their workflow before evaluation begins (a sketch of such criteria follows this list).
  4. Identify boundaries and limitations.
    Instead of claiming broad accuracy, define the exact boundaries of where the system works well and where it might struggle. Document those assumptions clearly — the data ranges, document formats, exception scenarios, or prompts that tend to fail. This clarity helps both design and adoption. Teams can shape workflows, review steps, and UI cues to handle those weak areas gracefully. When users know where to rely on the system and where to be cautious, trust builds naturally.
  5. Co-create the evaluation plan.
    Define test cases, metrics, and edge scenarios together. Get the business team’s buy-in before you begin implementation. This makes evaluation collaborative, not adversarial.
  6. Design for configurability.
    Build systems where parameters such as prompts, thresholds, preprocessing options, context settings, and evaluation rules live in a configuration file or database. When levers are configurable instead of hard-coded, teams can iterate faster and adapt to evolving requirements without re-engineering. It also bridges the gap between tech and business, allowing domain experts to tune behavior without code changes (a minimal configuration sketch follows this list).
  7. Build, test, iterate.
    Treat evaluation not as a final step but as a continuous feedback loop. Configurability accelerates this cycle: adjust, observe, and refine. Over time, the system becomes more reliable and aligned with user reality.
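
To make points 3 and 4 tangible, the sketch below shows use-case-specific accuracy targets and documented boundaries written down as data. It assumes a hypothetical document-extraction use case; the field names, thresholds, formats, and limitations are illustrative and would be agreed with business users, not prescribed here.

```python
# Illustrative only: use-case-specific accuracy targets and documented
# boundaries kept as data, agreed with business users before evaluation starts.

EVALUATION_SPEC = {
    "use_case": "trade_confirmation_extraction",
    # Which fields must be right, and how right, for users to trust the output.
    "field_accuracy_targets": {
        "counterparty_name": 0.99,
        "notional_amount": 1.00,     # any error here breaks trust immediately
        "settlement_date": 0.98,
        "free_text_comments": 0.85,  # lower stakes, lower bar
    },
    # Known boundaries, written down instead of discovered in production.
    "supported_formats": ["machine-readable PDF", "DOCX"],
    "known_limitations": [
        "scanned or handwritten documents are out of scope",
        "multi-currency confirmations route to manual review",
    ],
}
```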
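
And for point 6, a minimal sketch of keeping the levers in configuration rather than code. The file name, keys, and defaults are assumptions for the example, not a prescribed schema.

```python
# Minimal configurability sketch: tunable levers live in a JSON file that
# domain experts can edit, so iteration does not require a redeployment.
# Assumes a file like "genai_app_config.json" exists with matching keys.
import json
from dataclasses import dataclass


@dataclass
class AppConfig:
    system_prompt: str
    chunk_size: int
    chunk_overlap: int
    confidence_threshold: float  # below this, route the result to human review


def load_config(path: str = "genai_app_config.json") -> AppConfig:
    """Read the tunable levers from configuration at startup."""
    with open(path, encoding="utf-8") as fh:
        return AppConfig(**json.load(fh))
```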

These lessons can only be grasped after a few projects: the ones that looked perfect on paper but never took off in practice. They aren’t taught in courses; they surface in quiet user reactions, or in the disengagement that says something is off.

AI systems don’t fail because of poor engineering alone.
They fail because they forget the human at the other end.