Tutorials December 28, 2025

Why Do 90% of AI Projects Fail? Complete Guide to Data Preparation for AI

Pavel Čermák
Managing Director, CTO

Introduction: The Hidden Truth About AI Projects

You bought an expensive AI tool. Deployed it. And now? It hallucinates. Responds with nonsense. Returns outdated information.

Why? Because you skipped the most important step – data preparation.

According to the Data Trust Report 2025, 41% of organizations admit problems with maintaining consistent data quality, which negatively affects AI implementation. And that's just the tip of the iceberg.

"Bad data = bad AI. It's that simple."

In this article, we'll show you:

  • Why data quality is critical for AI success
  • 4 main problems with RAG projects (95% fail in production)
  • How to properly prepare data for RAG systems
  • 5 chunking strategies with concrete examples
  • What RAGus.ai is and how it solves the hallucination problem


1. Main Problems with RAG & Gen AI Projects

95% of RAG projects fail in production. Why? We've identified 4 main causes:


Problem #1: Data Structure

How do you load company data – tens of thousands of unstructured URLs and documents – into a knowledge base in a structured format?

  • Data gets dumped into the knowledge base without any preprocessing or proper structure
  • The knowledge base then cannot return valid information to answer user queries precisely and without hallucinations
  • Data needs to be high quality and consist of unique information (with unnecessary boilerplate removed)

Problem #2: Chunking

In an ideal world, each document or URL's content would sit in its own neatly fitted chunk of clean, well-structured information.

Reality:

  • Vector stores have token limits: 500–4096 tokens per chunk
  • Most platforms use naive chunking strategies that:

    - divide content by fixed token counts
    - allow around 800–1500 tokens per chunk
    - cut text in the middle of sentences
    - apply a fixed token overlap
    - lose context, so retrieval fails

Problem #3: Metadata

Each chunk should have rich and detailed metadata:

  • Origin (document filename / page number, URL link)
  • Fields for filtering
  • Current chunk summary
  • Chunk overlap summary
  • Source page summary

Problem #4: Automated Updates

  • KB built once, never updated
  • AI gives old information
  • Or simply hallucinates

💡 Don't want to handle this yourself? As an AI company with 30+ successfully completed projects, we offer turnkey data preparation. From audit to final integration — we take care of everything.


2. What is RAG and Why It's Key to Accurate AI

Retrieval-Augmented Generation (RAG)

RAG is a technology that combines searching for relevant information with the generative capabilities of AI models.


Instead of AI answering only from its trained knowledge (which may be outdated), RAG:

  1. Retrieval: Finds relevant documents in the knowledge base
  2. Augmentation: Adds found information to the context
  3. Generation: AI generates a response based on current data
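The three steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: word-overlap scoring stands in for real vector search, and the final LLM call (step 3) is left out.

```python
# Minimal sketch of the retrieve-augment steps of RAG.
# Word overlap is a toy stand-in for embedding-based vector search.

def retrieve(query: str, knowledge_base: list[str], top_k: int = 1) -> list[str]:
    # Rank documents by naive word overlap with the query
    q_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def augment(query: str, documents: list[str]) -> str:
    # Prepend retrieved context so the model answers from current data,
    # not from its (possibly outdated) training knowledge
    context = "\n".join(f"- {d}" for d in documents)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

kb = [
    "The Kepler mission ended in 2018 when the spacecraft ran out of fuel.",
    "Office hours: Mon 8-17, Tue closed.",
]
query = "Why did the Kepler mission end?"
prompt = augment(query, retrieve(query, kb))
# The prompt now carries the fuel fact; step 3 would pass it to the LLM.
```

In a real system, `retrieve` would query a vector store and `augment` would be a prompt template, but the shape of the loop is the same.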

Why RAG Works Better Than Classic LLM

| Aspect | Classic LLM | RAG |
|---|---|---|
| Data freshness | ❌ Outdated (training cutoff) | ✅ Current (live data) |
| Source citations | ❌ Cannot verify | ✅ Link to source |
| Hallucinations | ❌ Frequent | ✅ Minimal |
| Accuracy | ~70–80% | 90–99% |

3. How to Properly Prepare Data for AI

4-Step Data Preparation Process


Step 1: Source Mapping

We go through everything – web, documents, databases, emails, internal systems. We find out what you have and in what condition.

Step 2: Cleaning and Unification

Out with duplicates. Out with inconsistencies. One source of truth. One structure.

Step 3: Enrichment and Splitting

We add metadata, summaries, relationships. We split using optimal strategy. AI then knows where to look.

Step 4: Creating Your "Second Brain"

We upload everything into one knowledge base – your central source of truth.

💡 Don't want to handle this yourself? We offer turnkey professional data preparation, backed by experience from 30+ projects.


4. Chunking Strategies: Complete Guide with Examples

AI doesn't read entire documents. It works with "chunks" – pieces of text. How you split them determines how it will answer.

Bad splitting = bad results.

Example Text for All Strategies

For comparison, we'll use the same text about the Kepler space telescope:

TEXT
# The Kepler Space Telescope

The Kepler space telescope was a space observatory launched by NASA to discover Earth-size planets orbiting other stars. It was named after astronomer Johannes Kepler.

## Mission Overview

Kepler's sole instrument was a photometer that continually monitored the brightness of approximately 150,000 main sequence stars in a fixed field of view. It looked for periodic dimming, known as transits, caused by planets crossing in front of their host star.

## Major Discoveries

During its nine-year mission, Kepler observed 530,506 stars and confirmed 2,662 planets. A key finding was that 20-50% of stars visible to the naked eye likely have small, rocky planets similar in size to Earth.

## End of Operations

The mission ended in 2018 when the spacecraft ran out of fuel. The telescope was deactivated, but its data continues to yield new discoveries.

Strategy 1: Fixed Size Splitting

Technical name: Recursive Character Splitter / Fixed Token Count

How it works: Splits text every N tokens, regardless of content.

Example Output:

| Chunk | Content | Problem |
|---|---|---|
| 1 | "The Kepler space telescope was a space observatory... ## Mission Overview Kepler's sole instrument was a" | ❌ Cuts mid-sentence |
| 2 | "photometer that continually monitored the brightness... caused by planets crossing in" | ❌ Starts and ends mid-sentence |
| 3 | "front of their host star. ## Major Discoveries During its nine-year mission..." | ❌ Mixes unrelated sections |

Best for: When speed matters more than quality, or you have a tight budget.

Downsides: Often cuts thoughts, loses context.

💰 Cost: Low | 📈 Quality: ⭐ Low
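The failure mode above is easy to reproduce. This sketch splits by words as a rough stand-in for tokens (real splitters count model tokens with a tokenizer) and applies a fixed overlap between neighbours:

```python
# Sketch of naive fixed-size splitting with fixed overlap.
# Words stand in for tokens; real splitters use a model tokenizer.

def fixed_size_split(text: str, chunk_size: int = 40, overlap: int = 5) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # fixed overlap, regardless of content
    return chunks
```

Because the split points ignore sentence and section boundaries, thoughts get cut mid-sentence exactly as shown in the table above.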


Strategy 2: Header-Based Splitting

Technical name: Markdown Header Chunking

How it works: Looks for headings (#, ##) and keeps everything under each heading together.

Example Output:

| Chunk | Content |
|---|---|
| 1 | # The Kepler Space Telescope — "The Kepler space telescope was a space observatory... named after Johannes Kepler." |
| 2 | ## Mission Overview — "Kepler's sole instrument was a photometer... planets crossing in front of their host star." |
| 3 | ## Major Discoveries — "During its nine-year mission, Kepler observed 530,506 stars..." |
| 4 | ## End of Operations — "The mission ended in 2018 when the spacecraft ran out of fuel..." |

Perfect! Each section stays intact.

Limitation: Doesn't work if document has no headers (chat, emails, transcripts).

💰 Cost: Low | 📈 Quality: ⭐⭐ Medium
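A minimal version of header-based splitting needs only a regular expression that breaks the document right before each Markdown heading:

```python
import re

# Sketch: split a Markdown document at headings so each section stays intact.
def header_split(markdown: str) -> list[str]:
    # Zero-width lookahead keeps the heading line attached to its own section
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = (
    "# Kepler\nIntro text.\n"
    "## Mission Overview\nPhotometer details.\n"
    "## End of Operations\nRan out of fuel."
)
chunks = header_split(doc)  # one chunk per section
```

Library splitters (e.g. Markdown header splitters) add things like heading metadata per chunk, but the core idea is this split.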


Strategy 3: Meaning-Based Splitting (Semantic Chunking)

Technical name: Semantic Chunking

How it works: AI measures how "similar" each sentence is to the next. When the topic changes significantly, it creates a new chunk.

Example Output:

| Chunk | Why Grouped Together |
|---|---|
| "The Kepler space telescope was a space observatory... It looked for periodic dimming, known as transits." | AI sees "telescope," "instrument," "monitor" as related concepts |
| "During its nine-year mission, Kepler observed 530,506 stars and confirmed 2,662 planets..." | Numbers and discoveries = new topic |
| "The mission ended in 2018 when the spacecraft ran out of fuel..." | "End," "deactivated" = termination topic |

Benefit: Groups ideas together even if document has no headers.

Downside: Slower, needs tuning.

💰 Cost: Medium | 📈 Quality: ⭐⭐⭐ High


Strategy 4: AI-Decided Splitting (Agentic/LLM-Based) ⭐

Technical name: LLM-Based Chunking / Agentic Chunking

How it works: AI reads the entire text and decides the best places to split based on logic and flow.

⚠️ Important: Why NOT to ask AI to output the full text with split markers

  • 💰 Costly (duplicates all input tokens as output)
  • 🐢 Slow (massive output generation)
  • Hallucination risk (AI may alter the original text)
  • 📏 Output token window limitations

The Correct 3-Phase Process:

PHASE 1: Pre-Split with Index Tags (No AI needed)

Split text into small fixed-size pieces (~50 tokens) and add numbered tags:

TEXT
<chunk_0>The Kepler space telescope was a space observatory launched by NASA to discover Earth-size planets orbiting other stars.</chunk_0>

<chunk_1>It was named after astronomer Johannes Kepler. Kepler's sole instrument was a photometer that</chunk_1>

<chunk_2>continually monitored the brightness of approximately 150,000 main sequence stars in a fixed field of view.</chunk_2>

<chunk_3>It looked for periodic dimming, known as transits, caused by planets crossing in front of their host star.</chunk_3>

<chunk_4>During its nine-year mission, Kepler observed 530,506 stars and confirmed 2,662 planets.</chunk_4>

<chunk_5>A key finding was that 20-50% of stars visible to the naked eye likely have small, rocky planets similar in size to Earth.</chunk_5>

<chunk_6>The mission ended in 2018 when the spacecraft ran out of fuel.</chunk_6>

<chunk_7>The telescope was deactivated, but its data continues to yield new discoveries.</chunk_7>

PHASE 2: AI Decision (Minimal output — just numbers!)

Ask AI: "Return ONLY the chunk indices where semantic splits should occur."

TEXT
AI Response: split_after: 0, 3, 5, 7

AI outputs only ~10 tokens, not the entire document!

PHASE 3: Programmatic Merge (No AI needed)

Based on split_after: 0, 3, 5, 7, merge chunks:

| Final Chunk | Source | Content |
|---|---|---|
| 1 | chunk_0 | "The Kepler space telescope was a space observatory launched by NASA..." |
| 2 | chunks 1-3 | "It was named after astronomer Johannes Kepler. Kepler's sole instrument was a photometer..." |
| 3 | chunks 4-5 | "During its nine-year mission, Kepler observed 530,506 stars and confirmed 2,662 planets..." |
| 4 | chunks 6-7 | "The mission ended in 2018 when the spacecraft ran out of fuel. The telescope was deactivated..." |

Result: AI decided semantically optimal split points, original text preserved exactly, low token cost.

💰 Cost: Medium | 📈 Quality: ⭐⭐⭐⭐ Very High
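Phases 1 and 3 are plain code around the single LLM call. This sketch pre-splits into tagged micro-chunks and then merges them from a `split_after` list; the Phase 2 model call itself is not shown (its reply is just the short index list):

```python
# Phases 1 and 3 of agentic chunking; the LLM call in Phase 2 is stubbed out.

def pre_split(text: str, size: int = 50) -> list[str]:
    # Phase 1: fixed-size micro-chunks (words as a rough token proxy)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tag(chunks: list[str]) -> str:
    # Wrap each piece in numbered tags for the LLM prompt
    return "\n".join(f"<chunk_{i}>{c}</chunk_{i}>" for i, c in enumerate(chunks))

def merge(chunks: list[str], split_after: list[int]) -> list[str]:
    # Phase 3: join micro-chunks between the split points the model chose
    merged, start = [], 0
    for idx in split_after:
        merged.append(" ".join(chunks[start:idx + 1]))
        start = idx + 1
    if start < len(chunks):
        merged.append(" ".join(chunks[start:]))
    return merged

micro = ["A", "B", "C", "D", "E", "F", "G", "H"]
final = merge(micro, [0, 3, 5, 7])  # the indices from the example above
# → ["A", "B C D", "E F", "G H"]
```

Because the original text only ever flows through `pre_split` and `merge`, the model cannot alter it, which is exactly why this pattern avoids the hallucination risk listed above.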


Strategy 5: Summarization / Q&A Format

Technical name: Structured Output / Q&A Format / Propositional Indexing

How it works: Instead of keeping the original text, extract pure facts as question-answer pairs or structured JSON.

Example Output:

| Question | Answer |
|---|---|
| What was the Kepler Space Telescope? | A NASA observatory designed to discover Earth-size planets. |
| What instrument did Kepler use? | A photometer that monitored star brightness. |
| How many planets did Kepler confirm? | 2,662 planets. |
| How many stars did Kepler observe? | 530,506 stars. |
| Why did Kepler's mission end? | The spacecraft ran out of fuel in 2018. |

Best for: FAQ systems, databases, when users ask specific questions.

Downside: Loses the story and context between facts. Can hallucinate.

💰 Cost: High | 📈 Quality: ⭐⭐⭐⭐ Very High
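One way to implement this is to ask the model for a JSON array of pairs and validate the reply before indexing anything. The prompt wording and the sample reply below are illustrative; in practice the reply comes from your LLM API:

```python
import json

def build_prompt(text: str) -> str:
    # Ask for structured output only, so the reply is machine-parseable
    return (
        "Extract every factual statement from the text below as question-answer "
        'pairs. Return a JSON array of {"question": ..., "answer": ...} objects '
        f"and nothing else.\n\nTEXT:\n{text}"
    )

def parse_pairs(llm_reply: str) -> list[dict]:
    # Validate shape before indexing into the knowledge base
    pairs = json.loads(llm_reply)
    return [p for p in pairs if {"question", "answer"} <= p.keys()]

# Illustrative stand-in for a real model reply
sample_reply = (
    '[{"question": "How many planets did Kepler confirm?",'
    ' "answer": "2,662 planets."}]'
)
pairs = parse_pairs(sample_reply)
```

Validating the reply matters here: since the model rewrites content rather than copying it, malformed or invented pairs are the main failure mode to guard against.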


📊 Chunking Strategies Comparison Table

| # | Strategy | Technical Name | What It Does | ✅ Pros | ❌ Cons | 💰 Cost | 📈 Quality |
|---|---|---|---|---|---|---|---|
| 1 | Fixed Size | Recursive Character Splitter | Cuts every N tokens | Fast, simple | Cuts sentences, loses context | 💲 Low | ⭐ |
| 2 | By Headers | Markdown Header Chunking | Splits at headings #, ## | Keeps sections together | Requires structure | 💲 Low | ⭐⭐ |
| 3 | By Meaning | Semantic Chunking | AI detects topic change | Smart grouping | Slower, needs tuning | 💲💲 Medium | ⭐⭐⭐ |
| 4 | AI-Decided | LLM-Based / Agentic | AI reads and decides splits | Handles messy text | Requires AI processing | 💲💲 Medium | ⭐⭐⭐⭐ |
| 5 | Summarization | Structured Output / Q&A | Converts to Q&A pairs | Perfect for FAQ | Loses narrative | 💲💲💲 High | ⭐⭐⭐⭐ |

💡 Pro Tip: If AI-Decided chunking creates a chunk that exceeds your vector store's limit (e.g., 8,000 tokens), combine it with Summarization. AI decides boundaries → then oversized chunks get condensed.


5. Enhancing Retrieval Quality (Enhancement Strategies)

Chunking is just the beginning. Top-tier RAG systems use enhancement techniques that give each piece of text "superpowers":

Add Document Summary Context

We add global context – document title, main topic – to every chunk.

Without context (problem):

TEXT
"The mission ended in 2018 when the spacecraft ran out of fuel."

❌ AI doesn't know which mission!

With Document Summary:

TEXT
[Source: NASA Kepler Space Telescope mission (2009-2018), exoplanet detection]
"The mission ended in 2018 when the spacecraft ran out of fuel."

✅ Now a search for "fuel" or "Kepler" returns a chunk that makes sense on its own.

Search Keywords Generation

We insert likely user questions and keywords before the actual text:

TEXT
kepler fuel end mission 2018 shutdown why did kepler stop | The mission ended in 2018...
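Both enhancements reduce to prepending extra text to each chunk before it is embedded. A minimal sketch, with illustrative field names:

```python
# Sketch: combine both enhancements by prefixing the chunk with likely search
# keywords and a document-level summary before embedding.

def enhance_chunk(chunk: str, doc_summary: str, keywords: list[str]) -> str:
    return f"{' '.join(keywords)} | [Source: {doc_summary}]\n{chunk}"

enhanced = enhance_chunk(
    "The mission ended in 2018 when the spacecraft ran out of fuel.",
    "NASA Kepler Space Telescope mission (2009-2018), exoplanet detection",
    ["kepler", "fuel", "end mission", "2018"],
)
# The enhanced text, not the raw chunk, is what gets embedded into the vector store.
```

The keyword list is typically generated by an LLM per chunk, which is why this enhancement sits in the "High" cost tier below.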

Enhancement Strategies Comparison

| Enhancement | What It Does | 💰 Cost | 📈 Quality |
|---|---|---|---|
| Document Summary | Adds context to each chunk | 💲💲 Medium | ⭐⭐⭐ |
| Search Keywords | Adds likely search queries | 💲💲💲 High | ⭐⭐⭐⭐ |
| Both Combined | Maximum accuracy | 💲💲💲 High | ⭐⭐⭐⭐⭐ Elite |

6. What Makes Data "AI-Ready"?

7 Key Properties of Quality Data

  1. Complete thoughts, not fragments – Text isn't cut off mid-sentence. AI gets complete information.
  2. Clear hierarchy – AI knows exactly where to look for answers and where supporting data is.
  3. Pre-prepared questions – Each block has associated questions it answers.
  4. Summary for each block – AI immediately understands context without reading the entire document.
  5. Connections between parts – Each block knows what came before. AI understands relationships.
  6. Metadata for filtering – Date, category, source. AI can search exactly where it needs to.
  7. Origin of each piece – AI can cite sources and you know it's not made up.

Example: Bad vs. Good Data

❌ Typical Reality (raw_data.txt):

Chaos. Duplication. Missing context. AI guesses.

TEXT
Office hours Monday 8-17 Tuesday closed
Wednesday 8-12 and 13-17 Office hours: Mon
8:00-17:00, Tue: closed, Wed: 8-12, 13-17
OFFICE HOURS Monday eight to seventeen
Opening hours: Mon 8-17 city office
open from 8 to 5 pm on Monday
Tuesday is closed Wednesday half day then
again from one o'clock afternoon contact
tel. 123456789 or email info@...

🏷️ Problems: Duplicates | Inconsistencies | Missing metadata | Poor structure


✅ After Our Preparation (chunk_001.json):

Clean structure. Metadata. Context. AI knows.

JSON
{
  // Vector searchable fields
  "searchableFields": {
    "rag_question": "What are the office hours of the city hall?",
    "content": "Office hours: Mon 8-17, Tue closed, Wed 8-12 and 13-17",
    "source_page_summary": "City Hall contact page",
    "current_chunk_summary": "Office opening hours",
    "overlap_summary": "...contact details and address"
  },
  // Filterable metadata
  "metadataFields": {
    "source_url": "city-hall.com/contact",
    "category": "office hours",
    "date_int": 20250115,
    "language": "en",
    "chunk_index": 3
  }
}

🏷️ Benefits: Optimized for RAG | Full metadata | No duplicates | Clear hierarchy


7. RAGus.ai: Our Solution to the Problem

What is RAGus.ai?

RAGus.ai is our specialized RAG-as-a-Service platform designed for:

  • AI agencies
  • Enterprise AI teams
  • RAG developers
  • No-code builders (Voiceflow, Botpress)

Key Features

| Feature | Description |
|---|---|
| Centralized dashboard | Manage all AI products in one place |
| Advanced analytics | Conversation stats, detailed reporting |
| Integrated helpdesk | Efficient query handling and escalation |
| Direct integrations | OpenAI, Voiceflow, Pinecone, Qdrant |
| 5 chunking strategies | Including agentic (LLM) chunking |
| Fast search | Responses in milliseconds |
| Enterprise security | SOC 2, bank-grade encryption |

How RAGus.ai Solves Hallucinations

  1. Clean and structured data – Automatic cleaning and deduplication
  2. Automatic synchronization – Knowledge base is always current
  3. Response monitoring – Continuous improvement
  4. Feedback – Thumbs up/down for training

Result: 90%+ accuracy immediately, up to 99% within 3 months.


Clean and Structured Data — the Core of Successful AI

A quality AI assistant is only as good as the data you feed it. RAGus.ai is our proprietary admin panel that serves as the central brain for all your AI products. It ensures your knowledge base is always up-to-date, clear, and accurate.

  • 99% accuracy through cleaned data
  • Central management of all AI products in one place
  • Automated knowledge base synchronization
  • Efficient monitoring and oversight of the AI 'brain'

8. Two Paths: Service vs. Self-Service

Professional Service (recommended)

For whom:

  • You don't have capacity for data preparation
  • You want guaranteed results without hassle
  • You appreciate expert guidance and support

Price: from 2,500 CZK/hour or from 15,000 CZK per data source

👉 Order turnkey data preparation

RAGus.ai Self-service

For whom:

  • You have a technical team and want control
  • You need automation and scaling
  • You're building your own AI products

Price: from $49.99/month

👉 Try RAGus.ai for free


9. Conclusion and Next Steps

Key Takeaways

  1. 95% of RAG projects fail due to poor data structure
  2. 4 main problems: Structure, Chunking, Metadata, Updates
  3. 5 chunking strategies – from simple to AI-driven
  4. Enhancement techniques dramatically increase accuracy
  5. RAGus.ai offers a complete solution from preparation to monitoring

Create Your "Second Brain" for AI

It doesn't matter where your data is or what format it's in. We'll connect everything into one cohesive place – a knowledge base from which AI draws.

No searching. No guessing. No hallucinations.


*Interested in a free consultation? Contact us and we'll show you how to create one unified source of truth from scattered data.*

Tags: Data Preparation, RAG, RAGus.ai, AI Hallucinations, Chunking, Knowledge Base, Vector Database
