In this article, I would like to share my experience participating in the Agentic Legal RAG Challenge 2026 hackathon. Our team is called "Sparks of intelligence".
Original article in Russian
1. About the Competition
The competition was organized by EORA AI APPLICATIONS AND SERVICES. Task: to develop an application capable of accurately answering questions about documents from the Dubai International Financial Centre (DIFC) courts. Prize fund — $32,000. Number of participants — over 300.
The competition was held in two stages:
| Stage | Dates | Number of Documents | Number of Questions |
|---|---|---|---|
| Warmup | March 11–19 | 30 | 100 |
| Final | March 20–22 | 300 | 900 |
The organizers took the selection of questions seriously. The questions were diverse, and each had a specified answer type:
- `boolean` (yes/no)
- `name` (names, counterparties)
- `date`
- `number` (float)
- `free_text` (text answer)
A comprehensive evaluation system was developed, including:
- accuracy
- speed
- token consumption
All metrics are described in detail in the documentation with code examples. Free text was evaluated using an LLM (probably not strictly — it was enough to provide the correct facts).
More details — on the official competition website.
2. What Makes the Task Difficult?
The truth is out there…
Most likely, everyone reading this article is familiar with vector search to some extent. Many have faced the challenges of searching through a large number of documents.
Example:
Suppose you are asked to find a book with an apple pie recipe. A regular paper book, on a shelf — have you heard of those?
- If you have a few dozen books on your shelf, you'll easily find the right one. You'll look for something like a "cookbook" or "recipe book". Roughly speaking, these are vectors. The search will take seconds. If the desired book is there, you'll find it, check the table of contents, and discover the recipe. Or you'll make sure it's not there.
- If you have a solid home library, you'll spend much more time. But the mission is possible.
- In a central library, you wouldn't have a chance to find the right book just by brute force. That's why people invented indexing: by authors, topics, year of publication, etc.
At the start of the competition, we had no experience developing serious RAG systems. It was time to figure it out.
3. Modern Vector Databases and RAG Approaches
I'd know it in a million … embeddings
During our research, we explored various document search methods. Even a brief description of them would fill a separate article: from semantic chunking to training LoRA models for each specific document.
In short, modern vector databases are arranged roughly as follows:
A) Search: Hybrid Method, Vectors + Best Match
A vector captures meaning. “Breach of contract” and “cancellation of agreement” will have similar vectors.
However, finding, for example, “in which city/court was the case of Jason vs. Krueger heard” using vector search is unlikely to work.
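This is why hybrid search combines the two result lists. One common recipe for merging them, shown here as an illustration rather than as what any particular engine does internally, is reciprocal rank fusion (RRF):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one.

    `rankings` holds ranked lists from different retrievers
    (e.g. vector search and keyword/BM25 search). RRF combines
    them without having to normalize incompatible score scales.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search finds semantically close chunks; keyword search
# catches exact strings like "Jason vs. Krueger".
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c7", "c9"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# "c7" wins: it appears high in both lists.
```

A chunk found by both retrievers rises to the top even if neither ranked it first, which is exactly the behavior you want for questions that mix semantics with exact names.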
B) Chunking: Splitting Texts into Semantic Fragments
Some sources say that good chunking is more than half the battle. If a single fragment mixes unrelated topics, its vector will have no clear direction, and it will be extremely difficult to find similar vectors (and thus the source of information).
A “naive” approach to chunking is splitting text into fragments by certain patterns: article 1, paragraph 2, item 4. If the document has a clear hierarchy, this approach may work. But the chunk size is extremely important. A chunk that's too small will have a clear vector, but the context will be lost.
Example:
Present at the meeting:
- Participant 1
- Participant 2
- Participant 3
If you split by list marker, each item loses its meaning.
Possible solutions (from simple to complex):
| Approach | Description | Pros | Cons |
|---|---|---|---|
| B1. Fixed size + overlap | Chunks of N tokens with overlap | Simple implementation | Risk of context break |
| B2. Hierarchical | Large chunks → small; search by small, context from large | Preserves context + accuracy | More complex to implement |
| B3. Semantic | Grouping by meaning using ML | Maximum relevance | Complexity, resource demands |
Each solution has its pros and cons; there is no universal one. We didn't dive into semantic chunking: there wasn't enough time. We used options B1 and B2.
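The fixed-size approach (B1) can be sketched in a few lines. Token counts and sizes here are toy values for illustration:

```python
def chunk_fixed(tokens, size=256, overlap=32):
    """Split a token list into fixed-size chunks with overlap.

    The overlap reduces the risk of cutting a sentence or clause
    in half at a chunk boundary, at the cost of some duplication.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# Toy example using words as "tokens".
tokens = "the court held that the contract was validly terminated".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
# Each chunk repeats the last word of the previous one.
```

Note how the last word of each chunk reappears at the start of the next, so a clause cut at a boundary still survives intact in one of the two chunks.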
C) Reranker
Evaluating vector similarity by cosine distance is rather superficial. The required context is not always in the chunks with the highest similarity scores.
Modern approaches involve using rerankers — special models trained on a huge number of question-answer pairs. They assess similarity better.
Instead of top-k closest chunks, we find top-k*10, then rerank and select the top-k.
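A minimal sketch of this over-retrieve-then-rerank pattern. The reranker here is a toy stand-in for a real cross-encoder model, and the chunk format is an assumption:

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_and_rerank(query_vec, chunks, rerank_fn, top_k=3):
    """Over-retrieve by cheap cosine similarity, then let an
    expensive reranker pick the final top_k.

    `rerank_fn(query_vec, chunk)` stands in for a cross-encoder
    score; any callable with that shape works for the sketch.
    """
    # Stage 1: cheap vector search over-retrieves top_k * 10 candidates.
    candidates = sorted(chunks,
                        key=lambda c: cosine(query_vec, c["vec"]),
                        reverse=True)[:top_k * 10]
    # Stage 2: the reranker re-scores only those candidates.
    return sorted(candidates,
                  key=lambda c: rerank_fn(query_vec, c),
                  reverse=True)[:top_k]

chunks = [{"id": "c1", "vec": [1.0, 0.0]},
          {"id": "c2", "vec": [0.9, 0.1]},
          {"id": "c3", "vec": [0.0, 1.0]}]
# Toy "reranker": strongly prefers chunk c2.
top = retrieve_and_rerank([1.0, 0.0], chunks,
                          lambda q, c: 1.0 if c["id"] == "c2" else 0.0,
                          top_k=2)
```

The point of the pattern is cost: the reranker is far slower per pair than a vector lookup, so it only ever sees the small candidate set, not the whole collection.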
4. Overview of Our Two Architecture Variants
After reviewing different solutions, the team implemented two variants: a simpler one and a more complex one.
For both, we chose the Qdrant vector database plus LlamaIndex, which provides convenient methods for working with the vector database and abstractions over LLMs. This combination is common in the solutions we analyzed. Text was extracted from the PDF documents, preserving their structure, using the Unstructured library.
| Parameter | Solution 1 (Simple) | Solution 2 (Agentic) |
|---|---|---|
| Chunking | By pages + overlap | Hierarchical + LLM analysis |
| Search | Hybrid + metadata + regex | Agent-router → 4 tools |
| Reranker | ✅ | ✅ |
| Complexity | 🟡 Medium | 🔴 High |
| Speed | 🟡 Medium | 🔴 Low (2× LLM calls) |
4.1 Hybrid Search + Metadata, Chunking by Pages
As simple as ground truth

Chunking was done by pages with overlap. This is a working solution that immediately solved the grounding issue. In the other solution, we had to set page break markers and then remove them. This caused problems with small, fragmented chunks, and we had to complicate the algorithm.
Regular expressions were used to search for patterns. The found values were used for filtering.
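A sketch of this regex-based metadata extraction. The pattern below is purely illustrative: real DIFC citation formats differ, and the exact regexes we used are not reproduced here.

```python
import re

# Hypothetical case-number pattern, for illustration only.
CASE_RE = re.compile(r"\bCFI\s*(\d+/\d{4})\b")

def extract_metadata(page_text):
    """Pull filterable fields out of a page before indexing it.

    The extracted values go into the chunk's metadata, so later
    a query mentioning a case number can filter on it directly
    instead of relying on vector similarity alone.
    """
    meta = {}
    m = CASE_RE.search(page_text)
    if m:
        meta["case_number"] = m.group(1)
    return meta

meta = extract_metadata("In the matter of CFI 042/2024 before the DIFC Courts")
```

The failure modes listed below follow directly from this design: a citation written in an unanticipated format is silently missed, and a similar-looking string elsewhere produces a false positive.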
What worked:
- Fast implementation
- Predictable behavior
- Grounding "out of the box" (chunk = page)
Problems:
- Regex misses patterns or gives false positives
- Hard binding to document structure
- Initially, grounding was not loaded into metadata (fixed later)
4.2 An Agentic RAG System, Hierarchical chunking
99 ways to… get confused

The second variant was an attempt to build an agentic system + hierarchical chunking.
Chunking algorithm:
1. Preliminary analysis of the structure of a random sample of documents using an LLM
2. Fixing patterns for splitting
3. Recursively merging small chunks into larger ones until the desired size is reached
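The final merging step can be sketched roughly like this. The length threshold and the "glue to the previous neighbour" rule are simplified assumptions, not our exact algorithm, which also tracked heading hierarchy:

```python
def merge_small_chunks(chunks, min_len=200):
    """Repeatedly glue undersized chunks onto their predecessor
    until every surviving chunk (except possibly the first) is
    at least min_len characters long.
    """
    changed = True
    while changed:
        changed = False
        merged = []
        for chunk in chunks:
            # Merge if either side of the boundary is undersized.
            if merged and (len(merged[-1]) < min_len or len(chunk) < min_len):
                merged[-1] = merged[-1] + "\n" + chunk
                changed = True
            else:
                merged.append(chunk)
        chunks = merged
    return chunks

parts = merge_small_chunks(["short", "x" * 250, "tiny"], min_len=100)
# Both small fragments end up glued to the large middle chunk.
```

The loop terminates because every merge strictly reduces the chunk count. The weakness described below is visible even in this sketch: the rule is purely positional, so a stray fragment gets glued to whatever happens to sit next to it, regardless of meaning.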
In practice, the algorithm did not work perfectly. On different pages of the same document, the same splitting pattern produced different results, and many useless small chunks were simply glued to their nearest neighbours without any system. Perhaps lightweight models for chunking exist somewhere; we didn't find any. Or semantic chunking is needed.
Search algorithm: the LLM selects arguments and metadata based on the context of the question. We built four search tools: simple metadata search, exact-match search, document comparison, and hybrid search. Each can filter by case or law number.
Architecture: Agent-router → search tools → agent-generator. The router receives the user's question, a list of tools with descriptions, and instructions on which tool to use for what. The tools use various search methods in the vector database (vectors, keys, metadata), reranker. They output one or more text fragments to the answer generator. The generator processes this data and produces an answer in the required format.
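For illustration, this is the kind of tools schema the router must receive (OpenAI function-calling style). The tool names and parameters here are made up, not our exact schemas; as we found out, it is easy to implement the functions and forget to hand this list to the agent:

```python
# Hypothetical tool schemas for two of the four search tools.
# The agent can only call what appears in this list, so omitting
# it silently degrades every question to the default behavior.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "metadata_search",
            "description": "Look up chunks by case or law number only.",
            "parameters": {
                "type": "object",
                "properties": {"case_number": {"type": "string"}},
                "required": ["case_number"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "hybrid_search",
            "description": "Vector + keyword search, optional case filter.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "case_number": {"type": "string"},
                },
                "required": ["query"],
            },
        },
    },
]
```

The tool descriptions double as routing instructions: the router chooses a tool largely by matching the question against these `description` strings, so they deserve as much care as the prompts themselves.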
After the fix, the tools were selected correctly. However, there was not enough time to fully test and tune the system.
What worked:
- The router correctly selected tools based on the question context
- Filtering by case/law number worked
Problems:
- Chunking algorithm is unstable: one pattern → different behavior on different pages
- There was "garbage" from small chunks glued together without any system
- Not enough time for full testing and debugging
5. Conclusions and Results
Our team did not reach outstanding results. However, the results of our experiments and the description of the process may still be of interest.
Solution 1 (simple)
At the warmup stage
Received the highest (of our variants) score for the accuracy of deterministic answers, where an exact value is specified.
Grounding was quite low. Most likely, the answers were simply taken from the metadata. When this was fixed, grounding increased, but accuracy decreased.
At the final stage
Achieved accuracy of 0.79 with grounding 0.63
Solution 2 (complex)
At the warmup stage
Received the lowest accuracy score of all our attempts.
Speed was significantly lower due to 2 LLM calls.
The low accuracy was probably due to “rookie” mistakes (for example, we implemented the search-tool functions but did not pass the tools schema to the agent, so hybrid search was performed even in simple cases).
At the final stage
Did not manage to submit the solution (API errors occurred at the last moment, then the submission closed).
By that time, many errors had been fixed. Subjectively, the search started working much better.
| Solution | Stage | Accuracy Det | Grounding | Speed | Comment |
|---|---|---|---|---|---|
| Solution 1 | Warmup | 0.9 → 0.81* | 0.5 → 0.58* | 🟡 Medium | *After grounding fix, accuracy dropped slightly |
| Solution 1 | Final | 0.79 | 0.63 | 🟡 Medium | Stable, but not perfect |
| Solution 2 | Warmup | 0.74 | 0.6 | 🔴 Low | “Rookie” mistakes: tools not passed to agent |
| Solution 2 | Final | — | — | — | Not submitted (API errors + deadline) |
Hopefully, thanks to this competition, we improved our skills in vector search. In the future, we will try to take all of these mistakes into account.
I would be glad to discuss these and other RAG approaches in the comments.
P.S. According to the competition rules, the dataset and participants' code are closed. I cannot provide links.
P.P.S. Obviously, under such tight deadlines, it is almost impossible to build a system without coding agents. However, they should be used with caution. You can trust an agent to write a wrapper for a database or for an API where the input and output formats are clearly defined. When building complex algorithms or pipelines, even a clear task statement does not always lead to a result. To achieve the goal, agents are ready for any tricks: replacing real methods with stubs, substituting "magic numbers" for real data, and even "optimizing" the task itself, for example using regexes tailored to certain patterns instead of an LLM. At the same time, all tests pass perfectly, and it can be extremely difficult to detect such cheating.
