In this article, I would like to share my experience participating in the Agentic Legal RAG Challenge 2026 hackathon. Our team is called "Sparks of intelligence".
Original article in Russian

1. About the Competition

The competition was organized by EORA AI APPLICATIONS AND SERVICES. The task: develop an application capable of accurately answering questions about documents from the Dubai International Financial Centre (DIFC) courts. Prize fund: $32,000. Number of participants: over 300.

The competition was held in two stages:

| Stage | Dates | Number of Documents | Number of Questions |
|---|---|---|---|
| Warmup | March 11–19 | 30 | 100 |
| Final | March 20–22 | 300 | 900 |

The organizers took the selection of questions seriously. The questions were diverse, and each had a specified answer type:

  • boolean (yes/no)

  • name (names, counterparties)

  • date

  • number (float)

  • free_text (text answer)
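As a sketch, a question record with a declared answer type might look like the following. The field names and the validation helper are illustrative, not the competition's actual schema:

```python
# Hypothetical question records; field names are illustrative,
# not the competition's real schema.
questions = [
    {"question": "Was the appeal dismissed?", "answer_type": "boolean"},
    {"question": "On what date was the judgment issued?", "answer_type": "date"},
    {"question": "What amount was awarded in damages?", "answer_type": "number"},
]

def validate_answer(answer, answer_type):
    """Toy check that a parsed answer matches the declared type."""
    if answer_type == "boolean":
        return isinstance(answer, bool)
    if answer_type == "number":
        return isinstance(answer, (int, float))
    # name, date, free_text: any non-empty string in this toy version
    return isinstance(answer, str) and bool(answer.strip())
```

Declaring the type up front lets a pipeline coerce the LLM output before submission instead of returning raw text.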

A comprehensive evaluation system was developed, including:

  • accuracy

  • speed

  • token consumption

All metrics are described in detail in the documentation, with code examples. Free-text answers were evaluated by an LLM (probably not strictly: providing the correct facts was enough).

More details are available on the official competition website.

2. What Makes the Task Difficult?

The truth is out there…

Most likely, everyone reading this article is familiar with vector search to some extent. Many have faced the challenges of searching through a large number of documents.

Example:

Suppose you are asked to find a book with an apple pie recipe. A regular paper book, on a shelf (have you heard of those?).

  • If you have a few dozen books on your shelf, you'll easily find the right one. You'll look for something like a "cookbook" or a "recipe book"; roughly speaking, these are vectors. The search will take seconds. If the desired book is there, you'll find it, check the table of contents, and discover the recipe. Or you'll make sure it's not there.

  • If you have a solid home library, you'll spend much more time, but the mission is possible.

  • In a central library, you wouldn't have a chance to find the right book by brute force. That's why people invented indexing: by author, topic, year of publication, and so on.

At the start of the competition, we had no experience developing serious RAG systems. It was time to figure it out.

3. Modern Vector Databases and RAG Approaches

I'd know it in a million … embeddings

During our research, we explored various document search methods. Even a brief description would be enough for a separate article: from semantic chunking to training LoRA models for each specific document.

In short, modern vector databases are arranged roughly as follows:

A) Search: Hybrid Method, Vectors + Best Match

  • A vector contains meaning. “Breaking of contract” and “cancellation of agreement” will have similar vectors.

  • However, finding, for example, “in which city/court was the case of Jason vs. Krueger heard” using vector search is unlikely to work.
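The two signals can be blended. Below is a toy sketch of hybrid scoring; real systems typically combine dense embeddings with BM25, and the `alpha` weight and whitespace tokenization here are simplifications, not values from any actual solution:

```python
import math

def cosine(a, b):
    """Cosine similarity between two toy vectors (the semantic signal)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def keyword_overlap(query, text):
    """Fraction of query terms appearing verbatim in the text
    (a crude stand-in for BM25-style best-match scoring)."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms)

def hybrid_score(query, text, q_vec, t_vec, alpha=0.5):
    """Blend semantic similarity with exact term matching.
    alpha is an arbitrary tuning knob for this illustration."""
    return alpha * cosine(q_vec, t_vec) + (1 - alpha) * keyword_overlap(query, text)
```

Exact names like "Jason" or "Krueger" contribute through the keyword term even when the embedding model treats them as noise.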

B) Chunking: Splitting Texts into Semantic Fragments

Some sources say that good chunking is more than half the battle. If a single fragment mixes unrelated topics, its vector will have no clear direction, and finding similar vectors (and thus the source of the information) becomes extremely difficult.

A “naive” approach to chunking is splitting text into fragments by certain patterns: article 1, paragraph 2, item 4. If the document has a clear hierarchy, this approach may work. But the chunk size is extremely important. A chunk that's too small will have a clear vector, but the context will be lost.

Example:

Present at the meeting: - Participant 1 - Participant 2 - Participant 3

If you split by list marker, each item loses its meaning.
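A minimal illustration of that failure mode in Python:

```python
text = (
    "Present at the meeting:\n"
    "- Participant 1\n"
    "- Participant 2\n"
    "- Participant 3"
)

# Naive split on the list marker: each item becomes its own chunk.
naive_chunks = text.split("\n- ")
print(naive_chunks[2])  # "Participant 2" -- who? at which meeting? context is gone

# Keeping the whole block as one chunk preserves the context.
whole_chunk = text
```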

Possible solutions (from simple to complex):

| Approach | Description | Pros | Cons |
|---|---|---|---|
| B1. Fixed size + overlap | Chunks of N tokens with overlap | Simple implementation | Risk of context break |
| B2. Hierarchical | Large chunks → small; search by small, context from large | Preserves context + accuracy | More complex to implement |
| B3. Semantic | Grouping by meaning using ML | Maximum relevance | Complexity, resource demands |

Each solution has its pros and cons; there is no universal one. We didn't dive into semantic chunking, as there wasn't enough time. We used options B1 and B2.
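For reference, B1 fits in a few lines. A sketch of fixed-size chunking with overlap (the sizes are arbitrary, and real implementations operate on tokenizer output rather than a plain list):

```python
def chunk_with_overlap(tokens, size=200, overlap=50):
    """B1: fixed-size chunks of `size` tokens, each sharing `overlap`
    tokens with the previous chunk, so a sentence cut at a boundary
    still appears whole in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```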

C) Reranker

Evaluating vector similarity by cosine distance is rather superficial. The required context is not always in the chunks with the highest similarity.

Modern approaches involve using rerankers — special models trained on a huge number of question-answer pairs. They assess similarity better.

Instead of taking the top-k closest chunks directly, we fetch the top k×10, rerank them, and keep the top k.
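A sketch of that two-stage scheme. A toy term-overlap function stands in for a real cross-encoder reranker, and the documents and scores here are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    vec_score: float  # first-stage (vector) score, precomputed for the toy

def rerank(query, candidates, rerank_score, k):
    """Second stage: rescore the over-fetched candidates with a stronger
    model and keep the top-k. `rerank_score` stands in for a cross-encoder."""
    scored = sorted(candidates, key=lambda d: rerank_score(query, d.text), reverse=True)
    return scored[:k]

# Pretend the first stage returned candidates ordered by cosine similarity;
# the actually relevant chunk sits lower in that list.
candidates = [
    Doc("breach of contract discussed in passing", 0.92),
    Doc("the contract was terminated on 1 May", 0.88),
]

def toy_cross_score(q, t):
    """Toy reranker: rewards exact term overlap with the question."""
    return len(set(q.lower().split()) & set(t.lower().split()))

top = rerank("when was the contract terminated", candidates, toy_cross_score, k=1)
```

The reranker promotes the second chunk despite its lower vector score, which is exactly the correction this stage is for.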

4. Overview of Our Two Architecture Variants

After reviewing different solutions, the team implemented two variants: a simpler one and a more complex one.

For both, we chose the Qdrant vector database plus LlamaIndex, which provides convenient methods for working with the vector database and abstractions over LLMs. This combination is common in the solutions we analyzed. Text was extracted from the PDF documents, preserving their structure, using the Unstructured library.

| Parameter | Solution 1 (Simple) | Solution 2 (Agentic) |
|---|---|---|
| Chunking | By pages + overlap | Hierarchical + LLM analysis |
| Search | Hybrid + metadata + regex | Agent-router → 4 tools |
| Reranker | | |
| Complexity | 🟡 Medium | 🔴 High |
| Speed | 🟡 Medium | 🔴 Low (2× LLM calls) |

4.1 Hybrid Search + Metadata, Chunking by Pages

As simple as ground truth

Hybrid search, chunking by pages
Hybrid search, chunking by pages
Chunking was done by pages with overlap. This is a working solution that immediately solved the grounding issue. In the other solution, we had to insert page-break markers and then remove them, which caused problems with small, fragmented chunks and forced us to complicate the algorithm.

Regular expressions were used to search for patterns. The found values were used for filtering.

What worked:

  • Fast implementation

  • Predictable behavior

  • Grounding "out of the box" (chunk = page)

Problems:

  • Regex misses patterns or gives false positives

  • Hard binding to document structure

  • Initially, grounding was not loaded into metadata (fixed later)

4.2 An Agentic RAG System, Hierarchical chunking

99 ways to… get confused

Agentic RAG, hierarchical chunking

The second variant was an attempt to build an agentic system + hierarchical chunking.

Chunking algorithm:

  1. Preliminary analysis of the structure of a random sample of documents using LLM

  2. Fixing patterns for splitting

  3. Recursively merging small chunks into larger ones until the desired size is reached

In practice, the algorithm did not work perfectly. On different pages of the same document, the same splitting pattern produced different results, and many useless small chunks were simply glued to their nearest neighbors without any system. Perhaps lightweight models for chunking exist somewhere, but we didn't find any; alternatively, semantic chunking may be needed.
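The "glue small chunks to their neighbors" step can be sketched as a greedy merge. The thresholds below are arbitrary, and this is only an approximation of the approach described, including its weakness: merging is purely positional, with no regard for meaning:

```python
def merge_small_chunks(chunks, min_len=200, max_len=1000):
    """Greedy pass: any chunk shorter than `min_len` characters is merged
    into the preceding chunk, as long as the result stays under `max_len`.
    Positional merging only -- the "without any system" behavior."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk) < min_len and len(merged[-1]) + len(chunk) <= max_len:
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)
    return merged
```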

Search algorithm: arguments and metadata filters are selected by the LLM based on the context of the question. We implemented four search tools: simple metadata search, exact-match search, document comparison, and hybrid search. Each can filter by case or law number.

Architecture: agent-router → search tools → agent-generator. The router receives the user's question, a list of tools with descriptions, and instructions on which tool to use when. The tools apply various search methods over the vector database (vectors, keywords, metadata) plus a reranker, and pass one or more text fragments to the answer generator. The generator processes this data and produces an answer in the required format.
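To make the control flow concrete, here is a stand-in sketch of the router → tool → generator pipeline. In the actual system an LLM chose the tool from the descriptions; here a keyword heuristic does, and all names and return values are illustrative:

```python
# Stub tools mirroring the four described above; real ones query Qdrant.
def metadata_search(q):    return f"metadata results for: {q}"
def exact_match_search(q): return f"exact-match results for: {q}"
def compare_documents(q):  return f"comparison results for: {q}"
def hybrid_search(q):      return f"hybrid results for: {q}"

TOOLS = {
    "metadata": metadata_search,
    "exact": exact_match_search,
    "compare": compare_documents,
    "hybrid": hybrid_search,
}

def route(question):
    """Stand-in for the LLM router: pick a tool name from the question.
    The real router received tool descriptions and usage instructions."""
    q = question.lower()
    if "compare" in q or " vs " in q:
        return "compare"
    if q.startswith('"') or "exact" in q:
        return "exact"
    if "case no" in q or "law no" in q:
        return "metadata"
    return "hybrid"

def answer(question):
    context = TOOLS[route(question)](question)   # tool returns text fragments
    return f"generated answer from [{context}]"  # generator stand-in
```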

After the fix, the tools were selected correctly. However, there was not enough time to fully test and tune the system.

What worked:

  • The router correctly selected tools based on the question context

  • Filtering by case/law number worked

Problems:

  • Chunking algorithm is unstable: one pattern → different behavior on different pages

  • There was "garbage" from small chunks glued without a system

  • Not enough time for full testing and debugging

5. Conclusions and Results

Our team did not achieve outstanding results. However, the results of our experiments and a description of the process may still be of interest.

Solution 1 (simple)

At the warmup stage

  • Received the highest (of our variants) score for the accuracy of deterministic answers, where an exact value is specified.

  • Grounding was quite low. Most likely, the answers were simply taken from the metadata. When this was fixed, grounding increased, but accuracy decreased.

At the final stage

  • Achieved accuracy of 0.79 with grounding 0.63

Solution 2 (complex)

At the preliminary stage

  • Received the lowest accuracy score of all our attempts.

  • Speed was significantly lower due to 2 LLM calls.

  • The low accuracy was probably due to "rookie" mistakes (for example, we implemented the search tool functions but did not pass the tools schema to the agent, so even in simple cases hybrid search was performed).

At the final stage

  • Did not manage to submit the solution (API errors occurred at the last moment, then the submission closed).

  • By that time, many errors had been fixed. Subjectively, the search started working much better.

| Solution | Stage | Accuracy Det | Grounding | Speed | Comment |
|---|---|---|---|---|---|
| Solution 1 | Warmup | 0.9 → 0.81* | 0.5 → 0.58* | 🟡 Medium | *After grounding fix, accuracy dropped slightly |
| Solution 1 | Final | 0.79 | 0.63 | 🟡 Medium | Stable, but not perfect |
| Solution 2 | Warmup | 0.74 | 0.6 | 🔴 Low | "Rookie" mistakes: tools not passed to agent |
| Solution 2 | Final | — | — | — | Not submitted (API errors + deadline) |

We hope that, thanks to this competition, we improved our skills in vector search. In the future, we will try to take all of these mistakes into account.

I would be glad to discuss these and other RAG approaches in the comments.

P.S. According to the competition rules, the dataset and participants' code are closed. I cannot provide links.

P.P.S. Obviously, under such tight deadlines, it is almost impossible to build a system without coding agents. However, they should be used with caution. You can trust an agent to write a wrapper for a database, or some API where the input and output formats are clearly defined. When building complex algorithms or pipelines, even a clear task statement does not always lead to the desired result. To achieve the goal, agents are ready to resort to any trick: replacing real methods with stubs, substituting "magic numbers" for real data, and even "optimizing" the task itself, for example by using regexes tailored to specific patterns instead of an LLM. Meanwhile, all tests pass perfectly, and such cheating can be extremely difficult to detect.