PageIndex: замена векторному поиску в RAG? / Хабр

Попытки заменить чем‑то векторный поиск в RAG продолжаются. Про GraphRAG я уже высказывался, новый претендент на замену — PageIndex.

Идея простая. Сегментируем документ на страницы, при помощи LLM и хитрого кода строим для него расширенную таблицу содержания, TOC в виде дерева узлов и саммари для каждого узла. Далее отправляем эту структуру в промпт поискового запроса и просим LLM найти релевантные узлы. За каждым найденным узлом закреплены страницы документа. Эти страницы достаём и используем в качестве контекста в финальном запросе.

Нет чанков, не нужны эмбеддинги и векторные хранилища. Выглядит заманчиво. Попытаюсь добавить к этой идее немного критики и заодно расскажу как эту штуку запустить локально.

Плюсы‑минусы

Посмотрим на иллюстрацию из readme:

Принцип работы Pageindex. Иллюстрация из официального репозитария.

Что не так, догадались? Правильно — документ всего один. Один! А что делать, если их сто, а если десятки тысяч? Авторами предлагаются следующие стратегии:

Поиск по метаданным. Метаданные — сильный аргумент, но их ещё нужно откуда‑то получить. Такой поиск обещают в ближайшей бета версии. Заметим, что фильтрация по метаданным реализована практически в любом приличном векторном хранилище.
Поиск по сгенерированному описанию документов. Эта опция вроде бы уже работает. Идея хорошая, но подходит для небольшого количества документов.
Семантический поиск. Всё, можно выдохнуть, векторный поиск снова востребован. А у вас есть эмбеддинг с достаточно большим контекстным окном, чтобы можно было любой документ превратить в вектор? В принципе можно искать и по TOC с саммари. Я оценил коэффициент сжатия контекста, получилось, что Pageindex строит расширенную таблицу содержания, которая в четыре‑пять раз меньше, чем исходный документ.

Что к этому можно добавить? Если не нравится возиться с эмбеддингом и векторным хранилищем, есть поиск по ключевым словам, по встречаемости в документах, любимый TFIDF и прочие bm25. Однако индексы всё‑таки лучше где‑то хранить, а не строить на лету, особенно в варианте с нормализацией терминов и стеммингом. Поэтому совсем без хранилища здесь сложно. С другой стороны, а где хранить TOC, если документов больше одного?

Думаю, будет справедливо сказать, что в PageIndex пока нет рабочей стратегии для обработки большой коллекции документов.

LLM уже давно используется в пайплайне RAG. Транскрипция картинок, управление коллекциями, адаптивная обработка документов, оценка качества. Список неполный. PageIndex добавляет в этот список замену поиска по сходству на поиск по релевантности. Когда мы работаем со сложным документом, мы сначала обращаем внимание на таблицу содержания, находим там релевантные разделы и уже потом обращаемся к самому тексту. В PageIndex реализована именно эта схема.

Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search...

В итоге — высокая доля правильных ответов (заявлено 98,7% на FinanceBench) это и правда очень круто. Обратная сторона — ресурсоёмкость. Ризонинг вещь затратная. Допустим, что базовый векторный поиск в RAG, построенный со всеми известными эвристиками и хорошим эмбеддингом, даёт нам долю правильных ответов в районе 90%. Эти дополнительные 8,7% будут стоить в десятки раз дороже базы. Готов поверить, что для каких‑то проектов это неважно. Вопрос приоритетов.

Судя по конфигу в репозитарии, авторы экспериментировали с gpt-4o и gpt-5.4 Ну а я конечно же не удержался и попробовал запустить PageIndex на локальных моделях.

Эксперименты

Текст для экспериментов

По традиции я взял текст в жанре Cyberpunk. Выбрал рассказ Уильяма Гибсона «New Rose Hotel», но с ним Pageindex не справился, видимо потому что это повествование и в нём нет ни разделов, ни заголовков. Затем я решил попробовать академический текст. Это статья «A Cyberpunk 2077 perspective on the prediction and understanding of future technology» Miguel Bordallo López, Constantino Álvarez Casado, University of Oulu.

Варианты пайплайна

PageIndex понимает markdown и pdf. И это два разных паплайна. В случае с markdown текст разбивается на строки, фиксируются номера строк в которых содержится заголовок раздела. Текст сегментируется на фрагменты от строки с заголовком до следующей строки с заголовком. В принципе, всё работает и довольно быстро. Но мне кажется это не совсем честный подход. Точно также работает MarkdownTextSplitter из langchain. Предположу, что качество ответов будет примерно таким же как и у обычного векторного ретривера. А если заголовки в markdown не маркированы, структура текста будет состоять только из одного узла — названия файла документа.

С pdf всё сложнее, PageIndex умеет выделять заголовки, которые никак не помечены в тексте. Правда, не всегда и с кучей проблем на слабых моделях.

Проблема I. Пайплайн Pageindex падает из‑за ошибок вывода в JSON, поэтому текст статьи пришлось обработать. Я отрезал список литературы, убрал все кавычки из текста и преобразовал нумерованные списки в маркированные. Ещё можно попробовать подобрать модель.

Проблема II. Пайплайн Pageindex падает с исключением при недостаточной доле правильных ответов при построении структуры текста. Это accuracy, внутренний параметр PageIndex, в коде установлено пороговое значение в 60%.

В случае с pdf PageIndex разбивает текст на постраничные фрагменты, каждый фрагмент маркируется двумя индексами — номером первой страницы фрагмента и номером какой‑то следующей страницы. Ведь раздел может быть на несколько страниц. Accuracy же показывает сколько найденных узлов/заголовков действительно находятся в указанных фрагментах. Можно попробовать снизить этот параметр, там всё равно дальше по пайплайну коррекция ошибок. Сам порог задан в функции meta_processor модуля page_index.py. Я его вытащил в переменную окружения.

Проблема III. Почему собственно получается низкая доля правильных ответов? Здесь пришлось сильно углубиться в код. Причина оказалась в том, что связку узлов и фрагментов текста выполняет LLM. Если в слабую модель «запихнуть» весь контекст с фрагментами, она сбивается в нумерации. Для такого случая в коде заботливо предусмотрели группировку фрагментов на чуть менее крупные куски. Только вот работа этой группировки зависит от параметра max_tokens в функции page_list_to_group_text. А он по умолчанию установлен в 20 тысяч токенов и его можно поменять только в коде. Соответственно моему тексту в 17k токенов соответствовала всего одна группа текстов.

Модели

В начале пайплайна LLM пытается найти перечень разделов, выглядит это примерно так:

Parsing PDF...
start find_toc_pages
response {
    "thinking": "The text contains headings such as Highlights, Abstract, Keywords, and a single numbered heading \"1. Introduction\", but it does not present a list of sections or chapters that would constitute a table of contents. Therefore, there is no table of contents in the given text.",
    "toc_detected": "no"
}

gemma3:27b буквально на второй странице нашла таблицу содержания, которой в документе нет. Пришлось остановить.

gpt‑oss:20b на мой взгляд с задачей справилась, accuracy варьируется от 57 до 63%. Seed не помогает стабилизировать. Все примеры для этой статьи сделаны на gpt‑oss:20b.

qwen3:14b превзошла gpt‑oss:20b с accuracy в 69.57%, но пару раз «вылетала» из‑за json. И структура разделов у неё получается бедноватой.

Процесс на компьютере с GPU 16GB, не самом новом, занимает около получаса, несколько минут на пайплайн, остальное на коррекцию ошибок.

Как запустить

Внимание! Pageindex использует пакет liteLLM, в requirements указана версия 1.82.0. Убедитесь, что установлена именно эта версия. Старшие версии 1.82.7 и 1.82.8 скомпрометированы, ссылка

Не устанавливайте пакет pageindex через pip. Это версия с внешним АПИ, она не нужна для локальной работы.

Я подготовил два ноутбука: MinimalPageindexLocalPDF.ipynb и MinimalPageindexLocalMD.ipynb, они в корне репозитария. Часть кода позаимствовал из pageindex_RAG_simple.ipynb, это пример с внешним АПИ.

В модуль utils.py добавил переменные окружения OLLAMA_HOST и OLLAMA_TIMEOUT, они теперь используются в функциях llm_completion и llm_acompletion. В модуль page_index.py добавил возможность менять порог для accuracy через переменную окружения ACCURACY_THRESHOLD и максимальное количество токенов для группировки фрагментов текста — MAX_TOKENS. Собственно это всё, что нужно для запуска на локальной модели.

Вот такой TOC получился (accuracy 0.63)

{
  "doc_name": "2077.pdf",
  "structure": [
    {
      "title": "Introduction",
      "node_id": "0000",
      "start_index": 1,
      "end_index": 3,
      "summary": "The paper examines how the video game Cyberpunk 20...",
      "text": "A Cyberpunk 2077 perspective on the prediction and..."
    },
    {
      "title": "Literature review",
      "node_id": "0001",
      "start_index": 3,
      "end_index": 3,
      "summary": "The partial document presents a literature review ...",
      "text": "contextualizing Cyberpunk 2077 within the broader ...",
      "nodes": [
        {
          "title": "Intersection of science fiction and technological ...",
          "node_id": "0002",
          "start_index": 3,
          "end_index": 4,
          "summary": "The partial document surveys how science fiction, ...",
          "text": "contextualizing Cyberpunk 2077 within the broader ..."
        },
        {
          "title": "Science fiction and technological advancements",
          "node_id": "0003",
          "start_index": 4,
          "end_index": 5,
          "summary": "The partial document surveys how speculative media...",
          "text": "Incorporating decolonial perspectives into foresig..."
        },
        {
          "title": "Video games as a medium for future technologies",
          "node_id": "0004",
          "start_index": 5,
          "end_index": 5,
          "summary": "The partial document explores how science‑fiction ...",
          "text": "the authentication of information [45], [46]. Furt..."
        },
        {
          "title": "The unique contributions and predictive potential ...",
          "node_id": "0005",
          "start_index": 5,
          "end_index": 6,
          "summary": "The partial document examines how science‑fiction ...",
          "text": "the authentication of information [45], [46]. Furt..."
        },
        {
          "title": "Technologies in Cyberpunk 2077",
          "node_id": "0006",
          "start_index": 6,
          "end_index": 7,
          "summary": "The partial document examines Cyberpunk 2077 as a ...",
          "text": "with contemporary concerns. This cultural relevanc..."
        }
      ]
    },
    {
      "title": "Methodology",
      "node_id": "0007",
      "start_index": 7,
      "end_index": 7,
      "summary": "The excerpt outlines a research study that examine...",
      "text": "The specific technologies presented in Cyberpunk 2...",
      "nodes": [
        {
          "title": "Data collection",
          "node_id": "0008",
          "start_index": 7,
          "end_index": 8,
          "summary": "The document is a research article that examines t...",
          "text": "The specific technologies presented in Cyberpunk 2..."
        },
        {
          "title": "Thematic analysis",
          "node_id": "0009",
          "start_index": 8,
          "end_index": 8,
          "summary": "The excerpt outlines a research methodology for an...",
          "text": "envisioned and incorporated into the game. They pr..."
        },
        {
          "title": "Comparison with current technologies",
          "node_id": "0010",
          "start_index": 8,
          "end_index": 8,
          "summary": "The excerpt outlines a research methodology for an...",
          "text": "envisioned and incorporated into the game. They pr..."
        }
      ]
    },
    {
      "title": "Themes and technologies",
      "node_id": "0011",
      "start_index": 8,
      "end_index": 9,
      "summary": "The excerpt outlines a research study that analyze...",
      "text": "envisioned and incorporated into the game. They pr...",
      "nodes": [
        {
          "title": "Overarching themes",
          "node_id": "0012",
          "start_index": 9,
          "end_index": 9,
          "summary": "The partial document analyzes Cyberpunk 2077’s spe...",
          "text": "visionary future. Table 1, shows insights into the..."
        },
        {
          "title": "Human augmentation and cybernetic enhancements",
          "node_id": "0013",
          "start_index": 9,
          "end_index": 11,
          "summary": "The excerpt analyzes how the video game *Cyberpunk...",
          "text": "visionary future. Table 1, shows insights into the..."
        },
        {
          "title": "Brain–computer interfaces and simulated reality",
          "node_id": "0014",
          "start_index": 11,
          "end_index": 12,
          "summary": "The excerpt surveys the technological and ethical ...",
          "text": "Genetic Modifications  in Cyberpunk 2077 entail th..."
        },
        {
          "title": "Digital representation and information access",
          "node_id": "0015",
          "start_index": 12,
          "end_index": 13,
          "summary": "The excerpt surveys how Cyberpunk 2077’s speculati...",
          "text": "between the virtual and physical worlds. Haptics a..."
        },
        {
          "title": "Smart environments and personalization",
          "node_id": "0016",
          "start_index": 13,
          "end_index": 14,
          "summary": "The partial document surveys how the cyberpunk set...",
          "text": "Digital representation of the world as depicted in...",
          "nodes": [
            {
              "title": "Smart Mirrors and Fashion Applications",
              "node_id": "0017",
              "start_index": 14,
              "end_index": 14,
              "summary": "The excerpt discusses how Cyberpunk 2077’s persona...",
              "text": "Personalized smart environments in Cyberpunk 2077,..."
            },
            {
              "title": "Smart Appliances and Home Automation",
              "node_id": "0018",
              "start_index": 14,
              "end_index": 15,
              "summary": "The excerpt surveys how Cyberpunk 2077’s vision of...",
              "text": "Personalized smart environments in Cyberpunk 2077,..."
            }
          ]
        },
        {
          "title": "Autonomous vehicles and transportation",
          "node_id": "0019",
          "start_index": 15,
          "end_index": 15,
          "summary": "The excerpt outlines future research directions fo...",
          "text": "Improvements of self-driving models to handle out-...",
          "nodes": [
            {
              "title": "Autonomous Ground Vehicles",
              "node_id": "0020",
              "start_index": 15,
              "end_index": 15,
              "summary": "The excerpt outlines future research directions fo...",
              "text": "Improvements of self-driving models to handle out-..."
            },
            {
              "title": "Autonomous Flying Vehicles",
              "node_id": "0021",
              "start_index": 15,
              "end_index": 15,
              "summary": "The excerpt outlines future research priorities fo...",
              "text": "Improvements of self-driving models to handle out-..."
            },
            {
              "title": "Autonomous Delivery Robots",
              "node_id": "0022",
              "start_index": 15,
              "end_index": 15,
              "summary": "The excerpt outlines future research directions fo...",
              "text": "Improvements of self-driving models to handle out-..."
            }
          ]
        },
        {
          "title": "Advanced artificial intelligence and smart assista...",
          "node_id": "0023",
          "start_index": 15,
          "end_index": 15,
          "summary": "The excerpt outlines future research directions fo...",
          "text": "Improvements of self-driving models to handle out-...",
          "nodes": [
            {
              "title": "Self-aware AI systems",
              "node_id": "0024",
              "start_index": 15,
              "end_index": 16,
              "summary": "The partial document examines how Cyberpunk 2077’s...",
              "text": "Improvements of self-driving models to handle out-..."
            },
            {
              "title": "AI-based Personal Assistants",
              "node_id": "0025",
              "start_index": 16,
              "end_index": 16,
              "summary": "The partial document examines how Cyberpunk 2077 p...",
              "text": "portrayal of AI agents in Cyberpunk 2077 reflects ..."
            }
          ]
        }
      ]
    },
    {
      "title": "Discussion",
      "node_id": "0026",
      "start_index": 16,
      "end_index": 19,
      "summary": "The excerpt analyzes Cyberpunk 2077 as a speculati...",
      "text": "portrayal of AI agents in Cyberpunk 2077 reflects ..."
    },
    {
      "title": "Conclusion",
      "node_id": "0027",
      "start_index": 19,
      "end_index": 19,
      "summary": "The excerpt examines how Cyberpunk 2077 uses its n...",
      "text": "deeply integrated into its storytelling and gamepl..."
    },
    {
      "title": "CRediT authorship contribution statement",
      "node_id": "0028",
      "start_index": 19,
      "end_index": 19,
      "summary": "The excerpt examines how Cyberpunk 2077 uses its n...",
      "text": "deeply integrated into its storytelling and gamepl..."
    },
    {
      "title": "Declaration of Generative AI and AI-assisted techn...",
      "node_id": "0029",
      "start_index": 19,
      "end_index": 19,
      "summary": "The excerpt examines how Cyberpunk 2077 uses its n...",
      "text": "deeply integrated into its storytelling and gamepl..."
    }
  ]
}

Тестовый вопрос: "What model was used in preparing this article and why?"

Найденные узлы и рефлексия:

{
    "thinking": "The question asks which model was used to prepare the article and why. The conclusion section (node 0027) explicitly states that GPT‑4 was used for drafting and formatting. The same statement appears in the CRediT authorship contribution statement (node 0028) and the declaration of generative AI (node 0029). These nodes directly answer the question.",
    "node_list": ["0027", "0028", "0029"]
}

Сгенерированный ответ:

**Model used:**
The authors explicitly state that they employed **OpenAI’s GPT‑4** during the preparation of the
manuscript.

**Reason for using GPT‑4:**
- **Readability enhancement:** GPT‑4 was used to polish the prose, making the article clearer and
more accessible to readers.
- **Technical formatting assistance:** The model helped generate correct LaTeX code for references,
tables, and subsections, ensuring that the document complied with the required formatting standards.
- **Efficiency in drafting:** By automating routine writing tasks, GPT‑4 allowed the authors to
focus on higher‑level analysis and synthesis while still maintaining full editorial control over the
final content.

After the model’s output was incorporated, the authors reviewed, edited, and took full
responsibility for the final text, confirming that GPT‑4 served as an assistive tool rather than a
primary author.

Ещё один вопрос: "What does a dystopian future look like in this game?"

Найденные узлы и рефлексия

{
    "thinking": "The question asks for a description of the dystopian future depicted in the game. Relevant sections discuss corporate dominance, inequality, surveillance, and the societal impact of advanced technologies. Nodes 0005 and 0006 explicitly describe the game's dystopian setting and its themes. The Discussion (0026) and Conclusion (0027) sections elaborate on the dystopian aspects such as uneven tech adoption, corporate power, and social inequalities. Additionally, node 0013 (Human augmentation and cybernetic enhancements) touches on dystopian elements like cyberpsychosis and corporate exploitation. These nodes collectively provide the answer.",
    "node_list": ["0005", "0006", "0013", "0026", "0027"]
}

Сгенерированный ответ

**Dystopian Future in *Cyberpunk 2077* (as described in the context)**

The game portrays a future that is far from a single, utopian vision of technology. Instead, it
presents a **fragmented, unevenly‑adopted technological landscape** that amplifies social, cultural,
and ethical tensions. Key elements of this dystopia include:

| Feature | Description | How it contributes to the dystopia |
|---------|-------------|------------------------------------|
| **Uneven tech uptake** | Different districts of Night City adopt and adapt technology in vastly
different ways. | Highlights that progress is not universal; some areas thrive on high‑tech, others
lag, creating stark inequalities. |
| **Corporate dominance & surveillance** | Advanced biometric authentication and pervasive
monitoring are integrated into everyday life. | Corporations wield power over individuals, eroding
privacy and autonomy. |
| **Customization vs manipulation** | Players can tailor their character’s body and skills, but this
is framed within a system that can also manipulate and exploit those choices. | The promise of
personalization is countered by the risk of being used as a tool for control. |
| **Loss of autonomy** | The game’s mechanics and narrative emphasize how technology can strip
individuals of agency, even as it offers new capabilities. | Reflects a society where tech is both
empowering and oppressive. |
| **Cultural fragmentation** | Each district has a distinct cultural identity that shapes its
technological adaptation. | Breaks the “one‑size‑fits‑all” model, showing how tech can reinforce
existing divides rather than bridge them. |
| **Ethical and power dynamics** | The narrative critiques the assumption that technology inherently
improves society, instead showing its potential to exacerbate power imbalances. | Forces players to
confront the moral implications of tech adoption. |

In short, the dystopian future of *Cyberpunk 2077* is a neon‑lit, corporately‑controlled metropolis
where technology is both a tool for personal enhancement and a mechanism of surveillance and
manipulation. The uneven distribution of tech across culturally distinct districts underscores deep
social inequalities, while the constant tension between customization and loss of autonomy paints a
world where progress comes at a high ethical cost.

Инструкция по установке и запуску

1. Клонируйте репозитарий с моим форком
git clone https://github.com/khmelkoff/PageIndex\_local.git

2. Создайте и активируйте python окружение. Для conda:
conda create --name pageindex python==3.11
conda activate pageindex

3. Перейдите в папку Pageindex_local и установите необходимые пакеты
pip install -r requirements.txt

4. Проверьте версию пакета litellm, должна быть 1.82.0
pip show litellm

5. Установите jupyter notebook
pip install notebook

6. Запустите jupyter notebook и откройте файл MinimalPageindexLocalPDF.ipynb

7. Загрузите модель в ollama
ollama pull gpt-oss:20b

8. Остановите ollama. Установите переменные окружения и запустите снова - ollama serve
set OLLAMA_HOST=0.0.0.0:11434
set OLLAMA_CONTEXT_LENGTH=32000

9. В ноутбуке проверьте установку переменных окружения, в частности путь к ollama, и название модели в конфиге. Для litellm 1.82.0 префикс openai обязателен.

10. Текст для экспериментов находится в папке data, это 2077.pdf, там же есть вариант в маркдаун 2077.md и сохранены примеры сгенерированных TOC в формате pickle.

Вместо выводов

Код пайплайна неидеален и работает нестабильно, функций много, они похожи друг на друга и почти не документированы. Промпты прямо в коде, редактировать сложно. Для MVP нормально, для коммерческих целей придётся серьезно дорабатывать.

Думаю, что хороший результат может быть получен в гибридном варианте. Например, расширенный TOC в векторном поиске или векторный поиск в стратегии выбора документов. При этом, подход не является альтернативой для векторного RAG, уж точно не по эффективности. Ответов в реальном времени не будет, а пайплайн обработки документов останется всегда на порядок дороже векторного.