LLM Judge for validating educational content: a cross-model evaluation architecture on a $0.014-per-course budget / Habr

How we solved the "stochastic divergence" problem in lesson generation and cut validation costs 17,000-fold compared with manual review

Context: who is writing and what this article covers

Igor Maslennikov. In IT since 2013. For the past two years I have been building the AI Dev Team at DNA IT, a unit that runs on a multi-model architecture. We generate educational courses for clients on a budget of $0.50 per course (10-30 lessons).

This article is for those who:

Build AI systems for content generation and keep running into quality problems
Want to understand how to use LLMs to evaluate other LLMs without an "echo chamber" effect
Are looking for concrete hallucination-detection algorithms that do not need expensive RAG context
Are interested in cost engineering for AI pipelines

What's inside: the cross-model validation architecture, the CLEV algorithm for consensus voting, entropy-based hallucination detection, translating the OSCQR educational rubric into machine-readable prompts, and a circuit breaker for iterative fix loops.

Problem: why validation at the specification stage is not enough

When we built our course-generation pipeline on a Hybrid Map-Reduce-Refine architecture, the first question was: "Is it enough to validate the lesson specification (Stage 5), or do we need separate validation of the generated content (Stage 6)?"

The hypothesis was simple: if the specification is correct (Learning Objectives validated against Bloom's Taxonomy, course structure checked), the content will be good too.

The hypothesis turned out to be false.

Stochastic divergence

An LLM is a probabilistic machine. Even at temperature=0.0 the model navigates a latent space that may contain factual errors inherited from its pre-training data.

An example from our practice:

Specification: Lesson on Newtonian mechanics
               Hook strategy: Historical Analogy
               Depth: Beginner/5th Grade

Stage 5 validation: PASSED (structure is correct)

Generated content: "...Isaac Newton discovered the law of gravity
after a watermelon fell on his head..."

Stage 6 validation: FAILED (Faithfulness Hallucination)

The specification was flawless. The execution was not. This is a Faithfulness Hallucination: the model departed from world knowledge despite correct instructions.

Pedagogical drift

The second problem is Pedagogical Drift. Educational content needs calibrated difficulty. The specification may say Depth: Beginner/5th Grade, but a model trained on an internet-scale corpus tends to "drift" toward average difficulty (roughly the level of a Wikipedia article).

// Typical pattern of pedagogical drift
interface PedagogicalDrift {
  introduction: {
    fleschKincaid: 5.2,  // Matches the specification
    tone: 'engaging',
  };
  body: {
    fleschKincaid: 8.7,  // Drift toward average difficulty
    tone: 'academic',    // Loss of engagement
  };
  conclusion: {
    fleschKincaid: 9.1,  // Even further from the target
    tone: 'dry',
  };
}

Stage 5 cannot detect this: the drift happens dynamically, during token generation.
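A minimal sketch of how a Stage 6 check might catch this kind of drift once per-section readability has been measured. The type names, the function name and the one-grade tolerance are assumptions for illustration; only the three grade levels come from the example above.

```typescript
// Hypothetical types and names; only the numbers come from the drift example.
interface SectionReadability {
  section: string;
  fleschKincaid: number; // measured grade level of the section
}

interface DriftFinding {
  section: string;
  deviation: number; // grades above (+) or below (-) the target
}

// Return the sections whose grade level is further from the target than `tolerance`.
function findDriftedSections(
  sections: SectionReadability[],
  targetGrade: number,
  tolerance: number = 1.0
): DriftFinding[] {
  return sections
    .map(s => ({ section: s.section, deviation: s.fleschKincaid - targetGrade }))
    .filter(f => Math.abs(f.deviation) > tolerance);
}

const measuredLesson: SectionReadability[] = [
  { section: 'introduction', fleschKincaid: 5.2 },
  { section: 'body', fleschKincaid: 8.7 },
  { section: 'conclusion', fleschKincaid: 9.1 },
];

// Target: 5th grade. The body and the conclusion exceed the tolerance.
const drifted = findDriftedSections(measuredLesson, 5);
console.log(drifted.map(f => f.section));
```

The check is deliberately per section: an average over the whole lesson would hide an on-target introduction followed by a much harder body.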

Lost in the Middle

With large contexts (RAG context + specification + previous sections), models suffer from the "Lost in the Middle" phenomenon: information in the middle of the context gets ignored. This leads to:

Critical requirements from the specification being ignored
Inconsistencies between lesson sections
Loss of terminological consistency

Conclusion: Stage 6 validation is mandatory. The question is how to implement it architecturally on a budget of $0.05 per lesson.

Architecture: cross-model evaluation

Self-Preference Bias: why a model should not judge itself

A key finding from the research: LLMs show a statistically significant preference for text generated by models of their own family.

Quantitative data:

GPT-4 judging GPT-4: +10% win rate for its own outputs
Claude judging Claude: +25% win rate (the strongest bias)
GPT-3.5: minimal self-preference (the exception)

Root cause: perplexity-based familiarity. Models prefer outputs with low perplexity (more familiar patterns), regardless of actual quality.

// Demonstration of self-preference bias
interface SelfPreferenceBias {
  // Qwen3-235B generates, Qwen3-235B judges
  sameFamily: {
    averageScore: 8.7,    // Artificially inflated
    passRate: 0.92,       // Many false positives
    hallucinations: 0.15, // Missed hallucinations
  };

  // Qwen3-235B generates, DeepSeek Terminus judges
  crossFamily: {
    averageScore: 7.9,    // Realistic score
    passRate: 0.78,       // Adequate threshold
    hallucinations: 0.04, // Problems detected
  };
}

Architectural decision: the Generator and the Judge must come from different model families.
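One way to enforce this rule in code is to filter judge candidates by family before any evaluation call is made. The sketch below is an illustration only: the model ids, the family labels and the function name are assumptions, not our production registry.

```typescript
// Hypothetical family map: ids and groupings are assumptions for illustration.
const MODEL_FAMILY: Record<string, string> = {
  'qwen3-235b': 'qwen',
  'deepseek-terminus': 'deepseek',
  'gemini-flash': 'gemini',
  'gpt-4o-mini': 'openai',
  'claude-haiku': 'anthropic',
};

// Keep only judges whose family differs from the generator's family.
function crossFamilyJudges(generator: string, candidates: string[]): string[] {
  const generatorFamily = MODEL_FAMILY[generator];
  if (generatorFamily === undefined) {
    throw new Error(`Unknown generator model: ${generator}`);
  }
  return candidates.filter(judge => {
    const family = MODEL_FAMILY[judge];
    // Unknown judges are rejected rather than silently trusted.
    return family !== undefined && family !== generatorFamily;
  });
}

const eligibleJudges = crossFamilyJudges('qwen3-235b', [
  'qwen3-235b',
  'deepseek-terminus',
  'gemini-flash',
]);
console.log(eligibleJudges);
```

Rejecting unknown models is a conservative choice: a judge that is missing from the map could belong to the generator's family without anyone noticing.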

Choosing a Judge model for the budget

For our budget ($0.05 per lesson):

Model	Input/1M	Output/1M	Cost/lesson (3x voting)	MMLU
Gemini 1.5 Flash	$0.075	$0.30	$0.00195	78%
GPT-4o-mini Batch	$0.075	$0.30	$0.00195	82%
Claude Haiku 3	$0.25	$1.25	$0.00675	75%

Choice: Gemini Flash (primary) + GPT-4o-mini (secondary) + Claude Haiku (tiebreaker).
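The per-call cost behind such a table follows directly from the per-million prices. The sketch below shows the arithmetic; the prices are the ones listed above, while the token counts per judge call are assumed values chosen for illustration, so the results are not meant to reproduce the table's cost column.

```typescript
interface Pricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Prices copied from the table above.
const GEMINI_FLASH: Pricing = { inputPerMillion: 0.075, outputPerMillion: 0.30 };
const CLAUDE_HAIKU: Pricing = { inputPerMillion: 0.25, outputPerMillion: 1.25 };

// Cost of one judge call in USD.
function callCostUsd(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerMillion + outputTokens * p.outputPerMillion) / 1e6;
}

// Assumed token counts for a single judge call (lesson + rubric in, JSON verdict out).
const ASSUMED_INPUT_TOKENS = 4000;
const ASSUMED_OUTPUT_TOKENS = 500;

const flashCost = callCostUsd(GEMINI_FLASH, ASSUMED_INPUT_TOKENS, ASSUMED_OUTPUT_TOKENS);
const haikuCost = callCostUsd(CLAUDE_HAIKU, ASSUMED_INPUT_TOKENS, ASSUMED_OUTPUT_TOKENS);
console.log(flashCost.toFixed(6), haikuCost.toFixed(6));
```

With these assumed sizes the input side dominates the bill, which is one reason to keep long RAG context out of the judge prompt unless it is actually needed.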

Temperature: 0.1, not 0.0

Research shows a non-obvious result:

Temperature	Self-consistency	Human alignment	Score distribution
0.0	98-99%	78-80%	Depressed (scores skew low)
0.1	95-97%	80-82%	Balanced
0.3+	70-85%	75-80%	High variance

T=0.1 is the best balance between consistency and score calibration.
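To compare temperatures on your own data, self-consistency has to be measured somehow. The sketch below uses one possible definition, which is an assumption on our part: the share of repeated judge runs on the same lesson that land in the most common score band. The banding function and the ten scores are hypothetical.

```typescript
// Share of repeated judge runs that land in the most common score band.
function selfConsistency(scores: number[], band: (s: number) => string): number {
  if (scores.length === 0) throw new Error('need at least one score');
  const counts = new Map<string, number>();
  for (const s of scores) {
    const key = band(s);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best = 0;
  counts.forEach(c => { if (c > best) best = c; });
  return best / scores.length;
}

// Hypothetical banding: pass at 0.75 and above.
const passFail = (s: number): string => (s >= 0.75 ? 'pass' : 'fail');

// Ten hypothetical repeated scores for the same lesson at one temperature.
const repeatedRuns = [0.81, 0.79, 0.80, 0.82, 0.74, 0.80, 0.78, 0.81, 0.80, 0.79];

// Nine of the ten runs agree on 'pass'.
const consistency = selfConsistency(repeatedRuns, passFail);
console.log(consistency);
```

Running the same measurement at several temperatures against a small human-labelled set gives both columns of a table like the one above for your own judge and rubric.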

CLEV: Consensus via Lightweight Efficient Voting

The problem with 3x voting

Running 3x voting on every lesson is a brute-force solution. In 80% of cases a lesson is either clearly good or clearly bad. Spending three API calls to confirm the obvious is wasteful.

The CLEV algorithm

Idea: start with 2 judges. The 3rd is called only on disagreement.

// src/evaluation/clev.ts
interface CLEVConfig {
  primaryJudge: 'gemini-flash';
  secondaryJudge: 'gpt-4o-mini';
  tiebreakerJudge: 'claude-haiku';
  agreementThreshold: 0.15; // Maximum score difference that still counts as agreement
  temperature: 0.1;
}

interface JudgeResult {
  model: string;           // Judge model id, used for weighting
  score: number;           // 0.0-1.0
  confidence: 'high' | 'medium' | 'low';
  reasoning: string;
  criteriaScores: Record<string, number>;
  issues: Issue[];
}

// Return shape of clevEvaluate (type name assumed)
interface CLEVResult {
  finalScore: number;
  verdict: 'PASS' | 'FAIL' | 'NEEDS_REVISION';
  confidence: 'high' | 'medium';
  votesUsed: 2 | 3;
  cost: number;
  judges: JudgeResult[];
}

async function clevEvaluate(
  lesson: LessonContent,
  spec: LessonSpecification,
  config: CLEVConfig
): Promise<CLEVResult> {
  // Stage 1: two judge calls in parallel
  const [judge1Result, judge2Result] = await Promise.all([
    evaluateWithModel(config.primaryJudge, lesson, spec, config.temperature),
    evaluateWithModel(config.secondaryJudge, lesson, spec, config.temperature),
  ]);

  // Check for agreement
  const scoreDiff = Math.abs(judge1Result.score - judge2Result.score);
  const categoricalMatch =
    getCategory(judge1Result.score) === getCategory(judge2Result.score);

  // Case 1: agreement (70-85% of cases)
  if (scoreDiff <= config.agreementThreshold && categoricalMatch) {
    const finalScore = weightedAverage(judge1Result, judge2Result);
    return {
      finalScore,
      verdict: getVerdict(finalScore), // Verdict follows the combined score
      confidence: 'high',
      votesUsed: 2,
      cost: calculateCost(2),
      judges: [judge1Result, judge2Result],
    };
  }

  // Case 2: disagreement, call the tiebreaker (15-30% of cases)
  const judge3Result = await evaluateWithModel(
    config.tiebreakerJudge,
    lesson,
    spec,
    config.temperature
  );

  const allJudges = [judge1Result, judge2Result, judge3Result];
  const finalScore = majorityVote(allJudges);

  return {
    finalScore,
    verdict: getVerdict(finalScore),
    confidence: 'medium',
    votesUsed: 3,
    cost: calculateCost(3),
    judges: allJudges,
  };
}

// Weighted average that accounts for each judge's historical accuracy
function weightedAverage(j1: JudgeResult, j2: JudgeResult): number {
  const weights: Record<string, number> = {
    'gemini-flash': 0.70,
    'gpt-4o-mini': 0.75,
    'claude-haiku': 0.72,
  };

  // Unknown models fall back to a neutral weight
  const w1 = weights[j1.model] ?? 0.5;
  const w2 = weights[j2.model] ?? 0.5;

  return (j1.score * w1 + j2.score * w2) / (w1 + w2);
}

// Score categorization
function getCategory(score: number): 'excellent' | 'good' | 'fair' | 'poor' {
  if (score >= 0.90) return 'excellent';
  if (score >= 0.75) return 'good';
  if (score >= 0.60) return 'fair';
  return 'poor';
}

// Majority vote for 3 judges
function majorityVote(judges: JudgeResult[]): number {
  const categories = judges.map(j => getCategory(j.score));
  const counts = categories.reduce((acc, cat) => {
    acc[cat] = (acc[cat] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  // If a category has 2+ votes, use it
  const majorityCategory = Object.entries(counts)
    .find(([_, count]) => count >= 2)?.[0];

  if (majorityCategory) {
    const majorityJudges = judges.filter(
      j => getCategory(j.score) === majorityCategory
    );
    return majorityJudges.reduce((sum, j) => sum + j.score, 0)
           / majorityJudges.length;
  }

  // No majority: take the median
  const sorted = judges.map(j => j.score).sort((a, b) => a - b);
  return sorted[1]; // Median of 3
}

Savings from CLEV

Approach	Cost/lesson	Per 20 lessons	At 100 courses/month
3x voting always	$0.00585	$0.117	$11.70
CLEV	$0.00234	$0.047	$4.68
Savings	60%	60%	$7.02/month

CLEV cuts costs by 60% while retaining 85% of validation quality.
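As a sanity check on the call volume, the expected number of judge calls per lesson can be derived from the disagreement rate alone. The sketch below assumes every judge call costs the same, which is a simplification: it counts calls only, whereas the dollar figures in the table also depend on per-judge pricing and token counts that are not itemized here, so the two need not coincide.

```typescript
// Expected judge calls per lesson: two always, plus a tiebreaker
// with probability pDisagree.
function expectedCalls(pDisagree: number): number {
  if (pDisagree < 0 || pDisagree > 1) {
    throw new RangeError('pDisagree must be between 0 and 1');
  }
  return 2 + pDisagree;
}

// Fraction of calls saved relative to always running three judges.
function callReduction(pDisagree: number): number {
  return 1 - expectedCalls(pDisagree) / 3;
}

// The disagreement range quoted for the tiebreaker case: 15-30%.
for (const p of [0.15, 0.30]) {
  console.log(
    `disagreement ${p}: ${expectedCalls(p).toFixed(2)} calls, ` +
    `${(callReduction(p) * 100).toFixed(1)}% fewer than 3x voting`
  );
}
```

The useful lever here is the disagreement rate: every point shaved off it (for example, by tightening the rubric so judges diverge less) directly removes tiebreaker calls.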

OSCQR rubric: translating educational standards into prompts

What OSCQR is

OSCQR (Open SUNY Course Quality Review) is an industry standard for assessing the quality of online courses. It comprises 50 standards covering pedagogy, accessibility, and engagement.

Problem: OSCQR is written for human reviewers. LLMs need machine-readable criteria.

Translating standards into prompt criteria

// src/evaluation/oscqr-translation.ts

interface OSCQRCriteria {
  standard: number;
  humanDescription: string;
  llmTranslation: {
    checkFor: string;
    prompt: string;
    scoringLogic: string;
  };
}

const OSCQR_TRANSLATIONS: OSCQRCriteria[] = [
  // Standard 2: Learning Objectives
  {
    standard: 2,
    humanDescription:
      "Learning objectives are measurable and aligned with course goals",
    llmTranslation: {
      checkFor: 'Bloom\'s Taxonomy verb presence and measurability',
      prompt: `
        Extract key concepts taught in this lesson.
        Compare semantically to the Learning Objectives in specification.
        Calculate overlap percentage.
        Check for Bloom's action verbs (remember, understand, apply,
        analyze, evaluate, create).
      `,
      scoringLogic: `
        1.0: All objectives addressed with explicit Bloom's verbs
        0.8: 80%+ objectives addressed
        0.6: 60%+ objectives addressed
        0.4: 40%+ objectives addressed
        0.0: <40% or no measurable outcomes
      `,
    },
  },

  // Standard 19: Instructions Clarity
  {
    standard: 19,
    humanDescription:
      "Instructions make clear how to get started and find components",
    llmTranslation: {
      checkFor: 'Transition signals and explicit next-step instructions',
      prompt: `
        Identify transition signals between Introduction and Body.
        Check: Are instructions for student's next step explicit?
        Look for: "First...", "Next...", "Complete the following..."
      `,
      scoringLogic: `
        1.0: Clear transitions + explicit instructions
        0.7: Transitions present, instructions implicit
        0.4: Weak transitions, no clear instructions
        0.0: No structural guidance
      `,
    },
  },

  // Standard 30: Higher Order Thinking
  {
    standard: 30,
    humanDescription:
      "Course provides activities for higher-order thinking: critical reflection",
    llmTranslation: {
      checkFor: 'Cognitive activators and application prompts',
      prompt: `
        Does lesson include at least one:
        - Open-ended question requiring analysis?
        - Reflective prompt asking for personal application?
        - Problem to solve (not just definition)?
        Count instances of each. Score based on presence and quality.
      `,
      scoringLogic: `
        1.0: 3+ high-quality cognitive activators
        0.8: 2 activators or 1 exceptional
        0.6: 1 basic activator
        0.3: Attempts at activators, poorly executed
        0.0: Pure information delivery, no activation
      `,
    },
  },

  // Standard 31: Real-World Applications
  {
    standard: 31,
    humanDescription:
      "Course provides activities emulating real-world applications",
    llmTranslation: {
      checkFor: 'Analogies, case studies, practical examples',
      prompt: `
        Does lesson employ:
        - Real-world analogy to explain core concept?
        - Case study from industry/practice?
        - Concrete example with specific details (names, numbers, context)?
        Score 0 if explanation is purely abstract.
      `,
      scoringLogic: `
        1.0: Multiple concrete real-world examples
        0.7: At least one strong example/analogy
        0.4: Weak or generic examples
        0.0: Abstract explanations only
      `,
    },
  },

  // Standard 34: Text Accessibility
  {
    standard: 34,
    humanDescription: "Text should be readable at appropriate level",
    llmTranslation: {
      checkFor: 'Flesch-Kincaid compliance with target audience',
      prompt: `
        Estimate Flesch-Kincaid Grade Level of text.
        Compare to target audience from specification.
        Flag if deviation > 1 grade level.
        Check for: unexplained jargon, overly complex sentences.
      `,
      scoringLogic: `
        1.0: Within target grade level
        0.7: +1 grade level deviation
        0.4: +2 grade levels deviation
        0.0: +3 or more grade levels deviation
      `,
    },
  },
];
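Translated criteria still have to be assembled into the text a judge actually sees. The sketch below shows one way to render them; the flattened criterion shape, the function name and the prompt layout are assumptions for illustration, not the production prompt.

```typescript
// Flattened local copy of the criterion shape so the sketch is self-contained.
interface RubricCriterion {
  standard: number;
  checkFor: string;
  prompt: string;
  scoringLogic: string;
}

// Render criteria into one instruction block for the judge.
// The layout and wording are assumptions for illustration.
function buildRubricPrompt(criteria: RubricCriterion[]): string {
  const blocks = criteria.map(c => [
    `### OSCQR Standard ${c.standard}: ${c.checkFor}`,
    // Collapse template-literal indentation into single spaces.
    `Task: ${c.prompt.trim().replace(/\s+/g, ' ')}`,
    // Put the scoring ladder on one line, one rung per semicolon.
    `Scoring: ${c.scoringLogic.trim().replace(/\s*\n\s*/g, '; ')}`,
  ].join('\n'));

  return [
    'Evaluate the lesson against each criterion below.',
    'Return one score between 0.0 and 1.0 per criterion.',
  ].concat(blocks).join('\n\n');
}

const samplePrompt = buildRubricPrompt([{
  standard: 34,
  checkFor: 'Flesch-Kincaid compliance with target audience',
  prompt: `
    Estimate Flesch-Kincaid Grade Level of text.
    Compare to target audience from specification.
  `,
  scoringLogic: `
    1.0: Within target grade level
    0.7: +1 grade level deviation
  `,
}]);
console.log(samplePrompt);
```

Normalizing whitespace matters more than it looks: the indentation inside multi-line template literals otherwise ends up in the prompt and costs input tokens on every call.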

Weighted Hierarchical Rubric

Not all criteria carry equal weight. Factual Integrity matters more than Engagement: a lesson with wrong facts is dangerous, while a boring lesson is merely less effective.

// src/evaluation/weighted-rubric.ts

interface WeightedRubric {
  criterion: string;
  weight: number;
  criticalFailure: boolean; // If true and score < threshold: VETO
  criticalThreshold: number;
  oscqrStandards: number[];
}

const WEIGHTED_RUBRIC: WeightedRubric[] = [
  {
    criterion: 'factual_integrity',
    weight: 0.35,
    criticalFailure: true,
    criticalThreshold: 0.60,
    oscqrStandards: [], // Foundational criterion, not taken from OSCQR
  },
  {
    criterion: 'pedagogical_alignment',
    weight: 0.25,
    criticalFailure: true,
    criticalThreshold: 0.50,
    oscqrStandards: [2, 30],
  },
  {
    criterion: 'clarity_structure',
    weight: 0.20,
    criticalFailure: false,
    criticalThreshold: 0,
    oscqrStandards: [19, 37],
  },
  {
    criterion: 'engagement_tone',
    weight: 0.20,
    criticalFailure: false,
    criticalThreshold: 0,
    oscqrStandards: [31, 34],
  },
];

// Compute the final score, taking VETO into account
function calculateWeightedScore(
  criteriaScores: Record<string, number>
): { score: number; vetoed: boolean; vetoReason?: string } {
  // Check for critical failures (VETO)
  for (const rubric of WEIGHTED_RUBRIC) {
    if (rubric.criticalFailure) {
      // A missing score counts as 0, matching the weighted sum below
      const score = criteriaScores[rubric.criterion] ?? 0;
      if (score < rubric.criticalThreshold) {
        return {
          score: score,
          vetoed: true,
          vetoReason: `${rubric.criterion} below critical threshold: ` +
                      `${score} < ${rubric.criticalThreshold}`,
        };
      }
    }
  }

  // Weighted sum
  const totalWeight = WEIGHTED_RUBRIC.reduce((sum, r) => sum + r.weight, 0);
  const weightedSum = WEIGHTED_RUBRIC.reduce((sum, rubric) => {
    return sum + (criteriaScores[rubric.criterion] || 0) * rubric.weight;
  }, 0);

  return {
    score: weightedSum / totalWeight,
    vetoed: false,
  };
}

JSON Output Schema

// src/evaluation/judge-output-schema.ts

interface JudgeOutput {
  evaluation_id: string;
  overall_score: number;          // 0.0-1.0
  verdict: 'PASS' | 'FAIL' | 'NEEDS_REVISION';
  vetoed: boolean;
  veto_reason?: string;

  dimensions: {
    factual_integrity: DimensionScore;
    pedagogical_alignment: DimensionScore;
    clarity_structure: DimensionScore;
    engagement_tone: DimensionScore;
  };

  issues: Issue[];
  strengths: string[];
  fix_recommendation: string;
}

interface DimensionScore {
  score: number;
  reasoning: string;
  evidence: string[];
}

interface Issue {
  criterion: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
  location: string;        // "section 2, paragraph 3"
  description: string;
  suggested_fix: string;
}

// Example of a real output
const exampleOutput: JudgeOutput = {
  evaluation_id: "eval_lesson_042",
  overall_score: 0.82,
  verdict: "NEEDS_REVISION",
  vetoed: false,

  dimensions: {
    factual_integrity: {
      score: 0.90,
      reasoning: "No hallucinations detected. Claims align with RAG context.",
      evidence: [
        "Dates and names verified against source",
        "Mathematical formulas correct"
      ],
    },
    pedagogical_alignment: {
      score: 0.80,
      reasoning: "Covers 2/3 objectives. Missing 'application' objective.",
      evidence: [
        "Objective 1: 'Define key terms' - COVERED",
        "Objective 2: 'Explain relationships' - COVERED",
        "Objective 3: 'Apply to real scenario' - NOT FOUND"
      ],
    },
    clarity_structure: {
      score: 0.85,
      reasoning: "Good transitions, clear structure.",
      evidence: ["Clear intro-body-conclusion flow"],
    },
    engagement_tone: {
      score: 0.65,
      reasoning: "Tone is academic. Lacks analogies or hook.",
      evidence: [
        "No real-world examples in section 2",
        "Hook in intro is weak"
      ],
    },
  },

  issues: [
    {
      criterion: "engagement_tone",
      severity: "medium",
      location: "introduction, paragraph 1",
      description: "Hook is weak and unrelated to topic",
      suggested_fix: "Rewrite intro with compelling analogy " +
                     "connecting to target audience experience",
    },
    {
      criterion: "pedagogical_alignment",
      severity: "high",
      location: "entire lesson",
      description: "Objective 3 (application) not addressed",
      suggested_fix: "Add section with practical exercise " +
                     "demonstrating real-world application",
    },
  ],

  strengths: [
    "Excellent factual accuracy",
    "Clear logical progression",
    "Appropriate reading level for target audience",
  ],

  fix_recommendation:
    "Add real-world analogy to introduction. " +
    "Create new section 4 with practical exercise for Objective 3.",
};
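The overall score in this example can be reproduced from the rubric weights. The sketch below is a standalone recomputation with the weights and dimension scores copied from above; the variable and function names are ours. The result is about 0.815, consistent with the reported overall_score of 0.82 after rounding.

```typescript
interface WeightedDimension {
  criterion: string;
  weight: number; // from WEIGHTED_RUBRIC
  score: number;  // from the example output
}

const exampleDimensions: WeightedDimension[] = [
  { criterion: 'factual_integrity', weight: 0.35, score: 0.90 },
  { criterion: 'pedagogical_alignment', weight: 0.25, score: 0.80 },
  { criterion: 'clarity_structure', weight: 0.20, score: 0.85 },
  { criterion: 'engagement_tone', weight: 0.20, score: 0.65 },
];

// Weighted mean of the dimension scores.
function weightedMean(dims: WeightedDimension[]): number {
  let total = 0;
  let weightSum = 0;
  for (const d of dims) {
    total += d.weight * d.score;
    weightSum += d.weight;
  }
  return total / weightSum;
}

// 0.315 + 0.200 + 0.170 + 0.130 = 0.815
const recomputedOverall = weightedMean(exampleDimensions);
console.log(recomputedOverall.toFixed(3));
```

Neither veto applies here (0.90 is above the 0.60 threshold and 0.80 is above 0.50), so the weighted mean is the final score.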

Reference-Free Hallucination Detection: token entropy

Problem: RAG context is expensive

To check factual accuracy, the Judge ideally needs the RAG context (the sources the lesson is based on). But passing 3,000+ tokens of RAG context for every lesson:

Increases cost by 2-4x
Aggravates the "Lost in the Middle" problem
Slows down inference

Idea: Uncertainty Quantification via Log-Probabilities

When an LLM hallucinates, its internal confidence often drops, even if the generated text reads as confident. The token probability distribution has higher entropy during confabulation.

The math

Entropy for sentence S:

H(S) = -Σ p(x) * log(p(x))

where p(x) is the probability of token x at a given position.

High entropy = the model is "unsure" which token to pick
Low entropy = the model is "confident" in its choice
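A small worked example makes the two cases concrete. The sketch below computes Shannon entropy in bits for two hypothetical four-token distributions; the probabilities are made up for illustration and the function name is ours.

```typescript
// Shannon entropy in bits of one probability distribution over candidate tokens.
function entropyBits(probs: number[]): number {
  let h = 0;
  for (const p of probs) {
    if (p > 0) h -= p * Math.log2(p);
  }
  return h;
}

// Hypothetical distributions over four candidate tokens.
const confidentDist = [0.97, 0.01, 0.01, 0.01]; // one clear winner
const uncertainDist = [0.25, 0.25, 0.25, 0.25]; // four equally likely tokens

const lowEntropy = entropyBits(confidentDist);  // roughly 0.24 bits
const highEntropy = entropyBits(uncertainDist); // 2 bits, the maximum for four outcomes
console.log(lowEntropy.toFixed(3), highEntropy.toFixed(3));
```

The uniform case is the upper bound: with k candidates the entropy cannot exceed log2(k), so a threshold only makes sense relative to how many top alternatives are taken into account.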

Implementation

// src/evaluation/entropy-hallucination-detector.ts

interface TokenLogprob {
  token: string;
  logprob: number;
  topLogprobs: { token: string; logprob: number }[];
}

interface EntropyAnalysis {
  sentence: string;
  sentenceIndex: number;
  entropy: number;
  hasFactualClaim: boolean;
  flaggedAsRisk: boolean;
  riskReason?: string;
}

// Return shape of detectHallucinationRisk (type name assumed)
interface EntropyReport {
  totalSentences: number;
  flaggedSentences: number;
  analyses: EntropyAnalysis[];
  requiresRagValidation: boolean;
  flaggedIndices: number[];
}

// Main detection function
async function detectHallucinationRisk(
  generatedContent: string,
  tokenLogprobs: TokenLogprob[]
): Promise<EntropyReport> {
  const sentences = splitIntoSentences(generatedContent);
  const analyses: EntropyAnalysis[] = [];

  let tokenIndex = 0;

  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i];
    // tokenize() must match the generator's tokenizer so counts line up with tokenLogprobs
    const sentenceTokens = tokenize(sentence);

    // Collect the logprobs for this sentence's tokens
    const sentenceLogprobs = tokenLogprobs.slice(
      tokenIndex,
      tokenIndex + sentenceTokens.length
    );
    tokenIndex += sentenceTokens.length;

    // Compute the sentence entropy
    const entropy = calculateSentenceEntropy(sentenceLogprobs);

    // Detect factual claims (NER)
    const hasFactualClaim = detectFactualClaims(sentence);

    // Flag risk: high entropy + factual claim
    const flaggedAsRisk =
      entropy > ENTROPY_THRESHOLD && hasFactualClaim;

    analyses.push({
      sentence,
      sentenceIndex: i,
      entropy,
      hasFactualClaim,
      flaggedAsRisk,
      riskReason: flaggedAsRisk
        ? `High entropy (${entropy.toFixed(3)}) on factual claim`
        : undefined,
    });
  }

  return {
    totalSentences: sentences.length,
    flaggedSentences: analyses.filter(a => a.flaggedAsRisk).length,
    analyses,
    requiresRagValidation: analyses.some(a => a.flaggedAsRisk),
    flaggedIndices: analyses
      .filter(a => a.flaggedAsRisk)
      .map(a => a.sentenceIndex),
  };
}

// Entropy calculation using the top logprobs
function calculateSentenceEntropy(logprobs: TokenLogprob[]): number {
  if (logprobs.length === 0) return 0;

  let totalEntropy = 0;

  for (const tokenData of logprobs) {
    // Use the top-5 logprobs to approximate the distribution
    const probs = tokenData.topLogprobs.map(lp => Math.exp(lp.logprob));
    const sumProbs = probs.reduce((a, b) => a + b, 0);
    if (sumProbs === 0) continue; // No alternatives reported for this token
    const normalizedProbs = probs.map(p => p / sumProbs);

    // Shannon entropy (in bits)
    const entropy = -normalizedProbs.reduce((sum, p) => {
      return p > 0 ? sum + p * Math.log2(p) : sum;
    }, 0);

    totalEntropy += entropy;
  }

  return totalEntropy / logprobs.length; // Mean entropy per token
}

// Lightweight NER to detect factual claims
function detectFactualClaims(sentence: string): boolean {
  const factualPatterns = [
    // Dates
    /\b(в\s+)?\d{4}\s*(году|г\.)/i,
    /\b\d{1,2}\s+(января|февраля|марта|апреля|мая|июня|июля|августа|сентября|октября|ноября|декабря)/i,

    // Numbers with units
    // (\b is ASCII-only in JavaScript, so a lookahead closes the Cyrillic unit)
    /\b\d+(\.\d+)?\s*(процент|%|млн|тыс|км|м|кг|г)(?![а-яё])/i,

    // Proper names (simple heuristic; lookbehind replaces \b, which ignores Cyrillic)
    /(?<![А-ЯЁа-яё])[А-ЯЁ][а-яё]+\s+[А-ЯЁ][а-яё]+/, // e.g. Иван Петров

    // Organizations (lookbehind replaces \b, which ignores Cyrillic)
    /(?<![а-яё])(компания|организация|институт|университет)\s+[А-ЯЁ"«]/i,

    // Assertions with "is", "amounts to", "equals", "was" (Unicode-safe boundaries)
    /(?<![а-яё])(является|составляет|равен|равно|был|была|были)(?![а-яё])/i,

    // Quotations
    /["«][^"»]+["»]\s*[-—]\s*[А-ЯЁ]/,
  ];

  return factualPatterns.some(pattern => pattern.test(sentence));
}

// Threshold calibrated on our dataset
const ENTROPY_THRESHOLD = 0.8; // Above this = risk

Conditional RAG Strategy

// src/evaluation/conditional-rag.ts

async function evaluateWithConditionalRag(
  lesson: LessonContent,
  spec: LessonSpecification,
  ragContext: string | null,
  config: CLEVConfig // Passed through to clevEvaluate
): Promise<JudgeOutput> {
  // Step 1: Baseline evaluation (no RAG)
  const baselineResult = await clevEvaluate(lesson, spec, config);

  // Step 2: Entropy analysis (logprobs come from generation, so this is free)
  const entropyReport = await detectHallucinationRisk(
    lesson.content,
    lesson.tokenLogprobs // Saved at generation time
  );

  // Step 3: Conditional RAG check
  if (entropyReport.requiresRagValidation && ragContext) {
    // Only for flagged sentences
    const flaggedText = entropyReport.flaggedIndices
      .map(i => lesson.sentences[i])
      .join('\n');

    const ragValidation = await validateWithRag(
      flaggedText,
      ragContext
    );

    // Adjust factual_integrity score
    if (ragValidation.hallucinations.length > 0) {
      baselineResult.dimensions.factual_integrity.score *= 0.5;
      baselineResult.issues.push(...ragValidation.hallucinations.map(h => ({
        criterion: 'factual_integrity',
        severity: 'critical' as const,
        location: h.location,
        description: `Hallucination detected: ${h.claim}`,
        suggested_fix: `Replace with: ${h.correction}`,
      })));
    }
  }

  return recalculateOverallScore(baselineResult);
}

Limitations of the method

What we detect:

Confabulations: errors caused by uncertainty (high entropy)
Statistical anomalies: tokens with unusually high entropy

What we do NOT detect:

Confident misconceptions: the model is confidently wrong (training data bias)
Subtle factual errors: dates and numbers the model "memorized" incorrectly

ROI at our budget: entropy-based filtering → conditional RAG for only 15-20% of content → 60-70% savings on RAG calls.

Targeted Self-Refinement: fixing without full regeneration

The problem with regeneration

When the Judge returns score < 0.75, the naive solution is to regenerate the whole lesson. This:

Throws away the parts of the content that worked
Costs as much as a full generation (2000 output tokens)
Does not guarantee improvement (a new random seed ≠ a better result)

Critique-and-Correct Loop

Research shows that LLMs are considerably better at improving content from specific feedback than at generating it perfectly from scratch.

// src/refinement/targeted-fix.ts

interface FixContext {
  originalContent: string;
  judgeIssues: Issue[];
  judgeStrengths: string[];
  preserveSections: string[];
  terminologyGlossary: Map<string, string>; // term -> definition
}

// Template 1: Structured Feedback Refinement (score 0.60-0.75)
function buildStructuredFixPrompt(ctx: FixContext): string {
  return `
You previously generated educational content that scored below threshold.

ORIGINAL CONTENT:
${ctx.originalContent}

JUDGE FEEDBACK:
${JSON.stringify(ctx.judgeIssues, null, 2)}

TASK: Revise content to address all issues while preserving successful elements.

PRESERVE EXACTLY (do not modify):
${ctx.preserveSections.map(s => `- ${s}`).join('\n')}

SPECIFIC REVISIONS NEEDED:
${ctx.judgeIssues.map((issue, i) => `
${i + 1}. ${issue.criterion}: ${issue.description}
   Location: ${issue.location}
   Fix: ${issue.suggested_fix}
`).join('\n')}

MAINTAIN:
- Learning objective alignment
- Consistent terminology: ${[...ctx.terminologyGlossary.entries()].map(([k, v]) => `"${k}" = ${v}`).join(', ')}
- Same pedagogical approach (Bloom's level)
- Transitions with surrounding content

Provide ONLY the revised content, maintaining the same overall structure.
`.trim();
}

// Template 2: Targeted Section Fix (score 0.75-0.90)
function buildTargetedSectionFixPrompt(
  fullContent: string,
  sectionToFix: string,
  issue: Issue,
  surroundingContext: { before: string; after: string }
): string {
  return `
The following lesson content scored well overall, but has issues in one section.

FULL LESSON (for context):
${fullContent}

SECTION REQUIRING REVISION:
${sectionToFix}

ISSUE:
${issue.description}
Fix required: ${issue.suggested_fix}

CONSTRAINTS:
- Preserve all other sections unchanged
- Maintain transitions:
  * Lead-in from previous section: "${surroundingContext.before}"
  * Lead-out to next section: "${surroundingContext.after}"
- Use consistent terminology
- Match detail level of surrounding content

Rewrite ONLY the flagged section.
`.trim();
}

// Template 3: Iterative History Retention (Self-Refine method)
function buildIterativeFixPrompt(
  history: RefinementHistory
): string {
  return `
Revise content while maintaining all previous improvements.

ITERATIVE HISTORY:
${history.entries.map((entry, i) => `
--- Iteration ${i} ---
Content: ${entry.content.substring(0, 500)}...
Feedback: ${JSON.stringify(entry.feedback)}
Score: ${entry.score}
`).join('\n')}

CURRENT TASK:
Address remaining issues without regressing on previous fixes.

FIXED ISSUES (do not reintroduce):
${history.fixedIssues.map(i => `- ${i}`).join('\n')}

NEW ISSUES TO ADDRESS:
${history.currentIssues.map(i => `- ${i}`).join('\n')}

PRESERVE:
- All terminology established in previous revisions
- Successful examples from earlier iterations
- Improved structure from Iteration ${history.entries.length - 1}

Provide complete revised lesson maintaining all previous improvements.
`.trim();
}

Model-Specific Iteration Limits

Different models have different "stamina" for iterative refinement:

// src/refinement/iteration-limits.ts

interface ModelIterationProfile {
  maxIterations: number;
  diminishingReturnsThreshold: number; // Min improvement per iteration
  exhaustionIndicators: string[];
}

const ITERATION_PROFILES: Record<string, ModelIterationProfile> = {
  'gpt-4': {
    maxIterations: 3,
    diminishingReturnsThreshold: 0.03, // 3% min improvement
    exhaustionIndicators: [
      'repeating previous fixes',
      'introducing new errors while fixing old',
      'degrading previously good sections',
    ],
  },
  'gpt-3.5-turbo': {
    maxIterations: 2,
    diminishingReturnsThreshold: 0.05,
    exhaustionIndicators: [
      'circular edits',
      'loss of coherence',
    ],
  },
  'qwen2.5-coder': {
    maxIterations: 5, // A more resilient model
    diminishingReturnsThreshold: 0.02,
    exhaustionIndicators: [
      'style drift',
      'verbosity increase',
    ],
  },
  'default': {
    maxIterations: 2,
    diminishingReturnsThreshold: 0.05,
    exhaustionIndicators: [],
  },
};

// Decision tree for refinement vs regeneration
async function decideRefinementStrategy(
  score: number,
  issues: Issue[],
  iterationCount: number,
  model: string
): Promise<'accept' | 'targeted_fix' | 'iterative_refine' | 'regenerate' | 'escalate'> {
  const profile = ITERATION_PROFILES[model] || ITERATION_PROFILES.default;

  // Score >= 0.90: Accept
  if (score >= 0.90) {
    return 'accept';
  }

  // Score 0.75-0.90: targeted fix when most issues are localized
  if (score >= 0.75) {
    const localizedIssues = issues.filter(i => i.location !== 'entire lesson');
    // Guard against an empty issue list (0 / 0 would be NaN)
    if (issues.length > 0 && localizedIssues.length / issues.length > 0.7) {
      return 'targeted_fix';
    }
    return 'iterative_refine';
  }

  // Score 0.60-0.75: Iterative refinement if iterations remain
  if (score >= 0.60) {
    if (iterationCount < profile.maxIterations) {
      return 'iterative_refine';
    }
    return 'regenerate';
  }

  // Score 0.40-0.60: Immediate regenerate
  if (score >= 0.40) {
    return 'regenerate';
  }

  // Score < 0.40: Escalate to human/premium model
  return 'escalate';
}

Coherence Preservation Techniques

With targeted fixes it is critical to preserve coherence with the rest of the content:

// src/refinement/coherence-preservation.ts

// Technique 1: Context Windowing
function extractContextWindow(
  fullContent: string,
  targetSection: string,
  windowSize: number = 2 // paragraphs before/after
): { before: string; after: string } {
  const paragraphs = fullContent.split('\n\n');
  const targetIndex = paragraphs.findIndex(p => p.includes(targetSection));

  // Section not found: return empty context instead of slicing relative to -1
  if (targetIndex === -1) {
    return { before: '', after: '' };
  }

  const beforeStart = Math.max(0, targetIndex - windowSize);
  const afterEnd = Math.min(paragraphs.length, targetIndex + windowSize + 1);

  return {
    before: paragraphs.slice(beforeStart, targetIndex).join('\n\n'),
    after: paragraphs.slice(targetIndex + 1, afterEnd).join('\n\n'),
  };
}

// Technique 2: Terminology Locking
function extractTerminologyGlossary(
  content: string,
  spec: LessonSpecification
): Map<string, string> {
  const glossary = new Map<string, string>(); // term -> definition

  // Extract defined terms
  const definitionPatterns = [
    /([А-ЯЁA-Z][а-яёa-z]+)\s*[-—]\s*это\s+([^.]+)/g,
    /([А-ЯЁA-Z][а-яёa-z]+)\s+называется\s+([^.]+)/g,
    /под\s+([А-ЯЁA-Z][а-яёa-z]+)\s+понимается\s+([^.]+)/g,
  ];

  for (const pattern of definitionPatterns) {
    let match;
    while ((match = pattern.exec(content)) !== null) {
      glossary.set(match[1], match[2].trim());
    }
  }

  // Add terms from specification
  if (spec.keyTerms) {
    for (const term of spec.keyTerms) {
      if (!glossary.has(term.name)) {
        glossary.set(term.name, term.definition);
      }
    }
  }

  return glossary;
}

// Technique 3: Explicit Preservation Lists
function generatePreservationList(
  content: string,
  judgeStrengths: string[]
): string[] {
  const preserveList: string[] = [];

  // Preserve sections mentioned in strengths
  for (const strength of judgeStrengths) {
    const sectionMatch = strength.match(/(section|paragraph|example)\s+\d+/i);
    if (sectionMatch) {
      preserveList.push(`${sectionMatch[0]} (praised by judge)`);
    }
  }

  // Always preserve: introduction hook, conclusion summary
  preserveList.push('Introduction hook (lines 1-5)');
  preserveList.push('Conclusion summary (last 3 paragraphs)');

  return preserveList;
}
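
One way the outputs of these three techniques might be combined into a single targeted-fix prompt is sketched below. The `buildTargetedFixPrompt` function, the `TargetedFixInput` shape, and the prompt wording are illustrative assumptions, not part of the pipeline shown above.

```typescript
// Hypothetical input: the outputs of the three techniques above.
interface TargetedFixInput {
  targetSection: string;
  issue: string;
  context: { before: string; after: string };
  glossary: Map<string, string>;
  preserveList: string[];
}

// Assemble a prompt that asks the model to rewrite only the target section
// while keeping surrounding context, terminology, and praised parts intact.
function buildTargetedFixPrompt(input: TargetedFixInput): string {
  const glossaryLines: string[] = [];
  input.glossary.forEach((definition, term) => {
    glossaryLines.push(`- ${term}: ${definition}`);
  });
  const preserveLines = input.preserveList.map(item => `- ${item}`);

  return [
    '[TASK]',
    `Rewrite ONLY the target section to fix this issue: ${input.issue}`,
    '[CONTEXT BEFORE - do not modify]',
    input.context.before,
    '[TARGET SECTION - rewrite this]',
    input.targetSection,
    '[CONTEXT AFTER - do not modify]',
    input.context.after,
    '[LOCKED TERMINOLOGY - use these terms exactly as defined]',
    glossaryLines.join('\n'),
    '[PRESERVE]',
    preserveLines.join('\n'),
  ].join('\n');
}

const fixPrompt = buildTargetedFixPrompt({
  targetSection: 'Force equals mass divided by acceleration.',
  issue: "Incorrect statement of Newton's second law",
  context: { before: 'Newton studied motion.', after: 'Next, we look at friction.' },
  glossary: new Map([['Force', 'an interaction that changes the motion of an object']]),
  preserveList: ['Introduction hook (lines 1-5)'],
});
console.log(fixPrompt);
```

Keeping the target section between its neighbours gives the model the transitions it has to match, while the locked glossary discourages it from renaming terms the rest of the lesson already uses.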

Circuit Breaker: protection against runaway costs

Problem: Infinite Refinement Loops

Without limits, the system can get stuck in a loop:

Generate → Score 0.65 → Refine → Score 0.68 → Refine → Score 0.66 → ...

Every iteration costs money, yet the score oscillates without real progress.
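
To make the stall concrete, here is a minimal sketch that measures the trace above. The helper names `scoreDeltas` and `isStalled` are hypothetical; the 0.03 threshold is borrowed from the `minImprovementPerIteration` default used later in this section.

```typescript
// Per-iteration score deltas for a refinement trace.
function scoreDeltas(scores: number[]): number[] {
  return scores.slice(1).map((score, i) => score - scores[i]);
}

// True when the net gain over the whole trace is smaller than the gain
// expected from that many paid refinement iterations.
function isStalled(scores: number[], minGainPerIteration: number): boolean {
  if (scores.length < 2) return false;
  const iterations = scores.length - 1;
  const netGain = scores[scores.length - 1] - scores[0];
  return netGain < minGainPerIteration * iterations;
}

const trace = [0.65, 0.68, 0.66];
console.log(scoreDeltas(trace).map(d => d.toFixed(2))); // [ '0.03', '-0.02' ]
console.log(isStalled(trace, 0.03)); // true
```

Two paid refinements bought a net gain of about 0.01, which is the pattern the circuit breaker is meant to cut off.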

Circuit Breaker Implementation

// src/evaluation/circuit-breaker.ts

interface CircuitBreakerConfig {
  maxIterations: number;
  maxTotalCost: number;
  minImprovementPerIteration: number;
  minFinalScore: number;
  escalationThreshold: number;
}

interface CircuitBreakerState {
  iterationCount: number;
  totalCost: number;
  scoreHistory: number[];
  lastDecision: string;
}

const DEFAULT_CONFIG: CircuitBreakerConfig = {
  maxIterations: 3,
  maxTotalCost: 0.05, // $0.05 per lesson max
  minImprovementPerIteration: 0.03, // 3% minimum
  minFinalScore: 0.75,
  escalationThreshold: 0.50,
};

function shouldBreakCircuit(
  state: CircuitBreakerState,
  currentScore: number,
  config: CircuitBreakerConfig = DEFAULT_CONFIG
): { break: boolean; reason: string; action: string } {
  // Rule 1: Max iterations exceeded
  if (state.iterationCount >= config.maxIterations) {
    return {
      break: true,
      reason: 'max_iterations_exceeded',
      action: currentScore >= config.minFinalScore
        ? 'accept_with_warning'
        : 'escalate_to_human',
    };
  }

  // Rule 2: Cost budget exceeded
  if (state.totalCost >= config.maxTotalCost) {
    return {
      break: true,
      reason: 'cost_budget_exceeded',
      action: 'accept_current_best',
    };
  }

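  // Rule 3: Diminishing returns detection
  // scoreHistory already contains currentScore (the caller pushes it before
  // this check), so the previous score is the second-to-last entry.
  if (state.scoreHistory.length >= 2) {
    const previousScore = state.scoreHistory[state.scoreHistory.length - 2];
    const improvement = currentScore - previousScore;

    if (improvement < config.minImprovementPerIteration) {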
      return {
        break: true,
        reason: 'diminishing_returns',
        action: currentScore >= config.minFinalScore
          ? 'accept'
          : 'escalate_to_human',
      };
    }
  }

  // Rule 4: Score oscillation detection
  if (state.scoreHistory.length >= 3) {
    const recent = state.scoreHistory.slice(-3);
    const isOscillating =
      (recent[0] < recent[1] && recent[1] > recent[2]) ||
      (recent[0] > recent[1] && recent[1] < recent[2]);

    if (isOscillating) {
      return {
        break: true,
        reason: 'score_oscillation',
        action: 'accept_best_from_history',
      };
    }
  }

  // Rule 5: Critical failure threshold
  if (currentScore < config.escalationThreshold) {
    return {
      break: true,
      reason: 'critical_failure',
      action: 'escalate_to_premium_model',
    };
  }

  // No break - continue refinement
  return { break: false, reason: '', action: 'continue' };
}

// Main evaluation loop with circuit breaker
async function evaluateWithCircuitBreaker(
  lesson: LessonContent,
  spec: LessonSpecification,
  ragContext: string | null
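): Promise<EvaluationResult & {
  circuitBreakerTriggered: boolean;
  breakerReason: string;
  finalAction: string;
  totalIterations: number;
  totalCost: number;
}> {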
  const state: CircuitBreakerState = {
    iterationCount: 0,
    totalCost: 0,
    scoreHistory: [],
    lastDecision: '',
  };

  let currentContent = lesson.content;
  let bestResult: EvaluationResult | null = null;
  let bestScore = 0;

  while (true) {
    // Evaluate current content
    const result = await evaluateWithConditionalRag(
      { ...lesson, content: currentContent },
      spec,
      ragContext
    );

    state.iterationCount++;
    state.totalCost += result.cost;
    state.scoreHistory.push(result.finalScore);

    // Track best result
    if (result.finalScore > bestScore) {
      bestScore = result.finalScore;
      bestResult = result;
    }

    // Check circuit breaker
    const breakerDecision = shouldBreakCircuit(state, result.finalScore);
    state.lastDecision = breakerDecision.reason;

    if (breakerDecision.break) {
      return {
        ...bestResult!,
        circuitBreakerTriggered: true,
        breakerReason: breakerDecision.reason,
        finalAction: breakerDecision.action,
        totalIterations: state.iterationCount,
        totalCost: state.totalCost,
      };
    }

    // Score acceptable - accept
    if (result.finalScore >= 0.85) {
      return {
        ...result,
        circuitBreakerTriggered: false,
        breakerReason: '',
        finalAction: 'accept',
        totalIterations: state.iterationCount,
        totalCost: state.totalCost,
      };
    }

    // Refinement needed
    const strategy = await decideRefinementStrategy(
      result.finalScore,
      result.issues,
      state.iterationCount,
      lesson.generatorModel
    );

    if (strategy === 'escalate') {
      return {
        ...result,
        circuitBreakerTriggered: true,
        breakerReason: 'manual_escalation',
        finalAction: 'escalate_to_human',
        totalIterations: state.iterationCount,
        totalCost: state.totalCost,
      };
    }

    // Apply refinement
    currentContent = await applyRefinement(
      currentContent,
      result,
      spec,
      strategy
    );
  }
}

Model Fallback Hierarchy

// src/evaluation/model-fallback.ts

interface FallbackChain {
  generator: string[];
  judge: string[];
}

const FALLBACK_CHAINS: FallbackChain = {
  generator: [
    'qwen3-235b',      // Primary (Russian)
    'deepseek-terminus', // Primary (English)
    'kimi-k2',         // Fallback
    'gpt-4o-mini',     // Emergency (different architecture)
    'HUMAN',           // Last resort
  ],
  judge: [
    'gemini-flash',    // Primary judge
    'gpt-4o-mini',     // First fallback
    'claude-haiku',    // Second fallback
    'HUMAN',           // If all fail
  ],
};

async function executeWithFallback<T>(
  chain: string[],
  operation: (model: string) => Promise<T>,
  maxRetries: number = 2
): Promise<{ result: T; modelUsed: string; fallbacksUsed: number }> {
  let fallbacksUsed = 0;

  for (const model of chain) {
    if (model === 'HUMAN') {
      throw new Error('Human intervention required');
    }

    for (let retry = 0; retry < maxRetries; retry++) {
      try {
        const result = await operation(model);
        return { result, modelUsed: model, fallbacksUsed };
      } catch (error) {
        console.warn(`Model ${model} failed (attempt ${retry + 1}):`, error);
      }
    }

    fallbacksUsed++;
    console.warn(`Falling back from ${model} to ${chain[fallbacksUsed]}`);
  }

  throw new Error('All models in fallback chain failed');
}
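
A usage sketch for the judge chain follows. To stay self-contained it uses `runChain`, a compact stand-in for `executeWithFallback` (one attempt per model, no logging), and `fakeJudge`, a stub that simulates an outage of the primary judge; both names are hypothetical.

```typescript
interface FallbackOutcome<T> {
  result: T;
  modelUsed: string;
  fallbacksUsed: number;
}

// Compact stand-in for executeWithFallback: one attempt per model.
async function runChain<T>(
  chain: string[],
  operation: (model: string) => Promise<T>
): Promise<FallbackOutcome<T>> {
  let fallbacksUsed = 0;
  for (const model of chain) {
    if (model === 'HUMAN') {
      throw new Error('Human intervention required');
    }
    try {
      const result = await operation(model);
      return { result, modelUsed: model, fallbacksUsed };
    } catch (error) {
      fallbacksUsed++; // move on to the next model in the chain
    }
  }
  throw new Error('All models in fallback chain failed');
}

// Stub judge: the primary model is unavailable, every other model answers.
async function fakeJudge(model: string): Promise<number> {
  if (model === 'gemini-flash') {
    throw new Error('503 Service Unavailable');
  }
  return 0.82;
}

runChain(['gemini-flash', 'gpt-4o-mini', 'claude-haiku', 'HUMAN'], fakeJudge)
  .then(outcome => console.log(outcome.modelUsed, outcome.fallbacksUsed)); // gpt-4o-mini 1
```

Recording `fallbacksUsed` alongside the result makes it possible to track how often the primary judge is unavailable, which matters because scores from different judge models are not necessarily comparable.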

Cost Engineering: reaching $0.014 per course

Target budget breakdown

Constraint: $0.50 per course (10-30 lessons)
Target: ~70% for generation, ~30% for validation + refinement

Component	Budget per lesson	For 20 lessons
Generation	$0.015	$0.30
Judging (CLEV)	$0.00234	$0.047
Refinement (30% of lessons)	$0.005	$0.10
Total validation	$0.00734	$0.147
Total per course		$0.447
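
The table's arithmetic can be reproduced with a short sketch. The `BudgetLine` type and the helper names are hypothetical; the per-lesson figures are the ones from the table, with the refinement line treated as already averaged over all lessons.

```typescript
interface BudgetLine {
  component: string;
  perLesson: number; // USD, averaged over all lessons
}

const budgetLines: BudgetLine[] = [
  { component: 'Generation', perLesson: 0.015 },
  { component: 'Judging (CLEV)', perLesson: 0.00234 },
  { component: 'Refinement', perLesson: 0.005 },
];

// Total cost of a course with the given number of lessons.
function courseCost(lines: BudgetLine[], lessons: number): number {
  return lines.reduce((sum, line) => sum + line.perLesson * lessons, 0);
}

// Largest lesson count that still fits into the per-course budget.
function maxLessonsWithinBudget(lines: BudgetLine[], budgetPerCourse: number): number {
  const perLesson = lines.reduce((sum, line) => sum + line.perLesson, 0);
  return Math.floor(budgetPerCourse / perLesson);
}

console.log(courseCost(budgetLines, 20).toFixed(3)); // 0.447
console.log(maxLessonsWithinBudget(budgetLines, 0.5)); // 22
```

Under these per-lesson figures the $0.50 constraint holds up to 22 lessons, so a 30-lesson course would need a lower per-lesson spend than the table assumes.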

Optimization Strategies

Strategy 1: Prompt Caching

// Cached portion: ~2,000 tokens (rubric, instructions, examples)
const CACHED_PROMPT = `
[SYSTEM INSTRUCTIONS]
You are an expert Educational Content Evaluator...

[OSCQR RUBRIC]
${JSON.stringify(OSCQR_TRANSLATIONS)}

[FEW-SHOT EXAMPLES]
${FEW_SHOT_EXAMPLES}
`;

// Dynamic portion: ~1,500 tokens (lesson + spec)
const DYNAMIC_PROMPT = `
[LESSON CONTENT]
${lesson.content}

[SPECIFICATION]
${JSON.stringify(spec)}
`;

// Cost with caching (Anthropic: 90% cheaper for cached)
// First request: $0.00195
// Subsequent (within 5-10 min): $0.00078
// Batch processing 20 lessons: ~$0.017 (1 x $0.00195 + 19 x $0.00078, vs $0.039 without caching)
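
A hedged sketch of how the static and dynamic portions might be laid out in a request body, assuming the Anthropic Messages API convention of marking a system block with `cache_control: { type: 'ephemeral' }`. The `buildCachedJudgeRequest` helper, the `max_tokens` value, and the model id placeholder are assumptions; no SDK call is made here.

```typescript
interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
}

interface JudgeRequestBody {
  model: string;
  max_tokens: number;
  temperature: number;
  system: TextBlock[];
  messages: { role: 'user'; content: string }[];
}

// Build a request body whose static prefix is marked as cacheable.
function buildCachedJudgeRequest(
  cachedPrompt: string,
  dynamicPrompt: string,
  model: string
): JudgeRequestBody {
  return {
    model,
    max_tokens: 1024, // assumed output budget for the judge verdict
    temperature: 0.1,
    system: [
      // Static prefix (rubric, instructions, examples): identical across lessons.
      { type: 'text', text: cachedPrompt, cache_control: { type: 'ephemeral' } },
    ],
    // Dynamic suffix (lesson + spec): changes per request, stays uncached.
    messages: [{ role: 'user', content: dynamicPrompt }],
  };
}

const judgeRequest = buildCachedJudgeRequest(
  '[SYSTEM INSTRUCTIONS] ... [OSCQR RUBRIC] ... [FEW-SHOT EXAMPLES] ...',
  '[LESSON CONTENT] ... [SPECIFICATION] ...',
  'your-judge-model-id' // placeholder, not a real model name
);
console.log(JSON.stringify(judgeRequest, null, 2));
```

The cached prefix has to be byte-identical between requests, so anything lesson-specific belongs in the user message. Actual savings depend on the provider's cache lifetime and minimum cacheable prompt length, which are worth checking against current documentation.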

Strategy 2: Heuristic Pre-Filters (FREE)

// src/evaluation/heuristic-prefilter.ts

interface PreFilterResult {
  passed: boolean;
  issues: string[];
  skipJudge: boolean;
}

function runHeuristicPreFilters(
  lesson: LessonContent,
  spec: LessonSpecification
): PreFilterResult {
  const issues: string[] = [];

  // Filter 1: Length check
  const wordCount = lesson.content.split(/\s+/).length;
  if (wordCount < spec.minWords || wordCount > spec.maxWords) {
    issues.push(`Word count ${wordCount} outside range [${spec.minWords}, ${spec.maxWords}]`);
  }

  // Filter 2: Flesch-Kincaid (algorithmic, no LLM)
  const fk = calculateFleschKincaid(lesson.content);
  const targetGrade = spec.targetGradeLevel;
  if (Math.abs(fk - targetGrade) > 2) {
    issues.push(`Flesch-Kincaid ${fk} differs from target ${targetGrade} by >2`);
  }

  // Filter 3: Required sections presence
  for (const section of spec.requiredSections) {
    if (!lesson.content.toLowerCase().includes(section.toLowerCase())) {
      issues.push(`Missing required section: ${section}`);
    }
  }

  // Filter 4: Keyword coverage
  const keywords = spec.requiredKeywords || [];
  const missingKeywords = keywords.filter(
    kw => !lesson.content.toLowerCase().includes(kw.toLowerCase())
  );
  if (missingKeywords.length > keywords.length * 0.3) {
    issues.push(`Missing >30% required keywords: ${missingKeywords.join(', ')}`);
  }

  // Filter 5: Structure markers
  const hasIntro = /^(введение|introduction|в этом уроке)/im.test(lesson.content);
  const hasConclusion = /(заключение|conclusion|подводя итог|в завершение)/im.test(lesson.content);
  if (!hasIntro || !hasConclusion) {
    issues.push('Missing intro or conclusion markers');
  }

  return {
    passed: issues.length === 0,
    issues,
    skipJudge: issues.length > 3, // Immediate regenerate if too many issues
  };
}

// This filters 30-50% of content at ZERO cost
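
The pre-filter calls `calculateFleschKincaid` without showing it. A minimal sketch using the standard Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, could look like this. The vowel-group syllable counter is a rough heuristic, and the coefficients were calibrated for English, so for Russian text the result is best read as a relative signal (both are assumptions of this sketch, not statements about the original implementation).

```typescript
// Approximate syllable count: number of vowel groups (Latin and Cyrillic).
function countSyllables(word: string): number {
  const groups = word.toLowerCase().match(/[aeiouyаеёиоуыэюя]+/g);
  return Math.max(1, groups ? groups.length : 0);
}

// Flesch-Kincaid grade level; returns 0 for empty input.
function calculateFleschKincaid(text: string): number {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const words = text.match(/[A-Za-zА-Яа-яЁё']+/g) || [];
  if (sentences.length === 0 || words.length === 0) {
    return 0;
  }
  let syllables = 0;
  for (const word of words) {
    syllables += countSyllables(word);
  }
  return (
    0.39 * (words.length / sentences.length) +
    11.8 * (syllables / words.length) -
    15.59
  );
}

const simpleText = 'The cat sat. The dog ran.';
const denseText =
  'Electromagnetic induction fundamentally characterizes contemporary engineering methodologies.';
console.log(calculateFleschKincaid(simpleText).toFixed(2)); // -2.62
console.log(calculateFleschKincaid(simpleText) < calculateFleschKincaid(denseText)); // true
```

Very short, simple text can produce a negative grade, so a production version would likely clamp the value before comparing it with `spec.targetGradeLevel`.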

Strategy 3: Batch API Processing

// For non-real-time validation (pre-production QA)
// OpenAI Batch API: 50% discount, 24-hour processing

async function batchEvaluateCourse(
  lessons: LessonContent[],
  spec: CourseSpecification
): Promise<EvaluationResult[]> {
  const requests = lessons.map((lesson, i) => ({
    custom_id: `lesson_${i}`,
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: CACHED_PROMPT },
        { role: 'user', content: buildDynamicPrompt(lesson, spec.lessons[i]) },
      ],
      temperature: 0.1,
    },
  }));

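  // Submit batch (50% discount)
  let batch = await openai.batches.create({
    input_file_id: await uploadRequests(requests),
    endpoint: '/v1/chat/completions',
    completion_window: '24h',
  });

  // Poll for completion; stop on terminal failure states instead of looping forever
  while (batch.status !== 'completed') {
    if (['failed', 'expired', 'cancelled'].includes(batch.status)) {
      throw new Error(`Batch ${batch.id} ended with status ${batch.status}`);
    }
    await sleep(60000); // Check every minute
    batch = await openai.batches.retrieve(batch.id);
  }

  return parseBatchResults(batch.output_file_id);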
}

// Cost: $0.00098/lesson (vs $0.00195 real-time)
// Total for 20-lesson course: $0.020
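
The `uploadRequests` helper is not shown above. The Batch API takes its input as a JSONL file (one request object per line), so the serialization step might look like the sketch below. The `toBatchJsonl` name and the duplicate-id check are assumptions; the upload itself, presumably through the Files API with purpose `batch`, is left out.

```typescript
interface BatchRequest {
  custom_id: string;
  method: 'POST';
  url: string;
  body: Record<string, unknown>;
}

// Serialize batch requests to JSONL: one JSON object per line.
// custom_id values must be unique so results can be matched back to lessons.
function toBatchJsonl(requests: BatchRequest[]): string {
  const seen = new Set<string>();
  for (const request of requests) {
    if (seen.has(request.custom_id)) {
      throw new Error(`Duplicate custom_id: ${request.custom_id}`);
    }
    seen.add(request.custom_id);
  }
  return requests.map(request => JSON.stringify(request)).join('\n');
}

const jsonl = toBatchJsonl([
  { custom_id: 'lesson_0', method: 'POST', url: '/v1/chat/completions', body: { model: 'gpt-4o-mini' } },
  { custom_id: 'lesson_1', method: 'POST', url: '/v1/chat/completions', body: { model: 'gpt-4o-mini' } },
]);
console.log(jsonl.split('\n').length); // 2
```

Batch results are not guaranteed to come back in submission order, which is why the `custom_id` of each line is the key for mapping scores to lessons.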

Final Cost Calculation

Hybrid Cascade Architecture:

Stage 1: Heuristic Pre-filters → FREE
         Filters 30-50% instantly

Stage 2: Single Judge (Gemini Flash) → $0.00065/lesson
         For 50-70% of content passing Stage 1
         Average: $0.00033/lesson

Stage 3: CLEV 3x Voting → $0.00195/lesson
         For 15-20% low-confidence cases
         Average: $0.00039/lesson

Refinement: 1 iteration for 30% of lessons → $0.00150/lesson
            Average: $0.00045/lesson

TOTAL: $0.00033 + $0.00039 + $0.00045 = $0.00117/lesson
       20 lessons: $0.0234

       vs Manual review: $80-240/course
       Savings: 3,400-10,300x
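
The same expected-cost arithmetic as a sketch. The `CascadeStage` type and helper name are hypothetical; the shares (50% for the single judge, 20% for CLEV, 30% for refinement) are the ones implied by the averages above. Unrounded, the per-lesson total is about $0.001165, which the calculation above rounds to $0.00117.

```typescript
interface CascadeStage {
  stage: string;
  costPerLesson: number;  // USD when the stage runs
  shareOfLessons: number; // fraction of lessons that reach the stage
}

const cascade: CascadeStage[] = [
  { stage: 'Heuristic pre-filters', costPerLesson: 0, shareOfLessons: 1.0 },
  { stage: 'Single judge', costPerLesson: 0.00065, shareOfLessons: 0.5 },
  { stage: 'CLEV 3x voting', costPerLesson: 0.00195, shareOfLessons: 0.2 },
  { stage: 'Refinement', costPerLesson: 0.0015, shareOfLessons: 0.3 },
];

// Expected cost per lesson: each stage's cost weighted by how often it runs.
function expectedCostPerLesson(stages: CascadeStage[]): number {
  return stages.reduce((sum, s) => sum + s.costPerLesson * s.shareOfLessons, 0);
}

const perLesson = expectedCostPerLesson(cascade);
const perCourse = perLesson * 20;
console.log(perLesson.toFixed(6), perCourse.toFixed(4));
console.log(Math.round(80 / perCourse), Math.round(240 / perCourse)); // savings range vs manual review
```

Because each stage's cost is multiplied by its share, the total is most sensitive to the refinement rate and the CLEV activation rate, which are the two figures worth monitoring in production.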

Conclusion: Production Checklist

Minimum Viable Implementation

Cross-Model Pairing: generator family ≠ judge family
CLEV Voting: 2 judges default, 3rd on disagreement
OSCQR Rubric: Weighted criteria with VETO thresholds
Entropy Pre-screening: Flag high-uncertainty factual claims
Circuit Breaker: Max 3 iterations, diminishing returns detection
Prompt Caching: 60-90% cost reduction on static portions

Monitoring Dashboard

interface JudgeMetrics {
  // Quality
  judgeHumanAgreement: number;       // Target: >80%
  falsePositiveRate: number;         // Target: <10%
  falseNegativeRate: number;         // Target: <5%

  // Cost
  averageCostPerLesson: number;      // Target: <$0.002
  clevActivationRate: number;        // Expect: 15-30%
  refinementRate: number;            // Target: <30%

  // Operations
  circuitBreakerTriggerRate: number; // Target: <5%
  humanEscalationRate: number;       // Target: <2%
  averageIterationsPerLesson: number; // Target: <1.5
}

How the periodic quality review works

Every few months we select 30-50 lessons and have experts review them manually.
We compare the experts' verdicts with what the algorithm produced.
We look at where they diverge and why.
We revise the evaluation criteria and few-shot examples so the model makes fewer mistakes.
Where needed, we adjust the thresholds at which the algorithm counts as "confident" in its decision.
We record every change so that progress can be tracked.
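
The comparison step can be sketched as follows. The `AuditedLesson` shape and the function name are hypothetical, and the definitions are an assumption: a false positive is a lesson the judge passed but the expert failed, and a false negative is a lesson the judge failed but the expert passed.

```typescript
type Verdict = 'pass' | 'fail';

interface AuditedLesson {
  lessonId: string;
  judge: Verdict;  // automated verdict
  expert: Verdict; // manual verdict, treated as ground truth
}

interface CalibrationMetrics {
  judgeHumanAgreement: number;
  falsePositiveRate: number; // judge passed / expert failed, over expert fails
  falseNegativeRate: number; // judge failed / expert passed, over expert passes
}

function computeCalibrationMetrics(sample: AuditedLesson[]): CalibrationMetrics {
  let agreements = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  let expertFails = 0;
  let expertPasses = 0;

  for (const lesson of sample) {
    if (lesson.judge === lesson.expert) agreements++;
    if (lesson.expert === 'fail') {
      expertFails++;
      if (lesson.judge === 'pass') falsePositives++;
    } else {
      expertPasses++;
      if (lesson.judge === 'fail') falseNegatives++;
    }
  }

  // Guard against empty groups to avoid division by zero.
  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  return {
    judgeHumanAgreement: ratio(agreements, sample.length),
    falsePositiveRate: ratio(falsePositives, expertFails),
    falseNegativeRate: ratio(falseNegatives, expertPasses),
  };
}

const auditSample: AuditedLesson[] = [
  { lessonId: 'l1', judge: 'pass', expert: 'pass' },
  { lessonId: 'l2', judge: 'pass', expert: 'fail' },
  { lessonId: 'l3', judge: 'fail', expert: 'fail' },
  { lessonId: 'l4', judge: 'pass', expert: 'pass' },
];
console.log(computeCalibrationMetrics(auditSample));
// { judgeHumanAgreement: 0.75, falsePositiveRate: 0.5, falseNegativeRate: 0 }
```

With samples of 30-50 lessons these rates carry wide uncertainty, so it seems safer to read them as trends across review rounds than as precise measurements.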

Contacts and feedback

Channel (infrequent posts): https://t.me/maslennikovigor

Direct contact: https://t.me/maslennikovig

GitHub

Issues: for bugs and technical questions

Discussions: for ideas and architecture discussions

Feedback

I would be glad to hear:

Criticism: where are the weak spots in the architecture? Which edge cases did I miss?
Alternatives: how do you approach validating LLM-generated content?
Benchmarks: if you have reproduced the methodology, please share your results

Igor Maslennikov
AI Dev Team, DNA IT
In IT since 2013

Sources

Self-Preference Bias: Arize AI — "Testing Self-Evaluation Bias" https://arize.com/blog/should-i-use-the-same-llm-for-my-eval-as-my-agent-testing-self-evaluation-bias/
Language Model Self-Preference: NYU Data Science — "Language Models Often Favor Their Own Text" https://nyudatascience.medium.com/language-models-often-favor-their-own-text-revealing-a-new-bias-in-ai-e6f7a8fa5959
OSCQR Rubric: SUNY Online Course Quality Review https://oscqr.suny.edu/
Self-Refine: OpenReview — "Iterative Refinement with Self-Feedback" https://openreview.net/forum?id=S37hOerQLB
Entropy Hallucination Detection: Arch Gateway — "Detecting Hallucinations with Entropy" https://www.archgw.com/blogs/detecting-hallucinations-in-llm-function-calling-with-entropy-and-varentropy
Log-Probability Uncertainty: ResearchGate — "Logprobs Know Uncertainty" https://www.researchgate.net/publication/394078106_Logprobs_Know_Uncertainty_Fighting_LLM_Hallucinations
DeepSeek Pricing: DeepSeek API Docs https://api-docs.deepseek.com/quick_start/pricing-details-usd
Temperature Effects: arXiv — "The Effect of Sampling Temperature on Problem Solving" https://arxiv.org/html/2402.05201v1
LLM Judge Evaluation: Galileo AI — "LLM-as-a-Judge vs Human Evaluation" https://galileo.ai/blog/llm-as-a-judge-vs-human-evaluation
Semantic Entropy: NIH PMC — "Detecting hallucinations using semantic entropy" https://pmc.ncbi.nlm.nih.gov/articles/PMC11186750/
