In regular checks on the production stream we see values above 0.8 (the figure naturally fluctuates, since it depends on the distribution of errors). From time to time we also catch cases that the current prompts handle poorly, and then we update either the prompts or the preprocessing.
In the prompt we ask the judge to assign a score on a scale, for example from 0 to 1 or from 0 to 100. Here is an example prompt from HF (https://huggingface.co/learn/cookbook/llm_judge):
JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.
Provide your feedback as follows:
Feedback:::
Total rating: (your rating, as a float between 0 and 10)
Now here are the question and answer.
Question: {question}
Answer: {answer}
Feedback:::
Total rating: """
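Since the prompt ends with "Total rating: ", the judge's reply usually starts with the number itself, but some models repeat the label or add text around it. Below is a minimal sketch of how the score could be pulled out of such a reply and rescaled to the 0 to 1 range. The function name `extract_rating` and the `scale_max` parameter are hypothetical and chosen for illustration; they are not taken from the HF cookbook.

```python
import re
from typing import Optional

# Matches an optionally negative integer or decimal, e.g. "7", "7.5", "-1".
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_rating(judge_output: str, scale_max: float = 10.0) -> Optional[float]:
    """Return the judge's rating rescaled to [0, 1], or None if it cannot be parsed.

    Hypothetical helper: assumes the scale starts at 0 and ends at scale_max.
    """
    # If the judge repeated the "Total rating:" label, read what follows it.
    # Otherwise the prompt already ended with the label, so read from the start.
    _, label, tail = judge_output.partition("Total rating:")
    text = tail if label else judge_output

    match = _NUMBER.search(text)
    if match is None:
        return None  # the judge did not produce a number

    value = float(match.group())
    if not 0 <= value <= scale_max:
        return None  # outside the scale requested in the prompt

    return value / scale_max
```

Returning `None` instead of raising keeps unparseable replies visible, so they can be counted and inspected separately rather than silently skewing the average score.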