* QA eval dataset as argument, with hotpot and 2wikimultihop as options; JSON schema validation for datasets (see the sketch below)
* Load dataset file by filename, outsource utilities
* Restructure metric selection
* Add comprehensiveness, diversity and empowerment metrics
* Add promptfoo as an option
* Refactor RAG solution in eval
* LLM-as-a-judge metrics implemented in a uniform way
* Use requests.get instead of wget
* Clean up promptfoo config template
* Minor fixes
* Get promptfoo path instead of hardcoding
* Minor fixes
* Add LLM-as-a-judge prompts
* Minor refactor and logger usage
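A minimal sketch of what the dataset loading and JSON schema validation mentioned above could look like. The schema (records with "question" and "answer" fields) and the load_qa_dataset helper are assumptions for illustration, not the repo's actual dataset contract.

# Sketch only: assumes each QA record carries "question" and "answer" fields.
import json

from jsonschema import validate  # pip install jsonschema

QA_DATASET_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "answer": {"type": "string"},
        },
        "required": ["question", "answer"],
    },
}

def load_qa_dataset(filename: str) -> list[dict]:
    """Load a QA eval dataset file (e.g. hotpot or 2wikimultihop) and validate it."""
    with open(filename, encoding="utf-8") as f:
        dataset = json.load(f)
    # Raises jsonschema.ValidationError if the file does not match the expected shape.
    validate(instance=dataset, schema=QA_DATASET_SCHEMA)
    return dataset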
# LLM-as-a-judge metrics as described here: https://arxiv.org/abs/2404.16130
llm_judge_prompts = {
    "correctness": "Determine whether the actual output is factually correct based on the expected output.",
    "comprehensiveness": "Determine how much detail the answer provides to cover all the aspects and details of the question.",
    "diversity": "Determine how varied and rich the answer is in providing different perspectives and insights on the question.",
    "empowerment": "Determine how well the answer helps the reader understand and make informed judgements about the topic.",
    "directness": "Determine how specifically and clearly the answer addresses the question.",
}
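One way these prompts can back "LLM-as-a-judge metrics implemented in a uniform way" is a single scoring helper that takes the metric name and builds the judge prompt from the dict above. This is a sketch under assumptions: ask_llm is a hypothetical stand-in for whatever LLM client the eval actually uses, and the 1-5 scale is illustrative.

# Sketch: one judge function for every metric in llm_judge_prompts.
# `ask_llm` is a hypothetical LLM client call, not part of the repo.
def judge(metric: str, question: str, actual_output: str, expected_output: str = "") -> float:
    """Score `actual_output` on a 1-5 scale for the given metric."""
    prompt = (
        f"{llm_judge_prompts[metric]}\n\n"
        f"Question: {question}\n"
        f"Actual output: {actual_output}\n"
        f"Expected output: {expected_output}\n\n"
        "Reply with a single number from 1 (worst) to 5 (best)."
    )
    reply = ask_llm(prompt)  # hypothetical: returns the model's text reply
    return float(reply.strip())

# Usage: score one answer on all metrics uniformly.
# scores = {m: judge(m, question, answer, gold_answer) for m in llm_judge_prompts}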