Using LLMs to judge LLM outputs might seem like the fox guarding the henhouse, but it turns out to work pretty well (and it scales better than human review).