Prometheus 2: An open source language model that accurately reflects human and GPT-4 judgments when evaluating other language models

Natural language processing (NLP) aims to enable computers to understand and interact with human language. A key challenge in NLP is evaluating language models (LMs) that generate responses across different tasks. The variety of these tasks makes it difficult to assess response quality effectively. As LMs grow more capable, proprietary models such as GPT-4 often provide strong evaluation capabilities but suffer from a lack of transparency, limited control, and high cost. This calls for reliable open-source alternatives that can assess LM output effectively without these drawbacks.

The problem is twofold: it concerns both the quality of the evaluations and the scalability of evaluation mechanisms. Existing open-source evaluators have several limitations. Many do not support both direct scoring and pairwise ranking, the two most common forms of evaluation, which limits their adaptability to real-world scenarios. They also prioritize general attributes such as helpfulness and harmlessness, and they assign scores that differ significantly from human ratings. This inconsistency makes their evaluations unreliable and calls for improved evaluator models that accurately reflect human judgments.
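To make the two formats concrete, here is a minimal sketch of what an evaluator receives and returns in each case. The class names, field names, the 1 to 5 score range, and the "A"/"B" labels are illustrative assumptions, not the Prometheus 2 interface.

```python
from dataclasses import dataclass


@dataclass
class DirectScoringRequest:
    """One response to grade on an absolute scale (hypothetical structure)."""
    instruction: str
    response: str
    criterion: str


@dataclass
class PairwiseRankingRequest:
    """Two responses to the same instruction; the evaluator picks the better one."""
    instruction: str
    response_a: str
    response_b: str
    criterion: str


def is_valid_score(score: int, low: int = 1, high: int = 5) -> bool:
    """Direct scoring verdict: an integer on an assumed low..high scale."""
    return isinstance(score, int) and low <= score <= high


def is_valid_preference(choice: str) -> bool:
    """Pairwise ranking verdict: a label naming the preferred response."""
    return choice in ("A", "B")


# Toy inputs, invented for illustration only.
direct = DirectScoringRequest(
    instruction="Explain what overfitting is.",
    response="Overfitting is when a model memorizes its training data.",
    criterion="Is the explanation accurate and complete?",
)
pairwise = PairwiseRankingRequest(
    instruction=direct.instruction,
    response_a=direct.response,
    response_b="Overfitting is a kind of optimizer.",
    criterion=direct.criterion,
)
```

An evaluator limited to one format can handle only one of these request types, which is the adaptability gap described above.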

Research teams have tried to fill these gaps with various methods, but most approaches lack flexibility and cannot closely simulate human judgments. Proprietary models such as GPT-4 remain expensive and opaque, which hinders their widespread use for evaluation. To address this, a research team from KAIST AI, LG AI Research, Carnegie Mellon University, MIT, the Allen Institute for AI, and the University of Illinois Chicago introduced Prometheus 2, a new open-source evaluator LM. The model is designed to provide transparent, scalable, and controllable assessments while aiming to match the evaluation quality of proprietary models.

Prometheus 2 was developed by merging two evaluator LMs: one trained exclusively on direct scoring, the other on pairwise ranking. Merging these models produced a unified evaluator that can assess LM responses in both formats. The researchers also used the newly developed Preference Collection dataset, which includes 1,000 evaluation criteria, to further refine the model's capabilities. The merge itself is linear: the two models are combined so that the result retains the strengths of both evaluation formats and performs well across evaluation tasks.
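As a rough illustration of linear merging, the sketch below interpolates two sets of weights parameter by parameter. It is a toy under stated assumptions, not the authors' implementation: the function name `linear_merge`, the mixing coefficient `alpha`, its default of 0.5, and the tiny weight dictionaries are all hypothetical.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def linear_merge(direct: Weights, pairwise: Weights, alpha: float = 0.5) -> Weights:
    """Return alpha * direct + (1 - alpha) * pairwise for every parameter.

    alpha is an assumed mixing coefficient; the article does not state a value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if direct.keys() != pairwise.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Weights = {}
    for name in direct:
        a, b = direct[name], pairwise[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
    return merged


# Invented weights standing in for the two separately trained evaluators.
direct_weights = {"layer.weight": [0.25, -0.5], "layer.bias": [1.0]}
pairwise_weights = {"layer.weight": [0.75, 0.0], "layer.bias": [0.0]}

merged_weights = linear_merge(direct_weights, pairwise_weights, alpha=0.5)
```

Linear merging of this kind only makes sense when both models share the same architecture and parameter names, which is why the sketch checks for matching keys and lengths before interpolating.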

In benchmark tests, the model showed the highest correlation with human and proprietary evaluators on four direct scoring benchmarks: Vicuna Bench, MT Bench, FLASK, and Feedback Bench. Pearson correlations exceeded 0.5 on all benchmarks, reaching 0.878 and 0.898 on Feedback Bench for the 7B and 8x7B models, respectively. On four pairwise ranking benchmarks (HHH Alignment, MT Bench Human Judgment, Auto-J Eval, and Preference Bench), Prometheus 2 outperformed existing open-source models and achieved accuracy scores above 85%. Results on Preference Bench, an in-domain test set for Prometheus 2, further illustrated the robustness and versatility of the model.
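The two metrics reported above can be sketched in a few lines of plain Python: Pearson correlation for direct scoring (evaluator scores against reference scores) and accuracy for pairwise ranking (evaluator preferences against reference preferences). The function names are hypothetical, and the scores and labels in the tests of this sketch would be toy data, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def pairwise_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of pairs where the evaluator prefers the same response as the reference."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(predicted)


# Toy data, invented for illustration.
human_scores = [1, 2, 3, 4, 5]
evaluator_scores = [2, 2, 3, 5, 5]
correlation = pearson(human_scores, evaluator_scores)

human_choices = ["A", "B", "B", "B"]
evaluator_choices = ["A", "B", "A", "B"]
accuracy = pairwise_accuracy(evaluator_choices, human_choices)
```

A correlation near 1 means the evaluator ranks and spaces responses much like the reference raters do, while pairwise accuracy simply counts agreement on which response is better.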

Prometheus 2 narrowed the performance gap with proprietary evaluators such as GPT-4 on several benchmarks. On the FLASK benchmark, the model halved the gap to GPT-4 in correlation with human judgments, and it achieved 84% accuracy on the HHH Alignment benchmark. This highlights the potential for open-source evaluators to replace expensive proprietary solutions while still providing comprehensive and accurate assessments.

In summary, the lack of transparent, scalable, and adaptable LM evaluators that accurately reflect human judgment is a significant challenge in NLP. To address it, the researchers developed Prometheus 2, an open-source evaluator built by linearly merging two models trained separately on direct scoring and pairwise ranking. The unified model outperformed previous open-source models in benchmark tests, showing high accuracy and correlation while significantly narrowing the performance gap with proprietary models. Prometheus 2 is a notable advance in open-source evaluation and offers a robust alternative to proprietary solutions.

Asif Razzaq is CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which features in-depth coverage of machine learning and deep learning news that is both technically sound and easy to understand for a wide audience. The platform has more than 2 million monthly views, which shows its popularity among the audience.
