Self-Play Preference Optimization (SPPO): An innovative machine learning approach to fine-tuning large language models (LLMs) from human/AI feedback

Large Language Models (LLMs) have demonstrated remarkable abilities in generating human-like text, answering questions, and writing code. However, they still face hurdles in applications that require high reliability, safety, and ethical compliance. Reinforcement Learning from Human Feedback (RLHF), also called Preference-based Reinforcement Learning (PbRL), has emerged as a promising solution. This framework has shown significant success in fine-tuning LLMs to align them with human preferences and increase their usefulness.

Existing RLHF approaches such as InstructGPT rely on explicit or implicit reward models, e.g., the Bradley-Terry model. Recent research works directly with preference probabilities to better represent human preferences. Some researchers formulate RLHF as finding Nash equilibria in constant-sum games and propose mirror descent and Self-play Preference Optimization (SPO) methods. Direct Nash Optimization (DNO) was also introduced, based on win-rate gaps, but its practical implementation still relies on iterative DPO frameworks.
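
To make the contrast concrete, here is a minimal sketch of how the Bradley-Terry model turns scalar rewards into a preference probability. The reward values and the `cyclic_pref` table are made up for illustration and do not come from the paper. Because Bradley-Terry preferences are derived from one score per response, they are always transitive, whereas a table of direct preference probabilities can hold a cycle that no reward assignment reproduces.

```python
import math


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(a is preferred over b) under Bradley-Terry: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


# Hypothetical scalar rewards for three responses to the same prompt.
rewards = {"a": 1.2, "b": 0.4, "c": -0.3}

p_ab = bradley_terry_prob(rewards["a"], rewards["b"])
p_bc = bradley_terry_prob(rewards["b"], rewards["c"])
p_ac = bradley_terry_prob(rewards["a"], rewards["c"])

# Since a beats b and b beats c, Bradley-Terry forces a to beat c as well.
print(f"P(a>b)={p_ab:.3f}  P(b>c)={p_bc:.3f}  P(a>c)={p_ac:.3f}")

# A direct preference table (illustrative numbers) can encode a cycle:
# a beats b, b beats c, and c beats a. No single reward per response fits this.
cyclic_pref = {("a", "b"): 0.7, ("b", "c"): 0.7, ("c", "a"): 0.7}
```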

Researchers from the University of California, Los Angeles and Carnegie Mellon University present a robust self-play framework, Self-Play Preference Optimization (SPPO), for language model alignment that addresses these RLHF challenges. It offers provable guarantees for solving two-player constant-sum games and scales to large language models. By formulating RLHF as such a game, the goal becomes identifying the Nash equilibrium policy, whose responses are consistently preferred. They propose an adaptive algorithm based on multiplicative weights that uses a self-play mechanism in which the policy fine-tunes itself on synthetic data annotated by the preference model.
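
The game formulation described above can be written out as follows. The notation is assumed here for illustration and is not spelled out in this article: x is a prompt, y and y' are responses, π and π' are the two players' policies, and η > 0 is a step size. The first line defines the probability that one policy is preferred over another and the constant-sum property, the second the Nash equilibrium policy, and the third a multiplicative weight update against the current policy.

```latex
% Notation assumed for illustration; see the lead-in above.
\mathbb{P}(\pi \succ \pi') = \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
  \big[\mathbb{P}(y \succ y' \mid x)\big],
\qquad
\mathbb{P}(\pi \succ \pi') + \mathbb{P}(\pi' \succ \pi) = 1

\pi^{*} = \arg\max_{\pi}\,\min_{\pi'}\,\mathbb{P}(\pi \succ \pi')

\pi_{t+1}(y \mid x) \;\propto\; \pi_{t}(y \mid x)\,\exp\!\big(\eta\,\mathbb{P}(y \succ \pi_{t} \mid x)\big)
```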

The SPPO framework aims to solve two-player constant-sum games efficiently and at scale for large language models. It uses an iterative procedure based on multiplicative weight updates and a self-play mechanism. The algorithm asymptotically converges to the optimal policy, which is the Nash equilibrium, and the theoretical analysis provides provable convergence guarantees. Compared to existing methods such as DPO and IPO, SPPO shows improved convergence and efficiently addresses data scarcity issues.
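
As a toy illustration of the iterative procedure, the sketch below runs multiplicative weight updates in self-play over just three candidate responses. The preference matrix `PREF`, the step size, and the function names are hypothetical, and a real system would update a neural policy rather than an explicit probability vector. In this example response 0 beats every other response, so the policy should concentrate on it.

```python
import math


def win_rate_vs_policy(pref, policy, y):
    """Probability that response y beats a response sampled from `policy`."""
    return sum(policy[j] * pref[y][j] for j in range(len(policy)))


def self_play_mwu(pref, eta=1.0, steps=200):
    """Multiplicative weights where the policy plays against its own last iterate."""
    n = len(pref)
    policy = [1.0 / n] * n  # start from the uniform policy
    for _ in range(steps):
        # Upweight each response by the exponential of its win rate against
        # the current policy, then renormalize to a probability vector.
        weights = [
            policy[y] * math.exp(eta * win_rate_vs_policy(pref, policy, y))
            for y in range(n)
        ]
        total = sum(weights)
        policy = [w / total for w in weights]
    return policy


# Hypothetical preference matrix: PREF[i][j] = P(response i beats response j),
# with PREF[i][j] + PREF[j][i] = 1 (constant-sum game).
PREF = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

final_policy = self_play_mwu(PREF)
print([round(p, 4) for p in final_policy])
```

In games without a single dominant response, the last iterate of such updates can oscillate, and it is typically the averaged policy that is analyzed, so this toy case should be read as the easy end of the spectrum.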

The researchers use GPT-4 as an automatic judge and report results on AlpacaEval 2.0 and MT-Bench. SPPO models improve consistently across iterations, with SPPO Iter3 achieving the highest win rate. Compared to DPO and IPO, SPPO achieves superior performance and effectively controls output length. Test-time reranking with the PairRM reward model consistently improves model performance without causing over-optimization. SPPO outperforms many state-of-the-art chatbots on AlpacaEval 2.0 and remains competitive with GPT-4 on MT-Bench.
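
Test-time reranking can be pictured as sampling several candidate responses and keeping the one a pairwise preference model likes best. The sketch below assumes a generic `prefers(a, b)` function that returns the probability that `a` beats `b`. It does not use the real PairRM interface, and `toy_prefers` with its `QUALITY` scores is a made-up stand-in for it.

```python
import math


def rerank_best_of_n(candidates, prefers):
    """Return the candidate with the highest mean win probability against the rest."""
    if len(candidates) == 1:
        return candidates[0]

    def mean_win_rate(i):
        others = [j for j in range(len(candidates)) if j != i]
        return sum(prefers(candidates[i], candidates[j]) for j in others) / len(others)

    best_index = max(range(len(candidates)), key=mean_win_rate)
    return candidates[best_index]


# Hypothetical quality scores standing in for a learned pairwise ranker.
QUALITY = {"short answer": 0.1, "detailed answer": 0.9, "off-topic answer": -1.0}


def toy_prefers(a, b):
    """Toy preference model: sigmoid of the gap between made-up quality scores."""
    return 1.0 / (1.0 + math.exp(-(QUALITY[a] - QUALITY[b])))


best = rerank_best_of_n(list(QUALITY), toy_prefers)
print(best)
```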

In conclusion, the paper presents Self-Play Preference Optimization (SPPO), a robust method for fine-tuning LLMs using human/AI feedback. By using self-play in a two-player game and a preference-based learning objective, SPPO significantly outperforms existing methods such as DPO and IPO on several benchmarks. By integrating a preference model and batch estimation, SPPO closely aligns LLMs with human preferences and mitigates reward-hacking issues such as “length bias”. These results suggest that SPPO has the potential to improve the alignment of generative AI systems and argue for its wider adoption in LLMs and beyond.
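
For readers who want a feel for what a preference-based objective with batch estimation might look like, here is a rough sketch. It assumes a squared-error form in which the log-ratio between the new and the current policy is regressed onto the estimated win rate minus one half, scaled by a step size `eta`. The article does not state the objective, so this form, the function names, the toy scores, and `eta = 1.0` are all illustrative assumptions rather than the authors' implementation.

```python
import math


def estimate_win_rates(responses, prefers):
    """Batch estimate of each sample's win rate against the current policy:
    the average preference over all K sampled responses (itself included)."""
    k = len(responses)
    return [sum(prefers(y, other) for other in responses) / k for y in responses]


def sppo_style_loss(logp_new, logp_old, win_rates, eta):
    """Mean squared gap between the log-ratio and eta * (win rate - 1/2)."""
    terms = [
        ((ln - lo) - eta * (w - 0.5)) ** 2
        for ln, lo, w in zip(logp_new, logp_old, win_rates)
    ]
    return sum(terms) / len(terms)


# Toy setup: three sampled responses with made-up quality scores.
SCORES = {"r0": 0.0, "r1": 1.0, "r2": 2.0}


def toy_prefers(a, b):
    """Toy preference model: sigmoid of the gap between made-up scores."""
    return 1.0 / (1.0 + math.exp(-(SCORES[a] - SCORES[b])))


eta = 1.0  # arbitrary illustrative step size
samples = list(SCORES)
win_rates = estimate_win_rates(samples, toy_prefers)
logp_old = [-2.0, -2.5, -3.0]  # hypothetical log-probs under the current policy

# Leaving the policy unchanged incurs a positive loss; shifting each log-prob
# by eta * (win rate - 1/2) drives this sketch's loss to zero.
loss_unchanged = sppo_style_loss(logp_old, logp_old, win_rates, eta)
logp_target = [lo + eta * (w - 0.5) for lo, w in zip(logp_old, win_rates)]
loss_target = sppo_style_loss(logp_target, logp_old, win_rates, eta)
print(loss_unchanged, loss_target)
```

Under this assumed form, responses that win more than half the time have their log-probability pushed up and the rest pushed down, which mirrors the multiplicative weight update in log space.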

All credit for this research goes to the researchers of this project.


Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is constantly exploring the applications of machine learning in healthcare.
