Fine-Tuning LLMs with Reinforcement Learning from Human Feedback (RLHF)
January 21, 2025 · Author(s): Ganesh Bajaj · Originally published on Towards AI.

Reinforcement Learning from Human Feedback (RLHF) allows LLMs to learn directly from feedback on their own generated responses. By incorporating human preferences into the training process, RLHF enables the development of LLMs that are better aligned with user needs and values.

This article covers the core concepts of RLHF, its implementation steps, challenges, and advanced techniques such as Constitutional AI.

[Image taken from the Deeplearning.ai course "Generative AI with LLMs"]

When text generation is framed as a reinforcement learning problem, the standard RL components map onto the LLM as follows (a minimal code sketch follows this list):

Agent: The LLM acts as the agent, and its job is to generate text. Its objective is to maximize the alignment of its generations with human preferences such as helpfulness, accuracy, relevance, and non-toxicity.

Environment: The environment is the LLM's context window, i.e., the space in which text can be entered via a prompt.

State: The state is the current context within the context window that the model considers when generating the next token. It includes the prompt and the text generated up to the current point.

Action: The LLM's action is generating a single token (a word, sub-word, or character) from its vocabulary.

Action Space: The action space comprises the entire vocabulary of the LLM, from which the model chooses the next token to generate. The size of the action space is therefore the size of the LLM's vocabulary.
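To make this mapping concrete, here is a minimal sketch in Python of a single generation episode under the framing above. The names (VOCAB, toy_policy, toy_reward, rollout) are illustrative stand-ins introduced here, not the article's code: a real RLHF setup would replace them with an actual LLM policy and a reward model trained on human preference data.

```python
# Illustrative sketch only: agent = policy that picks tokens, state = prompt plus
# generated text, action = one token, action space = vocabulary. toy_policy and
# toy_reward are hypothetical stand-ins for a real LLM and a learned reward model.
import random

VOCAB = ["the", "model", "gives", "a", "helpful", "answer", "<eos>"]  # action space

def toy_policy(state):
    # Agent: choose the next token (action) given the current state.
    # A real LLM would produce a probability distribution over the vocabulary.
    return random.choice(VOCAB)

def toy_reward(generated_tokens):
    # Stand-in for a human-preference reward model: a crude score that
    # favors "helpful"-looking completions.
    return sum(tok in {"helpful", "answer"} for tok in generated_tokens)

def rollout(prompt, max_new_tokens=10):
    # Environment loop: the state is the prompt plus all tokens generated so far.
    state = list(prompt)
    for _ in range(max_new_tokens):
        action = toy_policy(state)   # one token drawn from the action space
        state.append(action)
        if action == "<eos>":
            break
    return state, toy_reward(state[len(prompt):])

if __name__ == "__main__":
    completion, reward = rollout(["explain", "rlhf"])
    print(" ".join(completion), "| reward:", reward)
```

In actual RLHF training, the reward assigned to each completion by the reward model is fed to a reinforcement learning algorithm (commonly PPO) that updates the policy's weights so that higher-reward generations become more likely.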