Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Du, Yihan; Winnicki, Anna; Dalal, Gal; Mannor, Shie; Srikant, R.

Computer Science > Machine Learning

arXiv:2402.10342 (cs)

[Submitted on 15 Feb 2024]

Abstract: Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm is based on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed, and the algorithm relies on trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which provide insight into why a small amount of human feedback may be sufficient to get good performance with RLHF. A key novelty is our trajectory-level elliptical potential analysis technique, used to infer reward function parameters when comparison queries rather than reward observations are used. We provide and analyze algorithms in two settings, linear and neural function approximation: PG-RLHF and NN-PG-RLHF, respectively.
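As a rough illustration of what inferring a reward function from trajectory-based comparison feedback can look like in the linear setting, the sketch below fits a reward parameter vector from pairwise trajectory preferences. It is not the paper's PG-RLHF algorithm: the abstract does not state the preference model, so the logistic (Bradley-Terry style) link, the ridge penalty, the synthetic data, and every function name here are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def trajectory_features(traj):
    """Sum the per-step feature vectors of one trajectory."""
    dim = len(traj[0])
    return [sum(step[i] for step in traj) for i in range(dim)]


def simulate_comparisons(true_theta, n_queries, horizon, rng):
    """Hypothetical data generator: random trajectory pairs labelled by a
    simulated annotator who follows an assumed logistic preference model."""
    dim = len(true_theta)
    data = []
    for _ in range(n_queries):
        traj_a = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(horizon)]
        traj_b = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(horizon)]
        feat_a, feat_b = trajectory_features(traj_a), trajectory_features(traj_b)
        diff = [a - b for a, b in zip(feat_a, feat_b)]
        prefers_a = 1 if rng.random() < sigmoid(dot(true_theta, diff)) else 0
        data.append((feat_a, feat_b, prefers_a))
    return data


def fit_reward(comparisons, dim, lr=0.3, reg=0.01, steps=300):
    """Regularized maximum-likelihood estimate of theta by gradient ascent.

    Each comparison is (features_a, features_b, label), label = 1 if
    trajectory a was preferred. Only comparisons are used; no per-step
    reward is ever observed.
    """
    theta = [0.0] * dim
    n = len(comparisons)
    for _ in range(steps):
        grad = [-reg * t for t in theta]  # gradient of the ridge penalty
        for feat_a, feat_b, label in comparisons:
            diff = [a - b for a, b in zip(feat_a, feat_b)]
            err = label - sigmoid(dot(theta, diff))
            for i in range(dim):
                grad[i] += err * diff[i] / n
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    queries = simulate_comparisons([1.0, -0.5], n_queries=400, horizon=3, rng=rng)
    print(fit_reward(queries, dim=2))
```

The estimate depends on the data only through differences of trajectory-level feature sums, which is one way to see why an analysis of comparison queries has to work at the trajectory level rather than with per-step reward observations.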

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2402.10342 [cs.LG]
	(or arXiv:2402.10342v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.10342

Submission history

From: Yihan Du
[v1] Thu, 15 Feb 2024 22:11:18 UTC (80 KB)
