Full Name
Priyank Agarwal
University/Company
Columbia University
Talk Title
From Episodes to Eternity: Optimistic Q-Learning for Long-Term Rewards
Abstract
In this talk, I’ll present a new optimistic Q-learning algorithm designed to minimize regret in average-reward reinforcement learning — without relying on strong assumptions about the environment. Specifically, we only assume that every policy reaches a single “frequent” state within a bounded time $H$, a much milder condition than the commonly used requirement of uniformly bounded hitting times for all states.

This setting encompasses traditional episodic RL and significantly broadens the scope of model-free learning for average-reward problems. Our algorithm achieves a regret bound of $\tilde{O}(H^5 S \sqrt{A T})$, showing provable efficiency even in this more general regime.

A key technical ingredient is an averaged Bellman operator $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ (where $L$ is the usual Bellman operator), which contracts under the span seminorm despite operating in the average-reward (undiscounted) setting. We leverage this operator through a Q-learning-style update rule inspired by the episodic case, providing a unified approach to regret minimization that bridges the gap between episodic and non-episodic RL. The talk is based on the paper at https://arxiv.org/abs/2407.13743.
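For intuition, here is a minimal numerical sketch (illustrative only, not the paper's algorithm or code): it builds a small, well-connected random MDP, applies the averaged operator $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$, and checks the span-seminorm contraction empirically. The state/action counts, the horizon parameter, and the random MDP itself are assumptions chosen only for this demo.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): averaged Bellman
# operator \bar{L} v = (1/H) * sum_{h=1}^H L^h v on a small random MDP,
# with an empirical check of span-seminorm contraction.
# S, A, H, and the random MDP below are assumptions made for the demo;
# a dense Dirichlet MDP is well connected, so the contraction shows up clearly.

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4                              # states, actions, horizon parameter

P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, :] = next-state distribution
r = rng.random((S, A))                         # r[s, a]    = mean reward

def bellman(v):
    """Undiscounted Bellman operator: (L v)(s) = max_a [ r(s,a) + E_{s'}[v(s')] ]."""
    return np.max(r + P @ v, axis=1)

def averaged_bellman(v, horizon=H):
    """Averaged operator: mean of the iterates L v, L^2 v, ..., L^H v."""
    iterates, cur = [], v
    for _ in range(horizon):
        cur = bellman(cur)
        iterates.append(cur)
    return np.mean(iterates, axis=0)

def span(v):
    """Span seminorm sp(v) = max(v) - min(v)."""
    return np.max(v) - np.min(v)

# Compare sp(v1 - v2) before and after applying the averaged operator.
for _ in range(3):
    v1, v2 = rng.random(S), rng.random(S)
    before = span(v1 - v2)
    after = span(averaged_bellman(v1) - averaged_bellman(v2))
    print(f"sp(v1 - v2) = {before:.3f}  ->  sp(Lbar v1 - Lbar v2) = {after:.3f}")
```

On an MDP like this one, where every policy quickly visits a common state, the printed span after applying $\overline{L}$ shrinks relative to the span before, which is the contraction property the update rule in the talk builds on.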