<a href="http://www.searching-eye.com/sanjeevs.html" target="_blank">SANJEEV SHARMA</a> : 2nd Nov 2010: Reinforcement Learning: Phase-II, Presentation-2: ARL-10/11 - (Lecture-2): Fixed-Point Estimation of State-Action Value Function & Least-Squares Policy Iteration.
CONTENTS: Fixed-Point Estimation of State-Action Value Function, Least-Squares Policy Iteration (LSPI).
DESCRIPTION: LSPI is an off-policy algorithm and a modification of LSTD. Its advantage over the LSTD algorithm is that it does not require new samples to be collected to compute the value function of each new policy. Being off-policy, it can accept samples generated by any policy, which gives LSPI a data efficiency that the LSTD algorithm lacks. To compute the value function of a policy, LSPI uses the LSQ (LSTDQ) algorithm, which is the off-policy innovation at the heart of LSPI. LSPI returns the fixed-point solution, which is discussed in detail in the lecture.
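To make the description concrete, here is a minimal sketch of LSTDQ and the LSPI loop. The MDP, the one-hot basis functions, and all variable names are illustrative assumptions, not taken from the lecture: a two-state, two-action chain where action `a` moves deterministically to state `a` and landing in state 1 pays reward 1. LSTDQ solves the least-squares fixed-point system `A w = b` for the Q-function of a given policy from an arbitrary batch of samples, and LSPI alternates that evaluation with greedy policy improvement.

```python
import numpy as np

# Hypothetical toy MDP (illustrative, not from the lecture):
# 2 states, 2 actions; action a moves to state a; reward 1 for reaching state 1.
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
DIM = N_STATES * N_ACTIONS  # one-hot basis over (state, action) pairs


def phi(s, a):
    """One-hot basis function for the (state, action) pair."""
    f = np.zeros(DIM)
    f[s * N_ACTIONS + a] = 1.0
    return f


def lstdq(samples, policy, gamma=GAMMA):
    """LSTDQ: least-squares fixed-point estimate of Q^pi.

    Works off-policy: samples (s, a, r, s') may come from any behaviour
    policy; only the successor feature follows the evaluated policy.
    """
    A = np.zeros((DIM, DIM))
    b = np.zeros(DIM)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))  # action chosen by pi at s'
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)


def lspi(samples, n_iter=20):
    """LSPI: iterate LSTDQ evaluation and greedy improvement, reusing one batch."""
    w = np.zeros(DIM)
    for _ in range(n_iter):
        # Greedy policy with respect to the current weight vector.
        policy = lambda s, w=w: int(
            np.argmax([phi(s, a) @ w for a in range(N_ACTIONS)])
        )
        w_new = lstdq(samples, policy)
        if np.allclose(w, w_new):  # policy evaluation stabilised
            break
        w = w_new
    return w


# A fixed batch covering every (s, a) pair, gathered once by an arbitrary
# behaviour policy; LSPI reuses it for every policy it evaluates.
samples = [(s, a, float(a == 1), a)
           for s in range(N_STATES) for a in range(N_ACTIONS)]
w = lspi(samples)
greedy = [int(np.argmax([phi(s, a) @ w for a in range(N_ACTIONS)]))
          for s in range(N_STATES)]
```

On this toy chain the loop converges in a few iterations to the policy that always takes action 1, illustrating the data-efficiency point above: every policy produced during iteration is evaluated on the same sample batch, with no fresh data collection per policy.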