As an example of the different strategies that emerge from ON- and OFF-policy TD algorithms, I use a task described in the new edition of Sutton and Barto (2017). You can download the entire book for free here: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf (see chapter 6).
In this task the agent has to navigate the environment from its starting state (see the magenta/blue circle in the picture below) to the goal “R”. Each step in the environment is punished with a negative reward (reward=-1), so the shortest path is the optimal solution. However, the agent cannot simply walk in a straight line from the starting point to its goal: a “wind” pushes it upwards (by either one or two blocks) on every transition that starts in a state in columns 4 to 9. The agent therefore has to develop a strategy that takes these “windy” states into account and compensates for the upward movement in order to reach the goal state. To allow for exploration, both algorithms follow an ε-greedy strategy: with ε=0.1 the agent chooses the action with the highest estimated value in 90% of the cases and a random action in the remaining 10%.
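The environment and the ε-greedy rule described above can be sketched in a few lines of Python. This is a minimal sketch and not the code from the archive: the grid size, wind strengths, start and goal positions are assumptions taken from the windy gridworld example in the book, and the names `step` and `epsilon_greedy` are hypothetical.

```python
import random

# Assumed layout (windy gridworld from Sutton and Barto, chapter 6):
# 7 rows x 10 columns, row 0 at the top, one wind strength per column.
N_ROWS, N_COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # columns 4 to 9 (1-based) are windy
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Apply one move plus the wind of the column the agent is leaving."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row = row + d_row - WIND[col]           # wind pushes upwards (smaller row)
    new_col = col + d_col
    new_row = min(max(new_row, 0), N_ROWS - 1)  # stay inside the grid
    new_col = min(max(new_col, 0), N_COLS - 1)
    next_state = (new_row, new_col)
    # Every step costs -1; the episode ends when the goal is reached.
    return next_state, -1, next_state == GOAL


def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Note that the random branch can also land on the greedy action, so in this sketch the greedy action is chosen slightly more often than 90% of the time.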
The two algorithms record the consequences (in terms of rewards and punishments) of the actions performed in each state in significantly different ways. On the one hand, OFF-policy TD learning (or Q-learning) uses the maximum value associated with any of the four actions (up-down-left-right) in the newly reached state when it ascribes a new value to the action that has just been completed. On the other hand, ON-policy TD learning (or SARSA) uses the value of the action that the agent actually selects in the newly reached state under its ε-greedy policy, so occasional exploratory moves are reflected in the values it learns.
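In their standard textbook form the two updates differ only in the target they bootstrap from. The sketch below assumes a Q-table stored as a dictionary mapping each state to a list of four action values; the function names and the default values for `alpha` and `gamma` are illustrative assumptions, not the settings used in the archive.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Off-policy: bootstrap from the best action value in the next state."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """On-policy: bootstrap from the action actually chosen in the next state."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

For example, if the next state has action values `[2.0, 4.0]` and the agent picks the first action there, Q-learning still bootstraps from 4.0 while SARSA bootstraps from 2.0. In both cases the values of the goal state are assumed to stay at zero, so the target for the final step reduces to the reward alone.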
Due to the specifics of this task, both algorithms eventually converge on the same path, despite differences in their internal representation of the environment. This difference can be highlighted graphically with a heat map, where I have used the mean value of all four actions as a general value for each state in the environment.
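The per-state value used for the heat map is just the mean over the action dimension of the Q-table. A minimal sketch with NumPy, assuming the Q-table is stored as an array of shape (rows, columns, actions); the function name `state_values` is hypothetical:

```python
import numpy as np


def state_values(Q):
    """Collapse a (rows, cols, actions) Q-table into one mean value per state."""
    return np.asarray(Q, dtype=float).mean(axis=-1)
```

The resulting 2-D array can be passed to any image-plotting routine (for instance matplotlib's `imshow`) to draw one heat map per algorithm.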
Here you can download the code (zip archive, run “main_wind” for a test):