Interpretable Reinforcement Learning Using Latent Reward Functions

Reinforcement learning can model complex behaviour by training agents that acquire complex skills through interaction with an environment, yet it still faces problems that have prevented its adoption in industry and real-world applications. Interpretability plays a major role in speeding up that adoption. In this work, we address the problem of interpretability by first proposing a definition based on the reward function R, then proposing an algorithm that recovers latent reward functions from trajectories using hierarchical RL (HRL) and an inverse-RL-based segmenter, and finally maps the segmenter output to human-readable text.