Transfer Learning for Prosthetics Using Imitation Learning

This project applies Reinforcement Learning (RL) to enable a prosthetic to adapt to differences between humans and between walking environments, using the OpenSim simulator to mimic prosthetic motion and appearance. The project is part of the NeurIPS 2018: AI for Prosthetics Challenge.

AI for prosthetics

Codes

Publications and Awards

  • Best Graduation Project at the University of Khartoum, 2018, Khartoum, Sudan

  • Poster paper at NeurIPS 2018 Black in AI workshop, Montreal, Canada

  • Poster at Deep Learning Indaba 2018, South Africa

Objectives

  • Benchmark RL algorithms: Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO).

  • Reduce training time using the imitation learning algorithm Dataset Aggregation (DAgger).

  • Modify the DAgger algorithm to balance exploration and exploitation (a sketch of DAgger follows this list).
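
A minimal sketch of DAgger with a mixed rollout policy is shown below. The `env`, `expert_policy`, and `learner` interfaces and the `beta_decay` schedule are illustrative assumptions, not code from this repository:

```python
import numpy as np

def dagger(env, expert_policy, learner, n_iterations=10, horizon=300, beta_decay=0.5):
    """Dataset Aggregation (DAgger), sketched.

    env           -- Gym-style environment: reset() -> obs, step(a) -> (obs, r, done, info)
    expert_policy -- function mapping an observation to an expert action
    learner       -- object with fit(states, actions) and predict(obs) methods
    """
    states, actions = [], []
    for i in range(n_iterations):
        # beta is the probability of following the expert during the rollout;
        # decaying it shifts the rollouts from exploiting the expert toward
        # exploring with the learned policy.
        beta = beta_decay ** i
        obs = env.reset()
        for _ in range(horizon):
            expert_action = expert_policy(obs)
            # Always record the expert's label for the visited state.
            states.append(obs)
            actions.append(expert_action)
            # Mix expert and learner actions when stepping the environment.
            if i == 0 or np.random.rand() < beta:
                action = expert_action
            else:
                action = learner.predict(obs)
            obs, reward, done, _ = env.step(action)
            if done:
                break
        # Retrain the learner on the aggregated dataset.
        learner.fit(np.array(states), np.array(actions))
    return learner
```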

OpenSim Environment

The OpenSim model has one human leg and a prosthetic in place of the other leg.

Observations

The observations can be divided into five components (an illustrative sketch of the observation dictionary follows this list):

  • Body parts: the agent observes their position, velocity, acceleration, rotation, rotational velocity, and rotational acceleration.

  • Joints: the agent observes their position, velocity, and acceleration.

  • Muscles: the agent observes their activation, fiber force, fiber length, and fiber velocity.

  • Forces: describes the forces acting on body parts.

  • Center of mass: the agent observes the position, velocity, and acceleration.
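
The sketch below shows how these components appear in the dictionary observation of the osim-rl ProstheticsEnv; the key, body, and joint names are recalled from the challenge documentation and should be treated as assumptions:

```python
from osim.env import ProstheticsEnv

env = ProstheticsEnv(visualize=False)
# project=False asks osim-rl for the full state-description dictionary
# instead of the flattened observation vector.
state_desc = env.reset(project=False)

pelvis_height  = state_desc["body_pos"]["pelvis"][1]   # body parts: positions, velocities, ...
hip_angle      = state_desc["joint_pos"]["hip_l"]      # joints: position, velocity, acceleration
muscle_state   = state_desc["muscles"]                 # activation, fiber force/length/velocity
contact_forces = state_desc["forces"]                  # forces acting on body parts
mass_center    = state_desc["misc"]["mass_center_pos"] # center of mass kinematics
```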

Actions

  • Muscle activation, length, and velocity.

  • Joint angles.

  • Tendons.

Reward

The reward is a function of the horizontal velocity vector of the pelvis, which is itself a function of all state variables.

The episode terminates after 300 steps or when the height of the pelvis falls below 0.6 meters.
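
A hedged sketch of such a velocity-based reward and the termination check is given below; the 3 m/s target and the constant 9.0 follow the first round of the challenge as we recall it and are assumptions, not values taken from this repository:

```python
def step_reward(pelvis_vel_x, target_vel_x=3.0):
    # Penalize the squared deviation of the pelvis' forward velocity from the target.
    return 9.0 - (pelvis_vel_x - target_vel_x) ** 2

def episode_done(step_count, pelvis_height, max_steps=300, min_height=0.6):
    # Episode ends after max_steps or when the pelvis drops below min_height meters.
    return step_count >= max_steps or pelvis_height < min_height
```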

Algorithms and Hyperparameters

  • DDPG is a model-free, off-policy actor-critic algorithm that uses deep function approximators to learn policies in high-dimensional, continuous action spaces. DDPG is based on the deterministic policy gradient (DPG) algorithm and combines the actor-critic approach with insights from the success of the Deep Q-Network (DQN).

  • PPO is a policy optimization method that uses multiple epochs of stochastic gradient ascent to perform each policy update (its clipped surrogate objective is sketched after this list).

  • TRPO is a model-free, on-policy optimization method that is effective for optimizing large nonlinear policies such as neural networks.
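
As a reference for the PPO bullet above, here is a small sketch of PPO's clipped surrogate objective for one minibatch; it is illustrative rather than the ChainerRL implementation, and the clipping value of 0.2 is an assumed default:

```python
import numpy as np

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = np.exp(new_log_probs - old_log_probs)            # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO maximizes the minimum of the unclipped and clipped terms, which
    # removes the incentive to move the policy too far in a single update.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```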

Results

                  TRPO    PPO   DDPG
Mean Reward         43    -58    -42
Maximum Reward     194     70    113


Demo

  • Random Actions

Random

TRPO

PPO

DDPG

Discussion

  • OpenSim ProstheticsEnv is a very complex environment: it contains more than 158 continuous state variables and 19 continuous action variables.

  • RL algorithms take a long time to learn a policy complex enough to process all state variables and select the actions that maximize the reward.

  • The DDPG algorithm achieves a good reward because it is designed for high-dimensional continuous action spaces and uses a replay buffer.

  • PPO has the least training time compared to DDPG and TRPO because it uses first-order gradient optimization instead of the conjugate gradient algorithm used by TRPO.

  • The TRPO algorithm achieved the maximum reward, but it takes time to reach the "trusted" region, so it is slower than DDPG and PPO.

Limitations

  • The prosthetic model cannot walk long distances.

  • Each experiment was run only once; we plan to repeat each experiment several times with different random seeds and report the mean and variance.

  • We used the same hyperparameters for all algorithms to keep the benchmark fair; we still need to select the best hyperparameters for each algorithm and environment.

  • We benchmarked only three RL algorithms, all from one library (ChainerRL); we plan to try different implementations.

  • We only transfer learning between similar agents.

Installation

ChainerRL library

  • ChainerRL is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement learning algorithms in Python using Chainer, a flexible deep learning framework.

  • ChainerRL provides implementations of DQN, DDPG, TRPO, PPO, and other reinforcement learning algorithms.

Environment

To model physics and biomechanics we use OpenSim - a biomechanical physics environment for musculoskeletal simulations.

Installing

Install the OpenSim environment

conda create -n opensim-rl -c kidzik opensim python=3.6.1
source activate opensim-rl

Install the ChainerRL library

pip install chainerrl
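
After installation, a minimal smoke test such as the following can be used to check the setup by running the prosthetics environment with random actions; it assumes the osim-rl package that provides ProstheticsEnv is installed in the same environment:

```python
from osim.env import ProstheticsEnv

env = ProstheticsEnv(visualize=False)
observation = env.reset()
total_reward = 0.0
for _ in range(300):                      # episodes terminate after 300 steps
    action = env.action_space.sample()    # random muscle activations
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        break
print("episode reward:", total_reward)
```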

References

  • T. Garikayi, D. van den Heever and S. Matope (2016). Robotic prosthetic challenges for clinical applications. IEEE International Conference on Control and Robotics Engineering (ICCRE), Singapore, pp. 1-5. doi: 10.1109/ICCRE.2016.7476146

  • Joshi, Girish & Chowdhary, Girish (2018). Cross-Domain Transfer in Reinforcement Learning using Target Apprentice.

  • Lillicrap, Timothy & Hunt, Jonathan J. & Pritzel, Alexander & Heess, Nicolas & Erez, Tom & Tassa, Yuval & Silver, David & Wierstra, Daan (2015). Continuous control with deep reinforcement learning. CoRR.

  • Attia, Alexandre & Dayan, Sharone (2018). Global overview of Imitation Learning.

  • Cheng, Qiao & Wang, Xiangke & Shen, Lincheng (2017). An Autonomous Inter-task Mapping Learning Method via Artificial Neural Network for Transfer Learning. doi: 10.1109/ROBIO.2017.8324510

  • J.J. Zhu (2017). DAgger algorithm implementation. GitHub repository, https://github.com/jj-zhu/jadagger