
Counterfactual Multi-Agent Policy Gradients (COMA)

  • Raymond Danks
  • Jun 28, 2020
  • 1 min read

Recently, I implemented the COMA paper (https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193/16614), a Multi-Agent Reinforcement Learning (MARL) method that encourages cooperation and communication between agents.
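At its core, COMA trains a centralised critic and gives each agent a counterfactual advantage: the critic's value of the joint action actually taken, minus a baseline that marginalises out that agent's own action while holding the other agents' actions fixed. Below is a minimal NumPy sketch of that calculation; the array shapes and names are my own illustrative assumptions, not the paper's (or my notebook's) exact code.

```python
import numpy as np

def counterfactual_advantage(q_values, policy_probs, chosen_actions):
    """Counterfactual advantage for each agent, in the spirit of COMA.

    q_values:       (n_agents, n_actions) - Q(s, (u^-a, u'^a)) for every
                    alternative action u'^a of agent a, with the other
                    agents' actions held fixed at their chosen values.
    policy_probs:   (n_agents, n_actions) - each agent's current policy
                    pi^a(u'^a | tau^a).
    chosen_actions: (n_agents,)           - the joint action actually taken.
    """
    n_agents = q_values.shape[0]
    advantages = np.empty(n_agents)
    for a in range(n_agents):
        # Counterfactual baseline: expected Q over agent a's own actions only.
        baseline = np.dot(policy_probs[a], q_values[a])
        # Advantage of the action agent a actually took, relative to that baseline.
        advantages[a] = q_values[a, chosen_actions[a]] - baseline
    return advantages
```

Because the baseline depends only on the other agents' actions, subtracting it does not bias the policy gradient, but it gives each agent a much cleaner credit-assignment signal than a shared team reward alone.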


This implementation demonstrated and reinforced my ability to build usable systems from complex research papers, as well as to understand their flaws and advantages. Specifically, while implementing COMA, I deepened my knowledge of TD (Temporal Difference) learning, Policy Gradients and A2C (Advantage Actor-Critic), which are all core Reinforcement Learning concepts.
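As one example of where the TD machinery appears: COMA's centralised critic is trained against TD(λ)-style targets. The sketch below computes λ-returns for a finished episode against a generic critic estimate; I bootstrap from a state-value-shaped array purely for brevity, and the names and shapes are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def td_lambda_targets(rewards, values, gamma=0.99, lam=0.8):
    """TD(lambda) targets for a single finished episode.

    rewards: (T,)   rewards received at each step.
    values:  (T+1,) critic estimates for each state; values[T] is the
             bootstrap value after the final step (0 if terminal).
    """
    T = len(rewards)
    targets = np.zeros(T)
    # Work backwards: G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})
    next_return = values[T]
    for t in reversed(range(T)):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        targets[t] = next_return
    return targets
```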


My implementation can be seen below (hosted using nbviewer for Jupyter Notebooks) and is also available on my GitHub:



Scroll bars are given at the bottom of each cell below, although the above GitHub repository allows for better viewing and formatting of the implementation.


Moreover, during this implementation, a potential error was spotted within the ma-gym wiki's description of the reward for the combat game.

The reward system described on the ma-gym wiki does not seem to coincide with the ma-gym source code. Specifically, one point is subtracted from the reward whenever a friendly agent is hit, and one point is added whenever an enemy agent is hit. This means the zero case occurs either when no agent hits another (most likely), or when the number of hits on each side is exactly even (unlikely). During exploration, whenever the friendly team randomly meets the enemy team, the hard-coded enemy team usually beats them, so the net reward for the friendly team is negative. With this reward setup, the likely result of learning is therefore for the friendly team to actively avoid the enemy team rather than fight it.
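As a concrete illustration of the reward as I read it in the source code (the function and variable names below are my own, not ma-gym's):

```python
def combat_step_reward(enemy_agents_hit, friendly_agents_hit):
    """Net team reward per step as described above:
    +1 for every opponent hit, -1 for every friendly agent hit."""
    return enemy_agents_hit - friendly_agents_hit

# During random exploration the hard-coded opponents usually win the exchange,
# so the friendly team takes more hits than it lands and the reward is negative:
print(combat_step_reward(enemy_agents_hit=1, friendly_agents_hit=3))  # -2

# Staying away from the enemy guarantees the zero case, which is why learning
# tends to drift towards avoidance rather than engagement:
print(combat_step_reward(enemy_agents_hit=0, friendly_agents_hit=0))  # 0
```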


Below, I have written a more in-depth and rigorous analysis of the COMA paper, along with some analysis of my implementation's performance (and the likely reasons for it):


NB: the explanation below is mathematical and is shown on this blog via screenshots of Google Colab's Markdown cells, since Wix Blogs does not support mathematical (LaTeX) notation.


[Screenshots: mathematical analysis of the COMA paper, rendered from Google Colab Markdown cells]
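For readers who cannot view the screenshots, the central quantities from the paper [6] are restated below in plain LaTeX: the counterfactual advantage for agent a, and the policy gradient that uses it.

```latex
% Counterfactual advantage for agent a: the centralised critic's value of the
% joint action, minus a baseline that marginalises out agent a's own action
% while keeping the other agents' actions u^{-a} fixed.
A^{a}(s, \mathbf{u}) = Q(s, \mathbf{u})
    - \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right)
      Q\!\left(s, \left(\mathbf{u}^{-a}, u'^{a}\right)\right)

% Each agent's policy is then updated with the usual log-probability gradient,
% weighted by this counterfactual advantage.
g = \mathbb{E}_{\pi}\!\left[\sum_{a}
      \nabla_{\theta} \log \pi^{a}\!\left(u^{a} \mid \tau^{a}\right)
      A^{a}(s, \mathbf{u})\right]
```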

Sources: IEEE Referencing used. All references were most recently accessed on 27th April 2020.

[1] A. Violante, “Simple reinforcement learning: Temporal difference learning.” https://medium.com/@violante.andre/simple-reinforcement-learning-temporal-difference-learning-e883ea0d65b0.

[2] nevkontakte, “mountain_car_v3_tf_keras.” https://gist.github.com/nevkontakte/2db02b57345ca521d541f8cdbf4081c5.

[6] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” CoRR, vol. abs/1705.08926, 2017.

[7] oxwhirl, “Python multi-agent reinforcement learning framework.” https://github.com/oxwhirl/pymarl.

[8] Python, “Time complexity.” https://wiki.python.org/moin/TimeComplexity.

[9] Microsoft, “Counterfactual multi-agent policy gradients.” https://www.youtube.com/watch?v=3OVvjE5B9LU&t=1406s.




 
 
 
