
Dr. Michael Watson
Northwestern University

Supply Chain by Design

Dr. Michael Watson, one of the industry’s foremost experts on supply chain network design and advanced analytics, is a columnist and subject matter expert (SME) for Supply Chain Digest.

Dr. Watson, of Northwestern University, was the lead author of the just-released book Supply Chain Network Design, co-authored with Sara Lewis, Peter Cacioppi, and Jay Jayaraman, all of IBM. (See Supply Chain Network Design – the Book.)

In addition to teaching at Northwestern, Watson is a founding partner at Opex Analytics. 

November 1, 2017

Reinforcement Learning Explained Using the Beer Game

This article was written by Larry Snyder of Opex Analytics and Lehigh University

Reinforcement learning (RL) is a hot topic in the artificial intelligence/machine learning world these days. Google’s DeepMind group used “deep” RL to play classic Atari games like Space Invaders and Breakout, often outperforming expert human players. And it stunned the Go community when its AlphaGo program beat world-champion Go player Lee Sedol in 2016.


A research team at Lehigh University has built a deep RL algorithm to play the beer game. And Opex Analytics developers have been working on a user interface for the beer game that showcases the algorithm -- and can also be played in the usual way (for free) in classrooms or other settings. (We’ll be back on these pages in a few months once the game is released.)

If you’ve ever taken a supply chain class, you’ve probably played the beer game. The game involves four players representing four stages of a supply chain: retailer, wholesaler, distributor, and manufacturer. Each player must decide how much to order from its upstream partner, given the order quantity that it received in the current time period. The players’ goal is to minimize the total cost of the supply chain, but they cannot communicate with each other during the game. 

RL in a Nutshell 

So, what is RL? In short, it is a type of machine learning algorithm that decides what action to take, at every time period, based on the state that the system is in, in order to maximize some long-term reward. These algorithms begin by trying more or less random actions, from which they observe the resulting reward and then learn to improve their actions in the future. Critically, these algorithms are not programmed to execute any pre-determined strategy -- they learn that on their own. 
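
The trial-and-error loop described above can be sketched in a few lines of Python. This is only an illustration: it uses simple tabular Q-learning with epsilon-greedy exploration rather than the deep RL methods discussed in this column, and the toy inventory environment (`toy_step`), its cost values, and every parameter setting are assumptions made up for the example.

```python
import random


def train_q_learning(step, n_states, n_actions, episodes=500, horizon=20,
                     alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    `step(state, action)` is a hypothetical environment function that
    returns (next_state, reward).
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = rng.randrange(n_states)  # start each episode in a random state
        for _ in range(horizon):
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)  # explore: random action
            else:
                # exploit: best action seen so far in this state
                action = max(range(n_actions), key=lambda a: q[state][a])
            next_state, reward = step(state, action)
            # Move the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def toy_step(state, action):
    """Toy environment (assumed): state is on-hand inventory (0..4),
    action is units ordered (0..2), and demand is always 1 unit."""
    on_hand = min(4, state + action)   # shelf holds at most 4 units
    sold = min(on_hand, 1)
    next_state = on_hand - sold
    # Assumed costs: 1 per unit left over, 5 per unit of unmet demand.
    reward = -(1.0 * next_state + 5.0 * (1 - sold))
    return next_state, reward


q_table = train_q_learning(toy_step, n_states=5, n_actions=3)
policy = [max(range(3), key=lambda a: q_table[s][a]) for s in range(5)]
print(policy)  # order quantity the trained agent prefers in each state
```

Nothing in the code tells the agent how much to order. It starts out knowing nothing, and the preference for ordering exactly one unit when the shelf is empty emerges from the rewards alone.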

For example, when the DeepMind RL algorithm plays Space Invaders, the system state is just the pixels on the screen, which the algorithm parses to determine the locations of the enemy invaders, the player’s cannon, and so on. The actions are whether to move left, right, or neither, and whether to fire. And the reward is the score. DeepMind didn’t program the algorithm to hide beneath the shields or to target the high-value enemies -- it learned that on its own. 

RL and the Beer Game 

In our beer game RL algorithm, the system state consists of the player’s current on-hand and on-order inventory, its backorders, and its inbound order quantity. The action is the outbound order quantity, and the reward is the negative of the total supply chain cost, so that maximizing the reward minimizes the cost.
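
To make the state and reward concrete, here is a minimal sketch in Python. The class name `StageState`, the function `supply_chain_reward`, and all cost values are hypothetical and chosen for illustration; they are not taken from the research code. The cost values assume a holding cost at every stage and a backorder cost only at the retailer.

```python
from dataclasses import dataclass

STAGES = ("retailer", "wholesaler", "distributor", "manufacturer")


@dataclass(frozen=True)
class StageState:
    """What one player observes (field names are illustrative)."""
    on_hand: int        # units in stock
    on_order: int       # units ordered but not yet received
    backorders: int     # unmet demand owed to the downstream partner
    inbound_order: int  # order quantity just received from downstream


def supply_chain_reward(states, holding_cost, backorder_cost):
    """Reward = negative of the total cost summed over all four stages."""
    total_cost = sum(
        holding_cost[s] * states[s].on_hand
        + backorder_cost[s] * states[s].backorders
        for s in STAGES
    )
    return -total_cost


# Assumed cost parameters, for illustration only.
holding_cost = {s: 0.5 for s in STAGES}
backorder_cost = {"retailer": 1.0, "wholesaler": 0.0,
                  "distributor": 0.0, "manufacturer": 0.0}

states = {
    "retailer": StageState(on_hand=0, on_order=4, backorders=3, inbound_order=8),
    "wholesaler": StageState(on_hand=10, on_order=0, backorders=0, inbound_order=4),
    "distributor": StageState(on_hand=4, on_order=2, backorders=0, inbound_order=0),
    "manufacturer": StageState(on_hand=0, on_order=6, backorders=6, inbound_order=2),
}
reward = supply_chain_reward(states, holding_cost, backorder_cost)
print(reward)
```

Note that the reward depends on every stage's inventory, while each player's state holds only its own figures.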

Our beer game RL algorithm borrows ideas from the DeepMind research but extends the approach to account for the significant ways that the beer game is different from Atari and Go. For example, the DeepMind approach is designed for single-agent games (like Space Invaders) or competitive, zero-sum games (like Go), whereas the beer game is a cooperative, non-zero-sum game (since the players are trying to maximize the team’s performance). Moreover, in Atari and Go, the player knows the state of the system at each time step, whereas in the beer game, the other players’ inventory levels -- and their associated rewards -- are unknown to each individual player until the game ends.

Any RL algorithm needs to be “trained” -- the process of choosing actions, observing the rewards, and then choosing better actions next time. Before our algorithm was trained, it was, unsurprisingly, a terrible beer game player. For example, when the RL algorithm played the role of the wholesaler, it ordered far too much, resulting in huge inventory levels at the wholesaler and big backorders upstream. (Inventory shows as brown boxes and positive numbers in the screenshot below, while backorders show as red boxes and negative numbers.)


After the algorithm played the game a few thousand times, it got better. It learned to keep the inventory levels lower at the wholesaler -- in fact, it started to keep them too low:

But after playing for a while longer, the algorithm learned that the optimal strategy is a so-called base-stock policy, or order-up-to policy. It learned that it didn’t need much inventory at the wholesaler in order to keep the retailer well-stocked, and that since there is no cost for backorders upstream (only at the retailer), keeping inventory levels low or even negative there can be effective:

(Note that the screenshot shows inventory levels at the end of a period. An inventory level of 0 is ideal -- it means we met the demand perfectly, with no extra inventory and no shortages.)
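
For readers who have not met a base-stock (order-up-to) policy before, the textbook rule is short enough to write out. The sketch below shows the standard form of the policy, not the RL algorithm's learned behavior, and the base-stock level of 12 is an arbitrary value chosen for the example.

```python
def base_stock_order(base_stock_level, on_hand, on_order, backorders):
    """Order enough to raise the inventory position to the base-stock level.

    Inventory position = on-hand + on-order - backorders.
    """
    inventory_position = on_hand + on_order - backorders
    return max(0, base_stock_level - inventory_position)


# Position is 3 + 4 - 0 = 7, so order 5 to get back up to 12.
print(base_stock_order(12, on_hand=3, on_order=4, backorders=0))

# Position is 0 + 2 - 5 = -3, so order 15.
print(base_stock_order(12, on_hand=0, on_order=2, backorders=5))

# Position is already above 12, so order nothing.
print(base_stock_order(12, on_hand=10, on_order=4, backorders=0))
```

The whole policy comes down to a single number per player, the base-stock level.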


We didn’t explicitly tell our code to follow a base-stock policy or to exploit the free upstream stockouts -- it learned that on its own.


Here’s a video that shows how the algorithm learns as it trains:



Where to Next?


It’s still early days for this research: For now, our algorithm can only handle simple demand structures, and we can’t yet operate more than one player at a time using an RL agent. But we think of this as a proof-of-concept demonstration that machine learning algorithms can recommend good actions in realistic supply chain environments. In the future, we’ll be applying these ideas to more complicated decisions in more complicated supply chains, like yours.


Visit the project website for updates on our research. You can also read the current version of our research paper on arXiv. And if you’d like to be notified when the Opex Analytics Beer Game is released, visit this page and leave us your e-mail address.

Larry Snyder Ph.D.

Senior Research Associate

Larry Snyder is a Senior Research Associate at Opex Analytics. He is also an Associate Professor of Industrial and Systems Engineering at Lehigh University. Larry co-authored Fundamentals of Supply Chain Theory.




