Artificial intelligence makes stock management more efficient

Key insights

  • First successful application of deep reinforcement learning to logistics problems
  • Machine learning algorithms can be used to make logistics cheaper and more sustainable
  • Multimodal transport and dual sourcing become feasible in practice

What do AlphaGo, a robot that plays table tennis, a program that recognises emotions, and self-driving cars all have in common, besides the fact that all of them appeal to our imagination and are guaranteed to hit the headlines? They are all applications that use deep reinforcement learning (DRL). Intelligent stock management has recently joined that list. Along with two other colleagues, doctoral researcher Joren Gijsbrechts and Professor Robert Boute have demonstrated that DRL can be applied successfully to problems that seemed all but unsolvable up to now. For the very first time!

Practical limitations

“The request came from a major player in the FMCG sector”, Robert tells us. “They wanted to take as much freight as possible off the road and transport it by rail to reduce their CO2 footprint. Rail transport has a lower CO2 impact, but it is slower and less flexible. If you succeed in creating a smart combination of the two transport channels, however, you bring together the ecological benefits of rail and the flexibility of road transport, enabling you to react fast to fluctuating demand. A modal shift of this kind appears simple, but it isn’t. The same applies to dual sourcing, i.e. sourcing from two suppliers: one local, with short delivery times but higher prices, and one foreign, cheaper but with longer delivery times. All kinds of mathematical models have been developed to solve supply issues like these, but however elegant they are, they remain largely academic. As soon as you try to use real data, the models hit their limits. Cost functions, for example, are not always linear in practice. They are complex, too complex to capture in a neat mathematical formula.”
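
To see why such cost functions defy a neat formula: as soon as goods move in containers, the cost of a shipment jumps in steps instead of growing smoothly with volume. A minimal sketch, with invented capacities and rates:

```python
import math

CONTAINER_CAPACITY = 100   # units per container (invented)
COST_PER_CONTAINER = 250   # euros per container (invented)

def shipping_cost(quantity):
    """Cost of shipping `quantity` units: you pay per container, not per unit,
    so the cost jumps at every container boundary instead of rising linearly."""
    return math.ceil(quantity / CONTAINER_CAPACITY) * COST_PER_CONTAINER

# Shipping 101 units costs as much as shipping 200: two full containers either way.
assert shipping_cost(101) == shipping_cost(200) == 500
```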

Learning from feedback

“The Supply Chain Optimization Faculty Summit organised by Amazon was a eureka moment”, Robert recalls. “Thirty academics from all over the world gathered to discuss optimisation of the logistics chain. We realised that Amazon is increasingly committed to machine learning, specifically reinforcement learning (RL), which means learning based on feedback from the environment.”

In DRL, a deep neural network (see box text) is used to train an RL algorithm. The best-known DRL algorithm is the one used for AlphaGo, the program that beat the human world champion at the board game Go. RL is one of the three categories of machine learning, alongside supervised and unsupervised learning, and it is typically used in robotics. Instead of programming a given action or series of actions from A to Z, right down to the last detail, you enable the robot to teach itself by rewarding good behaviour. That is the best option if a robot like this is supposed to interact with an environment it is not familiar with, and if so many different situations can arise that it is impossible to program the entire decision tree with the appropriate response to each situation or state. Then it is a question of enabling the robot, or more generally a system, to discover by trial and error, guided by feedback, which action is best in which situation to reach the intended goal. In other words, to find out which action generates the biggest reward, expressed as a numerical value.
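
For a flavour of that feedback loop, here is a minimal sketch of tabular Q-learning, one of the simplest RL algorithms (not the one used in the study). The system keeps a table of estimated rewards per situation and action, and nudges each estimate towards the feedback it receives; all parameter values are illustrative.

```python
import random
from collections import defaultdict

# Q[state][action] estimates the long-run reward of taking `action` in `state`.
Q = defaultdict(lambda: defaultdict(float))
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1  # learning rate, discount, exploration rate

def choose_action(state, actions):
    """Mostly exploit the best-known action; occasionally explore a random one."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def learn(state, action, reward, next_state, next_actions):
    """Nudge the estimate towards the reward plus the best estimated future value."""
    best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
```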

Neural networks

Artificial neural networks (ANNs) are mathematical models loosely inspired by the structure and functioning of the human brain. The aim of an ANN is the same as that of a biological brain: to solve problems and learn from mistakes. Just as our brain is made up of neurons that exchange electrochemical signals through synapses, ANNs consist of neurons or ‘nodes’ that are connected to each other.

A neural network consists of several layers. Each layer processes the data (or part of it) and sends the result of that processing work to the next layer, ultimately producing a certain output.

There are various forms of network architecture. Typically, there is an input layer and an output layer of neurons, with one or more hidden layers in between. Deep neural networks have several hidden layers. 

What is the advantage of an ANN? Each node on its own can only execute a few simple operations (essentially weighted additions and multiplications of its inputs), but by combining a very large number of nodes in a network, an ANN can approximate highly complicated functions, thereby arriving at far more complex insights than most traditional algorithms.
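
A minimal sketch of that idea: a tiny fully connected network, built from nothing more than those simple operations, turns an input into an output. The layer sizes and random weights are arbitrary; a real network would learn its weights from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny deep network: input layer of 4, two hidden layers of 8, output of 1.
# Each layer is just a weighted sum (matrix product) plus a simple nonlinearity.
layer_sizes = [4, 8, 8, 1]
weights = [rng.standard_normal((m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Pass the input through every layer; each layer feeds the next."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)   # ReLU: a very simple nonlinearity
    return x @ weights[-1] + biases[-1]  # final layer produces the output

print(forward(np.array([1.0, 0.5, -0.3, 2.0])))
```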

Better and better decisions

“Let’s apply that to a specific issue: a company wants to optimise its supply lines by combining rail and road transport in the best possible way, which is to say at minimal cost, with as small an ecological footprint as possible and without compromising on customer satisfaction”, Robert says. “An extra complicating factor is that transportation is done in containers, which leads to complicated cost functions. There is a vast number of possible combinations of stock levels, and a different choice may be recommended for each combination or situation. The system knows the state of the environment at all times, i.e. the stock on site and the stock in transit, and on that basis the algorithm decides what quantity needs to be transported by road and what quantity by rail. In turn, this decision affects the environment – the stock – as well as the costs you want to minimise, the ecological footprint and the service level, and thus influences the decision to be made at a later point in time. The costs are calculated for every decision. When training the algorithm (and the neural network), the system learns to perform better and better, i.e. to make decisions that incur ever lower costs.”
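
Cast as code, the loop Robert describes might look like the sketch below: the state is the stock on site plus the stock in transit, the action is the quantity shipped per channel, and the feedback is the cost incurred. Every parameter here (lead times, costs, demand) is invented for illustration and far simpler than the study’s real cost functions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented parameters: rail is cheap but slow, road is expensive but fast.
RAIL_LEAD, ROAD_LEAD = 3, 1           # periods until a shipment arrives
RAIL_COST, ROAD_COST = 1.0, 3.0       # cost per unit shipped
HOLD_COST, SHORTAGE_COST = 0.5, 10.0  # cost per unit held / short, per period

def step(on_hand, pipeline, ship_rail, ship_road):
    """One period of the loop: receive arrivals, place orders, meet demand, pay."""
    on_hand += pipeline[0]                   # receive what arrives this period
    pipeline = np.append(pipeline[1:], 0.0)  # the rest moves one period closer
    pipeline[RAIL_LEAD - 1] += ship_rail     # new rail order, arrives in 3 periods
    pipeline[ROAD_LEAD - 1] += ship_road     # new road order, arrives next period
    demand = rng.poisson(20)                 # random customer demand
    on_hand -= demand                        # negative on_hand = unmet demand
    cost = (ship_rail * RAIL_COST + ship_road * ROAD_COST
            + HOLD_COST * max(on_hand, 0) + SHORTAGE_COST * max(-on_hand, 0))
    return on_hand, pipeline, cost           # new state plus the period's cost

# The state the algorithm observes: stock on site plus everything in transit.
on_hand, pipeline = 40, np.zeros(RAIL_LEAD)
on_hand, pipeline, cost = step(on_hand, pipeline, ship_rail=20, ship_road=5)
```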

Intuition and deferred rewards

“In essence, our logistical optimisation problem is no different to the optimisation problem solved by AlphaGo or self-driving cars”, Joren adds. “The latter need to travel from A to B as fast as possible without accidents. RL is particularly well suited to this as well, because the environment in which the cars operate is by definition unknown. An infinite number of different situations may arise. The cars need to develop a kind of intuition about safe and unsafe situations, situations in which they need to brake or make evasive manoeuvres, etc. That intuition is built up by training the algorithm, so that the cars will also react appropriately in situations that did not arise in training.”

He points out that RL may also involve deferred rewards. “A given action in a given situation not only influences the immediate reward, but also impacts the situation resulting from that action, and therefore also future rewards. For AlphaGo, a bad move at time t might mean losing the game five moves later. In our stock problem, certain decisions can generate future costs. So the present value of the costs is what needs to be minimised. The algorithm teaches the system to take that into account.” 
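
In RL this trade-off is typically handled by discounting: future costs count, but the further away they are, the less weight they carry. A minimal sketch, with an illustrative discount factor:

```python
def present_cost(future_costs, gamma=0.99):
    """Discounted present value of a stream of costs, one per future period.
    gamma < 1 makes costs further in the future weigh slightly less;
    minimising this value trades off immediate against future costs."""
    return sum(cost * gamma ** t for t, cost in enumerate(future_costs))

# A decision that looks cheap now but triggers a large cost later can still
# be worse than a slightly more expensive decision today.
print(present_cost([10, 0, 0, 50]))  # ~58.5
print(present_cost([15, 0, 0, 0]))   # 15.0
```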

In partnership with Google

There are various DRL algorithms and network architectures. “We used a fully connected neural network and the advanced Asynchronous Advantage Actor-Critic or A3C algorithm, one of the most popular recently developed DRL algorithms, to solve this problem”, Joren explains. “The Google Cloud Platform provided the processing power. Training an algorithm (and neural network) of this kind means running it hundreds of times, and an ordinary PC can’t do that. You need a supercomputer. The fact that we can now use DRL successfully is due not only to the algorithms getting better and better, but also to the availability of vastly more processing power.”
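
For a flavour of what A3C computes, the sketch below shows the actor-critic update at its core, in a heavily simplified, single-worker form (the real algorithm runs many such workers asynchronously, hence the name). It uses PyTorch; all sizes and hyperparameters are illustrative, and this is not the study’s implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """One network, two heads: the actor proposes actions, the critic scores states."""
    def __init__(self, n_state_features, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_state_features, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action preferences
        self.value_head = nn.Linear(hidden, 1)           # critic: expected return

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        return dist, self.value_head(h)

def update(model, optimizer, state, action, cost, next_state, gamma=0.99):
    """One actor-critic learning step; reward is minus the cost to be minimised."""
    dist, value = model(state)
    with torch.no_grad():
        _, next_value = model(next_state)
        target = -cost + gamma * next_value                    # bootstrapped return
    advantage = target - value                                 # better than expected?
    policy_loss = -dist.log_prob(action) * advantage.detach()  # reinforce good actions
    value_loss = advantage.pow(2)                              # sharpen the critic
    optimizer.zero_grad()
    (policy_loss + value_loss).sum().backward()
    optimizer.step()

model = ActorCritic(n_state_features=8, n_actions=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```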

The team found that the A3C algorithm was successful in developing a good stock management strategy that achieved the intended goals for a genuine problem with real data and realistic cost functions. Then, of course, the question is how good the solution learned by the algorithm is. How far is this solution from the optimum? “To be able to gauge that, we had to do more than just apply the algorithm to the real-life case. We also used it on a few simple stock management questions whose outcome could be calculated exactly. And what did we find? In simple situations for which robust academic models have been developed, A3C doesn’t always perform better”, Robert tells us. “Actually that’s rather reassuring”, he laughs. “It means that all that academic work hasn’t been for nothing. But as soon as you face a realistic problem, A3C performs just as well or even better.”

Better than state-of-the-art

Robert is enthusiastic: “This is the first study that has irrefutably proved that DRL algorithms can be used to solve complex logistical problems that are too difficult to model. The problem we addressed here has been known for years, but the existing academic models leave much to be desired. That is also the reason why many companies have not yet made the transition to multimodal transport or dual sourcing. They lacked the practical tools.”

“The great thing about DRL is that you can start from scratch”, adds Joren. “As academics, we have focused on stylised models and rules up to now, and if a company wanted a solution to a real-life problem, we had to try to develop new ones. DRL algorithms, by contrast, teach themselves to find good stock management rules. The fact that you can generalise DRL to any company-specific situation is incredibly valuable.”

“The A3C model enables companies to perform better than they do with state-of-the-art academic models. They can come closer to the optimum, enabling them to take better decisions and make their logistics more efficient”, Robert concludes.

Source: The paper ‘Can Deep Reinforcement Learning Improve Inventory Management? Performance and Implementation of Dual Sourcing-Mode Problems’ is published on the SSRN website. You can also request a copy from the authors.

About the authors
Joren Gijsbrechts is a doctoral researcher at the Faculty of Economics and Business at KU Leuven. Robert Boute is a full professor in Operations Management at Vlerick Business School and the Faculty of Economics and Business at KU Leuven. Jan A. Van Mieghem is the Harold L. Stuart Distinguished Professor in Managerial Economics and a professor in Operations Management at the Kellogg School of Management at Northwestern University (USA). Dennis J. Zhang is an assistant professor of Operations and Manufacturing Management at Washington University’s Olin Business School in St. Louis (USA).
