Welcome on the webpage of the reading group "Stochastic Networks and Learning" organized by the Stochastic Operations Research group at the Eindhoven University of Technology!
Learning techniques become more and more relevant to the design and optimization of stochastic networks as the size and complexity of these systems increases. In this reading group, we propose to share our findings on this topic by organizing biweekly meetings where a group member presents a paper of their choice.
This page lists the past and future papers presented at the reading group.
Organizers: Céline Comte and Sem Borst
Reinforcement learning (RL) appeals to many researchers in recent years because of its generality. It is an approach to machine intelligence that learns to achieve the given goal by trial-and-error iterations with its environment. This paper proposes a case-based reinforcement learning algorithm (CRL) for dynamic inventory control in a multi-agent supply-chain system. Traditional time-triggered and event-triggered ordering policies remain popular because they are easy to implement. But in the dynamic environment, the results of them may become inaccurate causing excessive inventory (cost) or shortage. Under the condition of nonstationary customer demand, the S value of (T, S) and (Q, S) inventory review method is learnt using the proposed algorithm for satisfying target service level, respectively. Multi-agent simulation of a simplified two-echelon supply chain, where proposed algorithm is implemented, is run for a few times. The results show the effectiveness of CRL in both review methods. We also consider a framework for general learning method based on proposed one, which may be helpful in all aspects of supply-chain management (SCM). Hence, it is suggested that well-designed ‘‘connections” are necessary to be built between CRL, multi-agent system (MAS) and SCM.
This paper studies strategic interaction in networks. We focus on games of strategic substitutes and strategic complements, and departing from previous literature, we do not assume particular functional forms on players' payoffs. By exploiting variational methods, we show that the uniqueness, the comparative statics, and the approximation of a Nash equilibrium are determined by a precise relationship between the lowest eigenvalue of the network, a measure of players' payoff concavity, and a parameter capturing the strength of the strategic interaction among players. We apply our framework to the study of aggregative network games, games of mixed interactions, and Bayesian network games.
A large number of stochastic networks including loss networks and certain queueing networks have product-form steady-state probabilities. However, for most practical networks, evaluating the system performance is a difficult task due to the presence of a normalization constant. We propose a new framework based on probabilistic graphical models to tackle this task. Specifically, we use factor graphs to model the stationary distribution of a network. For networks with arbitrary topology, we can apply efficient message-passing algorithms like the sum-product algorithm to compute the exact or approximate marginal distributions of all state variables and related performance measures such as blocking probabilities. Through extensive numerical experiments, we show that the sum-product algorithm returns very accurate blocking probabilities and greatly outperforms the reduced load approximation for loss networks with a variety of topologies. The factor graph model also provides a promising approach for analyzing product-form queueing networks.
The newsvendor problem is one of the most basic and widely applied inventory models. If the probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may not be optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately, as is typically done, and does not require knowledge of the probability distributions of the demand. One can view the optimal order quantities as the labels in the deep neural network. However, unlike most deep learning applications, our model does not know the true labels (order quantities), but rather learns them during the training. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and machine learning approaches, especially for demands with high volatility. Finally, in order to show how this approach can be used for other inventory optimization problems, we provide an extension for (r, Q) policies.
We design a simple reinforcement learning agent that, with a specification only of suitable internal state dynamics and a reward function, can operate with some degree of competence in any environment. The agent maintains visitation counts and value estimates for associated state-action pair. The value function is updated incrementally in response to temporal differences and optimistic boosts that encourage exploration. The agent executes actions that are greedy with respect to this value function. We establish a regret bound demonstrating convergence to near-optimal per-period performance, where the time taken to achieve near-optimality is polynomial in the number of internal states and actions, as well as the reward averaging time of the best policy within the reference policy class, which is comprised of those that depend on history only through the agent's internal state. Notably, there is no further dependence on the number of environment states or mixing times associated with other policies or statistics of history. Our result sheds light on the potential benefits of (deep) representation learning, which has demonstrated the capability to extract compact and relevant features from high-dimensional interaction histories.
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error.
In this work we develop a framework for augmenting online algorithms with a machine learned oracle to achieve competitive ratios that provably improve upon unconditional worst case lower bounds when the oracle has low error. Our approach treats the oracle as a complete black box, and is not dependent on its inner workings, or the exact distribution of its errors.
We apply this framework to the traditional caching problem -- creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead we show how to modify the Marker algorithm to take into account the oracle's predictions, and prove that this combined approach achieves a competitive ratio that both (i) decreases as the oracle's error decreases, and (ii) is always capped by O(logk), which can be achieved without any oracle input. We complement our results with an empirical evaluation of our algorithm on real world datasets, and show that it performs well empirically even using simple off-the-shelf predictions.
The successful launch of electric vehicles (EVs) depends critically on the availability of convenient and economic charging facilities. The problem of scheduling of large-scale charging of EVs by a service provider is considered. A Markov decision process model is introduced in which EVs arrive randomly at a charging facility with random demand and completion deadlines. The service provider faces random charging costs, convex non-completion penalties, and a peak power constraint that limits the maximum number of simultaneous activation of EV chargers.
Formulated as a restless multi-armed bandit problem, the EV charging problem is shown to be indexable. A closed-form expression of the Whittle's index is obtained for the case when the charging costs are constant. The Whittle's index policy, however, is not optimal in general. An enhancement of the Whittle's index policy based on spatial interchange according to the less laxity and longer processing time principle is presented. The proposed policy outperforms existing charging algorithms, especially when the charging costs are time varying.
We formulate a model of sequential decision making, dubbed the Goal Prediction game, to study the extent to which an overseeing adversary can predict the final goal of an agent who tries to reach that goal quickly, through a sequence of intermediate actions. Our formulation is motivated by the increasing ubiquity of large-scale surveillance and data collection infrastructures, which can be used to predict an agent’s intentions and future actions, despite the agent’s desire for privacy.
Our main result shows that with a carefully chosen agent strategy, the probability that the agent’s goal is correctly predicted by an adversary can be made inversely proportional to the time that the agent is willing to spend in reaching the goal, but cannot be made any smaller than that. Moreover, this characterization depends on the topology of the agent’s state space only through its diameter.
Recently, efforts were made to standardize Signal Phase and Timing (SPaT) messages. Such messages contain the current signal phase with a prediction for the corresponding residual time for all approaches of a signalized intersection. Hence, the information can be utilized for the motion planning of human-driven/autonomously operated individual or public transport vehicles. Consequently, this leads to a more homogeneous traffic flow and a smoother speed profile. Unfortunately, adaptive signal control systems make it difficult to predict the SPaT information accurately. In this paper, we propose a novel machine learning approach to forecast the time series of residual times. A prediction framework that utilizes a Random Survival Forest (RSF) and a Long-Short-Term-Memory (LSTM) neural network is implemented. The machine learning models are compared to a Linear Regression (LR) model. For a proof of concept, the models are applied to a case study in the city of Zurich. Results show that the machine learning models outperform the LR approach, and in particular, the LSTM neural network is a promising tool for the enhancement of SPaT messages.
Novel advanced policy gradient (APG) methods, such as Trust Region policy optimization and Proximal policy optimization (PPO), have become the dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for notoriously difficult queueing network control problems is a Markov decision problem (MDP) that has three features: infinite state space, unbounded costs, and long-run average cost objective. We extend the theoretical framework of these APG methods for such MDP problems. The resulting PPO algorithm is tested on a parallel-server system and large-size multiclass queueing networks. The algorithm consistently generates control policies that outperform state-of-art heuristics in literature in a variety of load conditions from light to heavy traffic. These policies are demonstrated to be near-optimal when the optimal policy can be computed.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function via sampling. First, we use a discounted relative value function as an approximation of the relative value function. Second, we propose regenerative simulation to estimate the discounted relative value function. Finally, we incorporate the approximating martingale-process method into the regenerative estimator.
Consider n random variables forming a Markov random field (MRF). The true model of the MRF is unknown, and it is assumed to belong to a binary set. The objective is to sequentially sample the random variables (one-at-a-time) such that the true MRF model can be detected with the fewest number of samples, while in parallel, the decision reliability is controlled. The core element of an optimal decision process is a rule for selecting and sampling the random variables over time. Such a process, at every time instant and adaptively to the collected data, selects the random variable that is expected to be most informative about the model, rendering an overall minimized number of samples required for reaching a reliable decision. The existing studies on detecting MRF structures generally sample the entire network at the same time and focus on designing optimal detection rules without regard to the data-acquisition process. This paper characterizes the sampling process for general MRFs, which, in conjunction with the sequential probability ratio test, is shown to be optimal in the asymptote of large n. The critical insight in designing the sampling process is devising an information measure that captures the decisions' inherent statistical dependence over time. Furthermore, when the MRFs can be modeled by acyclic probabilistic graphical models, the sampling rule is shown to take a computationally simple form. Performance analysis for the general case is provided, and the results are interpreted in several special cases: Gaussian MRFs, non-asymptotic regimes, connection to Chernoff's rule to controlled (active) sensing, and the problem of cluster detection.
Graph learning is an inference problem of estimating connectivity of a graph from a collection of epidemic cascades, with many useful applications in the areas of online/offline social networks, p2p networks, computer security, and epidemiology. We consider a practical scenario when the information of cascade samples are partially observed in the independent cascade (IC) model. For the graph learning problem, we propose an efficient algorithm that solves a localized version of computationally-intractable maximum likelihood estimation through approximations in both temporal and spatial aspects. Our algorithm iterates the operations of recovering missing time logs and inferring graph connectivity, and thereby progressively improves the inference quality. We study the sample complexity, which is the number of required cascade samples to meet a given inference quality, and show that it is asymptotically close to a lower bound, thus near-order-optimal in terms of the number of nodes. We evaluate the performance of our algorithm using five real-world social networks, whose size ranges from 20 to 900, and demonstrate that our algorithm performs better than other competing algorithms in terms of accuracy while maintaining fast running time.
Once or twice a month, a different group member gives a lecture on a document (research paper, survey paper, book chapter...) of their choice. The other group members are encouraged to ask questions and initiate discussions during the lecture. The average workload per member is very light since they only present a paper about once every [number of participants]/2 months.
The objective is to make the presentation as interactive as possible despite the online format. Therefore, we propose that the speaker uses one of the following two media: