Welcome to the webpage of the reading group "Stochastic Networks and Learning" organized by the Stochastic Operations Research group at the Eindhoven University of Technology!
Learning techniques are becoming increasingly relevant to the design and optimization of stochastic networks as the size and complexity of these systems increase. In this reading group, we propose to share our findings on this topic by organizing biweekly meetings where a group member presents a paper of their choice.
This page lists the past and future papers presented at the reading group.
Once or twice a month, a member of the reading group gives a lecture on a document (research paper, survey paper, book chapter...) of their choice. During this lecture, the rest of the group is encouraged to ask questions and initiate discussions. The average workload per member is very light, since each member presents a paper only about once every [number of participants]/2 months.
The objective is to make the presentation as interactive as possible despite the online format. Therefore, we propose that the speaker uses one of the following two media:
Organizers: Elene Anton, Jaap Storm, Céline Comte, and Sem Borst
Intelligent traffic light control is critical for an efficient transportation system. While existing traffic lights are mostly operated by hand-crafted rules, an intelligent traffic light control system should adjust dynamically to real-time traffic. There is an emerging trend of using deep reinforcement learning techniques for traffic light control, and recent studies have shown promising results. However, existing studies have not yet tested these methods on real-world traffic data, and they focus only on the rewards without interpreting the learned policies. In this paper, we propose a more effective deep reinforcement learning model for traffic light control. We test our method on a large-scale real traffic dataset obtained from surveillance cameras. We also show some interesting case studies of policies learned from the real data.
We first review recent challenges in reinforcement learning applied to stochastic networks.
We then focus on reinforcement learning control problems under the average-reward criterion in which non-zero rewards are both sparse and rare, that is, they occur in very few states and the steady-state probability of observing them is extremely small. For such problems, we propose a new approach that exploits prior knowledge of the sparse structure of the environment. The core idea is to use renewal theory and Fleming-Viot particle systems to construct estimators of quantities relevant to reinforcement learning, for which we provide theoretical guarantees on the estimation error.
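To give a concrete flavour of the Fleming-Viot idea, here is a small, self-contained simulation sketch (our illustration, not the construction from the talk): N particles evolve as copies of a toy birth-death chain, and any particle that hits the frequent "absorbing" state is relocated to the position of another particle, so that the empirical occupation measure approximates the distribution of the chain conditioned on staying in the rarely visited region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (our assumption, not the talk's environment): a random walk
# on {0, 1, ..., K} that drifts towards 0, so states near K are rare.
K = 20
p_up = 0.3        # probability of moving up; otherwise the walk moves down
N = 100           # number of Fleming-Viot particles
T = 5_000         # number of simulation steps

def walk_step(x):
    """One transition of the underlying chain, reflected at K."""
    return min(x + 1, K) if rng.random() < p_up else x - 1

# Fleming-Viot particle system: particles evolve independently, and any
# particle that hits the absorbing state 0 instantly jumps to the position
# of another, uniformly chosen, particle.
particles = rng.integers(1, K + 1, size=N)
occupation = np.zeros(K + 1)

for _ in range(T):
    for i in range(N):
        x = walk_step(particles[i])
        if x == 0:                       # absorbed: resample from the others
            j = rng.integers(N - 1)
            x = particles[j if j < i else j + 1]
        particles[i] = x
    np.add.at(occupation, particles, 1)  # accumulate the occupation measure

# The normalized occupation measure approximates the law of the walk
# conditioned on avoiding 0, the kind of rare-region quantity that
# renewal/Fleming-Viot estimators target.
print(np.round(occupation / occupation.sum(), 4))
```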
Paper 1: Power networks, responsible for transporting electricity across large geographical regions, are complex infrastructures on which modern life critically depends. Variations in demand and production profiles, with increasing renewable energy integration, as well as the high-voltage network technology, constitute a real challenge for human operators when optimizing electricity transportation while avoiding blackouts. Motivated to investigate the potential of AI methods in enabling adaptability in power network operation, we have designed an L2RPN challenge to encourage the development of reinforcement learning solutions to key problems present in next-generation power networks. The NeurIPS 2020 competition was well received by the international community, attracting over 300 participants worldwide.
The main contribution of this challenge is our proposed comprehensive 'Grid2Op' framework, and associated benchmark, which plays out realistic sequential network operation scenarios. The Grid2Op framework, which is open-source and easily re-usable, allows users to define new environments with its companion GridAlive ecosystem. Grid2Op relies on existing non-linear physical power network simulators and lets users create a series of perturbations and challenges that are representative of two important problems: a) the uncertainty resulting from the increased use of unpredictable renewable energy sources, and b) the robustness required with contingent line disconnections. In this paper, we give the competition highlights. We present the benchmark suite and analyse the winning solutions, including one super-human performance demonstration. We propose our organizational insights for a successful competition and conclude on open research avenues. Given the challenge success, we expect our work will foster research to create more sustainable solutions for power network operations.
Paper 2: Safe and reliable electricity transmission in power grids is crucial for modern society. It is thus quite natural that there has been a growing interest in the automatic management of power grids, exemplified by the Learning to Run a Power Network Challenge (L2RPN), modeling the problem as a reinforcement learning (RL) task. However, it is highly challenging to manage a real-world scale power grid, mostly due to the massive scale of its state and action spaces. In this paper, we present an off-policy actor-critic approach that effectively tackles the unique challenges in power grid management by RL, adopting a hierarchical policy together with an afterstate representation. Our agent ranked first in the latest challenge (L2RPN WCCI 2020), being able to avoid disastrous situations while maintaining the highest level of operational efficiency in every test scenario. This paper provides a formal description of the algorithmic aspect of our approach, as well as further experimental studies on diverse power grids.
Motivated by diverse application areas such as healthcare, call centers, and crowdsourcing, we consider the design and operation of service systems that process tasks with types that are ex ante unknown, and employ servers with different skill sets. Our benchmark model involves two types of tasks, Easy and Hard, and servers that are either Junior or Senior in their abilities. The service provider determines a resource allocation policy, i.e., how to assign tasks to servers over time, with the goal of maximizing the system’s long-term throughput. Information about a task’s type can only be obtained while serving it. In particular, the more time a Junior server spends on a task without service completion, the higher her belief that the task is Hard and thus needs to be rerouted to a Senior server. This interplay between service time and task-type uncertainty implies that the system’s resource allocation policy and staffing levels implicitly determine how the provider prioritizes between learning and actually serving. We show that the performance loss due to the uncertainty in task types can be significant and, interestingly, that the system’s stability region is largely dependent on the rate at which information about tasks’ types is generated. Furthermore, we consider endogenizing the servers’ capabilities: assuming that training is costly, we explore the problem of jointly optimizing over the training levels of the system’s server pools, the staffing levels, and the resource allocation policy. We find that among optimal designs there always exists one with a “hierarchical” structure, where all tasks are initially routed to the least skilled servers and then progressively move to more skilled ones, if necessary. Comparative statics indicate that uncertainty in task types leads to significantly higher staffing cost and less specialized server pools.
In this paper, we study learning-assisted multi-user scheduling for the wireless downlink. Many scheduling algorithms have been developed that optimize for a plethora of performance metrics; however, a systematic approach across diverse performance metrics and deployment scenarios is still lacking. We address this by developing a meta-scheduler: given a diverse collection of schedulers, we develop a learning-based overlay algorithm (meta-scheduler) that selects the "best" scheduler from amongst these for each deployment scenario. More formally, we develop a multi-armed bandit (MAB) framework for meta-scheduling that assigns and adapts a score for each scheduler to maximize reward (e.g., mean delay, timely throughput, etc.). The meta-scheduler is based on a variant of the Upper Confidence Bound (UCB) algorithm, but adapted to interrupt the queuing dynamics at the base station so as to filter out schedulers that might render the system unstable. We show that the algorithm has a poly-logarithmic regret in the expected reward with respect to a genie that chooses the optimal scheduler for each scenario. Finally, through simulation, we show that the meta-scheduler learns the choice of scheduler that best adapts to the deployment scenario (e.g., load conditions, performance metrics).
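As a toy illustration of the bandit view of meta-scheduling, the sketch below runs plain UCB1 over a set of candidate schedulers; the scheduler names, the reward function and all constants are placeholders, and the paper's queue-aware safeguards against unstable schedulers are omitted.

```python
import math
import random

# Illustrative sketch only: plain UCB1 over a set of candidate schedulers.
SCHEDULERS = ["max-weight", "proportional-fair", "round-robin"]

def run_episode(scheduler_name):
    """Placeholder for deploying a scheduler for one episode and measuring
    its reward (higher is better, e.g. negative mean delay); replace with
    a real simulator."""
    base = {"max-weight": 0.7, "proportional-fair": 0.8, "round-robin": 0.5}
    return base[scheduler_name] + random.gauss(0, 0.1)

counts = {s: 0 for s in SCHEDULERS}
totals = {s: 0.0 for s in SCHEDULERS}

for t in range(1, 2001):
    # Try each scheduler once, then pick the one with the largest UCB score.
    untried = [s for s in SCHEDULERS if counts[s] == 0]
    if untried:
        choice = untried[0]
    else:
        choice = max(
            SCHEDULERS,
            key=lambda s: totals[s] / counts[s]
            + math.sqrt(2 * math.log(t) / counts[s]),
        )
    r = run_episode(choice)
    counts[choice] += 1
    totals[choice] += r

print({s: round(totals[s] / counts[s], 3) for s in SCHEDULERS}, counts)
```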
Paper 1: The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as multi-armed restless bandits. In this paper, we develop QWI, an algorithm based on Q-learning that learns the Whittle indices. The key feature is the deployment of two time-scales, a relatively faster one to update the state-action Q-functions, and a relatively slower one to update the Whittle indices. In our main result, we show that the algorithm converges to the Whittle indices of the problem. Numerical computations show that our algorithm converges much faster than both the standard Q-learning algorithm and neural-network-based approximate Q-learning.
Paper 2: A novel reinforcement learning algorithm is introduced for multi-armed restless bandits with average reward, using the paradigms of Q-learning and the Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments that show excellent empirical performance of the proposed scheme.
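Both papers learn Whittle indices with Q-learning on two timescales. The sketch below illustrates that structure on a toy single restless arm with a discounted criterion (the papers treat average reward); it is our simplified illustration, not either paper's algorithm, and the arm dynamics and step sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy arm: states 0..S-1, action 1 ("active") tends to push the state up,
# action 0 ("passive") lets it decay; the reward equals the state when active.
S, gamma = 5, 0.9
alpha, beta = 0.1, 0.01          # fast (Q) and slow (index) step sizes

def step(x, a):
    if a == 1:
        nx = min(x + 1, S - 1) if rng.random() < 0.7 else max(x - 1, 0)
    else:
        nx = max(x - 1, 0) if rng.random() < 0.7 else x
    return nx, float(x) * a      # reward only when active

lam = np.zeros(S)                # Whittle index estimates, one per state
Q = np.zeros((S, S, 2))          # Q[s_ref]: Q-table of the lam[s_ref]-subsidized arm

for _ in range(100_000):
    s_ref = rng.integers(S)      # reference state whose index we refine
    x = rng.integers(S)
    a = rng.integers(2)          # exploratory (uniformly random) action
    nx, r = step(x, a)
    subsidy = lam[s_ref] if a == 0 else 0.0
    # Fast timescale: Q-learning for the subsidized arm.
    target = r + subsidy + gamma * Q[s_ref, nx].max()
    Q[s_ref, x, a] += alpha * (target - Q[s_ref, x, a])
    # Slow timescale: move lam[s_ref] towards the subsidy that makes the
    # reference state indifferent between active and passive.
    lam[s_ref] += beta * (Q[s_ref, s_ref, 1] - Q[s_ref, s_ref, 0])

print("estimated Whittle indices:", np.round(lam, 2))
```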
With the rapid advance of information technology, network systems have become increasingly complex and hence the underlying system dynamics are often unknown or difficult to characterize. Finding a good network control policy is of significant importance to achieve desirable network performance (e.g., high throughput or low delay). In this work, we consider using model-based reinforcement learning (RL) to learn the optimal control policy for queueing networks so that the average job delay (or equivalently the average queue backlog) is minimized. Traditional approaches in RL, however, cannot handle the unbounded state spaces of the network control problem. To overcome this difficulty, we propose a new algorithm, called Reinforcement Learning for Queueing Networks (RL-QN), which applies model-based RL methods over a finite subset of the state space, while applying a known stabilizing policy for the rest of the states. We establish that the average queue backlog under RL-QN with an appropriately constructed subset can be arbitrarily close to the optimal result. We evaluate RL-QN in dynamic server allocation, routing and switching problems. Simulation results show that RL-QN minimizes the average queue backlog effectively.
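The sketch below illustrates the dispatch rule at the heart of this idea, under our own simplifying assumptions: act with a learned (here, dummy) tabular policy whenever all queue lengths lie inside a finite box, and fall back to a known stabilizing rule, here serve-the-longest-queue, outside it. The training of the model-based agent is omitted.

```python
import numpy as np

B = 10                                   # truncation level per queue

def stabilizing_policy(q):
    """Known stabilizing fallback: serve the longest queue."""
    return int(np.argmax(q))

def learned_policy(q, Q_table):
    """Greedy action from a tabular model-based RL agent trained on the
    truncated state space (training loop omitted)."""
    return int(np.argmin(Q_table[tuple(q)]))   # minimize expected backlog

def act(q, Q_table):
    if all(x <= B for x in q):
        return learned_policy(q, Q_table)      # inside the finite subset
    return stabilizing_policy(q)               # outside: known stable policy

# Example: two queues, a dummy Q-table indexed by (q1, q2, action).
Q_table = np.zeros((B + 1, B + 1, 2))
print(act([3, 7], Q_table))    # inside the box: learned policy
print(act([3, 14], Q_table))   # outside the box: stabilizing fallback
```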
We propose a new diffusion-asymptotic analysis for sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with n time steps, we let the mean reward gaps between actions scale to the order 1/√n so as to preserve the difficulty of the learning task as n grows. In this regime, we show that the behavior of a class of sequentially randomized Markov experiments converges to a diffusion limit, given as the solution of a stochastic differential equation. The diffusion limit thus enables us to derive a refined, instance-specific characterization of the stochastic dynamics of adaptive experiments. As an application of this framework, we use the diffusion limit to obtain several new insights on the regret and belief evolution of Thompson sampling. We show that a version of Thompson sampling with an asymptotically uninformative prior variance achieves nearly-optimal instance-specific regret scaling when the reward gaps are relatively large. We also demonstrate that, in this regime, the posterior beliefs underlying Thompson sampling are highly unstable over time.
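The simulation sketch below, which is our illustration rather than the paper's analysis, runs Thompson sampling on a two-armed Gaussian bandit whose mean-reward gap shrinks like c/√n, the diffusion scaling discussed in the abstract; the prior variance, noise level and constant c are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def thompson_regret(n, c=2.0, prior_var=1.0, noise_sd=1.0):
    """Regret of Gaussian Thompson sampling over n steps when the gap
    between the two arms is c / sqrt(n)."""
    means = np.array([0.0, c / np.sqrt(n)])        # gap of order 1/sqrt(n)
    post_mean = np.zeros(2)
    post_prec = np.full(2, 1.0 / prior_var)        # posterior precisions
    regret = 0.0
    for _ in range(n):
        samples = post_mean + rng.standard_normal(2) / np.sqrt(post_prec)
        a = int(np.argmax(samples))                # Thompson draw
        r = means[a] + noise_sd * rng.standard_normal()
        # Conjugate Gaussian posterior update for the played arm.
        post_prec[a] += 1.0 / noise_sd**2
        post_mean[a] += (r - post_mean[a]) / (noise_sd**2 * post_prec[a])
        regret += means.max() - means[a]
    return regret

for n in (1_000, 4_000, 16_000):
    print(n, round(np.mean([thompson_regret(n) for _ in range(25)]), 2))
```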
Motivated by packet routing in computer networks, online queuing systems are composed of queues receiving packets at different rates. Repeatedly, they send packets to servers, each of which can treat at most one packet at a time. In the centralized case, the number of accumulated packets remains bounded (i.e., the system is stable) as long as the ratio between service rates and arrival rates is larger than 1. In the decentralized case, individual no-regret strategies ensure stability when this ratio is larger than 2. Yet, myopically minimizing regret disregards the long-term effects due to the carryover of packets to further rounds. On the other hand, minimizing long-term costs leads to stable Nash equilibria as soon as the ratio exceeds e/(e−1). Stability with decentralized learning strategies for ratios below 2 remained a major open question. We first argue that for ratios up to 2, cooperation is required for stability of learning strategies, as selfish minimization of policy regret, a patient notion of regret, might indeed still be unstable in this case. We therefore consider cooperative queues and propose the first decentralized learning algorithm guaranteeing stability of the system as long as the ratio of rates is larger than 1, thus reaching performance comparable to centralized strategies.
A matching in a two-sided market often incurs an externality: a matched resource may become unavailable to the other side of the market, at least for a while. This is especially an issue in online platforms involving human experts, as the expert resources are often scarce. The efficient utilization of experts in these platforms is made challenging by the fact that the information available about the parties involved is usually limited. To address this challenge, we develop a model of a task-expert matching system where a task is matched to an expert using not only the prior information about the task but also the feedback obtained from the past matches. In our model, the tasks arrive online while the experts are fixed and constrained by a finite service capacity. For this model, we characterize the maximum task resolution throughput a platform can achieve. We show that the natural greedy approach where each expert is assigned a task most suitable to his or her skill is suboptimal, as it does not internalize the aforementioned externality. We develop a throughput-optimal backpressure algorithm which does so by accounting for the “congestion” among different task types. Finally, we validate our model and confirm our theoretical findings with data-driven simulations via logs of Math.StackExchange.com, a Stack Exchange forum dedicated to mathematics.
Organizers: Céline Comte and Sem Borst
Reinforcement learning (RL) has appealed to many researchers in recent years because of its generality. It is an approach to machine intelligence that learns to achieve a given goal through trial-and-error interactions with its environment. This paper proposes a case-based reinforcement learning algorithm (CRL) for dynamic inventory control in a multi-agent supply-chain system. Traditional time-triggered and event-triggered ordering policies remain popular because they are easy to implement, but in a dynamic environment their results may become inaccurate, causing excessive inventory (cost) or shortages. Under the condition of nonstationary customer demand, the S value of the (T, S) and (Q, S) inventory review methods is learned using the proposed algorithm to satisfy a target service level. A multi-agent simulation of a simplified two-echelon supply chain, in which the proposed algorithm is implemented, is run several times. The results show the effectiveness of CRL for both review methods. We also consider a framework for a general learning method based on the proposed one, which may be helpful in all aspects of supply-chain management (SCM). Hence, it is suggested that well-designed "connections" need to be built between CRL, multi-agent systems (MAS), and SCM.
This paper studies strategic interaction in networks. We focus on games of strategic substitutes and strategic complements, and departing from previous literature, we do not assume particular functional forms on players' payoffs. By exploiting variational methods, we show that the uniqueness, the comparative statics, and the approximation of a Nash equilibrium are determined by a precise relationship between the lowest eigenvalue of the network, a measure of players' payoff concavity, and a parameter capturing the strength of the strategic interaction among players. We apply our framework to the study of aggregative network games, games of mixed interactions, and Bayesian network games.
A large number of stochastic networks including loss networks and certain queueing networks have product-form steady-state probabilities. However, for most practical networks, evaluating the system performance is a difficult task due to the presence of a normalization constant. We propose a new framework based on probabilistic graphical models to tackle this task. Specifically, we use factor graphs to model the stationary distribution of a network. For networks with arbitrary topology, we can apply efficient message-passing algorithms like the sum-product algorithm to compute the exact or approximate marginal distributions of all state variables and related performance measures such as blocking probabilities. Through extensive numerical experiments, we show that the sum-product algorithm returns very accurate blocking probabilities and greatly outperforms the reduced load approximation for loss networks with a variety of topologies. The factor graph model also provides a promising approach for analyzing product-form queueing networks.
The newsvendor problem is one of the most basic and widely applied inventory models. If the probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may not be optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately, as is typically done, and does not require knowledge of the probability distributions of the demand. One can view the optimal order quantities as the labels in the deep neural network. However, unlike most deep learning applications, our model does not know the true labels (order quantities), but rather learns them during the training. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and machine learning approaches, especially for demands with high volatility. Finally, in order to show how this approach can be used for other inventory optimization problems, we provide an extension for (r, Q) policies.
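A minimal sketch of this end-to-end idea, under our own assumptions (synthetic features and demands, and a small fully connected network rather than the paper's architecture): the network maps features directly to an order quantity and is trained on the newsvendor cost itself, so no demand distribution is ever estimated.

```python
import torch
from torch import nn

torch.manual_seed(0)

cp, ch = 4.0, 1.0                      # underage (lost-sale) and overage (holding) costs

def newsvendor_loss(order, demand):
    """Average newsvendor cost of ordering `order` when `demand` realizes."""
    underage = torch.clamp(demand - order, min=0.0)
    overage = torch.clamp(order - demand, min=0.0)
    return (cp * underage + ch * overage).mean()

# Synthetic features (e.g. day-of-week, promotion flag) and demands.
X = torch.rand(2048, 3)
demand = (20 + 30 * X[:, 0] + 10 * X[:, 1] + 5 * torch.randn(2048)).clamp(min=0).unsqueeze(1)

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Training on the newsvendor cost drives model(X) towards the
# cp / (cp + ch) quantile of demand given the features.
for epoch in range(300):
    opt.zero_grad()
    loss = newsvendor_loss(model(X), demand)
    loss.backward()
    opt.step()

print("final newsvendor cost:", float(loss))
```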
We design a simple reinforcement learning agent that, with a specification only of suitable internal state dynamics and a reward function, can operate with some degree of competence in any environment. The agent maintains visitation counts and value estimates for each state-action pair. The value function is updated incrementally in response to temporal differences and optimistic boosts that encourage exploration. The agent executes actions that are greedy with respect to this value function. We establish a regret bound demonstrating convergence to near-optimal per-period performance, where the time taken to achieve near-optimality is polynomial in the number of internal states and actions, as well as the reward averaging time of the best policy within the reference policy class, which is comprised of those that depend on history only through the agent's internal state. Notably, there is no further dependence on the number of environment states or mixing times associated with other policies or statistics of history. Our result sheds light on the potential benefits of (deep) representation learning, which has demonstrated the capability to extract compact and relevant features from high-dimensional interaction histories.
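The tabular sketch below illustrates the ingredients listed in the abstract (visitation counts, incremental temporal-difference updates, optimism, greedy action selection) on a toy chain environment. It is not the paper's agent: in particular, the optimism here enters as a count-based bonus added to the value estimates at action-selection time.

```python
import numpy as np

S, A, gamma, bonus = 8, 2, 0.95, 1.0

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; only the
    rightmost state pays a reward."""
    s_next = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return s_next, 1.0 if s_next == S - 1 else 0.0

Q = np.zeros((S, A))       # value estimates
N = np.zeros((S, A))       # visitation counts

s = 0
for _ in range(20_000):
    # Greedy with respect to the optimistically boosted value estimates;
    # rarely tried actions get a large bonus and hence get explored.
    a = int(np.argmax(Q[s] + bonus / np.sqrt(N[s] + 1)))
    s_next, r = step(s, a)
    N[s, a] += 1
    # Incremental temporal-difference update with step size 1 / count.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += (td_target - Q[s, a]) / N[s, a]
    s = s_next

print(np.round(Q, 2))
```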
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
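For readers who want to see the moving parts, here is a deliberately small DQN sketch (a neural Q-network, experience replay, a periodically refreshed target network, epsilon-greedy exploration) trained on a toy one-dimensional "reach the goal" task instead of Atari; the architecture and hyperparameters are our own simple choices, and most of the paper's refinements are omitted.

```python
import random
from collections import deque

import torch
from torch import nn

torch.manual_seed(0)
random.seed(0)

N_STATES, GOAL = 10, 9

def env_step(s, a):                       # a = 0: move left, a = 1: move right
    s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL

def encode(s):                            # one-hot encoding of the state
    x = torch.zeros(N_STATES)
    x[s] = 1.0
    return x

q_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=5000)
gamma, eps, total_steps = 0.95, 0.2, 0

for episode in range(150):
    s = 0
    for _ in range(60):                   # cap the episode length
        total_steps += 1
        if random.random() < eps:         # epsilon-greedy action selection
            a = random.randrange(2)
        else:
            with torch.no_grad():
                a = int(q_net(encode(s)).argmax())
        s_next, r, done = env_step(s, a)
        replay.append((s, a, r, s_next, done))
        s = s_next

        if len(replay) >= 64:             # one gradient step on a replayed batch
            batch = random.sample(replay, 64)
            states = torch.stack([encode(b[0]) for b in batch])
            actions = torch.tensor([b[1] for b in batch])
            rewards = torch.tensor([b[2] for b in batch])
            next_states = torch.stack([encode(b[3]) for b in batch])
            dones = torch.tensor([float(b[4]) for b in batch])
            with torch.no_grad():
                targets = rewards + gamma * (1 - dones) * target_net(next_states).max(1).values
            q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if total_steps % 200 == 0:        # refresh the target network
            target_net.load_state_dict(q_net.state_dict())
        if done:
            break

print("greedy action per state:", [int(q_net(encode(s)).argmax()) for s in range(N_STATES)])
```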
Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error.
In this work we develop a framework for augmenting online algorithms with a machine learned oracle to achieve competitive ratios that provably improve upon unconditional worst case lower bounds when the oracle has low error. Our approach treats the oracle as a complete black box, and is not dependent on its inner workings, or the exact distribution of its errors.
We apply this framework to the traditional caching problem -- creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead we show how to modify the Marker algorithm to take into account the oracle's predictions, and prove that this combined approach achieves a competitive ratio that both (i) decreases as the oracle's error decreases, and (ii) is always capped by O(log k), which can be achieved without any oracle input. We complement our results with an empirical evaluation of our algorithm on real-world datasets, and show that it performs well empirically even using simple off-the-shelf predictions.
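The sketch below gives a simplified, prediction-augmented Marker-style eviction rule in the spirit of the approach described above; it is not the paper's exact algorithm or analysis, and the oracle predict_next_use is a hypothetical stand-in (here a perfect look-ahead) for the machine-learned predictor.

```python
def predict_next_use(page, time, requests):
    """Stand-in oracle: here it simply looks ahead in the request sequence
    (a perfect predictor); a learned model would replace this."""
    for t in range(time + 1, len(requests)):
        if requests[t] == page:
            return t
    return float("inf")

def predictive_marker(requests, k):
    """Marker-style caching where, on eviction, the unmarked page predicted
    to be requested furthest in the future is evicted."""
    cache, marked = set(), set()
    misses = 0
    for t, page in enumerate(requests):
        if page in cache:
            marked.add(page)
            continue
        misses += 1
        if len(cache) >= k:
            if not (cache - marked):       # all pages marked: start a new phase
                marked.clear()
            unmarked = cache - marked
            victim = max(unmarked, key=lambda p: predict_next_use(p, t, requests))
            cache.remove(victim)
        cache.add(page)
        marked.add(page)
    return misses

requests = [1, 2, 3, 1, 4, 1, 2, 5, 1, 2, 3, 4]
print(predictive_marker(requests, k=3))
```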
The successful launch of electric vehicles (EVs) depends critically on the availability of convenient and economic charging facilities. The problem of scheduling large-scale charging of EVs by a service provider is considered. A Markov decision process model is introduced in which EVs arrive randomly at a charging facility with random demand and completion deadlines. The service provider faces random charging costs, convex non-completion penalties, and a peak power constraint that limits the maximum number of simultaneously active EV chargers.
Formulated as a restless multi-armed bandit problem, the EV charging problem is shown to be indexable. A closed-form expression for Whittle's index is obtained for the case when the charging costs are constant. The Whittle index policy, however, is not optimal in general. An enhancement of the Whittle index policy based on spatial interchange according to the less-laxity and longer-processing-time principle is presented. The proposed policy outperforms existing charging algorithms, especially when the charging costs are time-varying.
We formulate a model of sequential decision making, dubbed the Goal Prediction game, to study the extent to which an overseeing adversary can predict the final goal of an agent who tries to reach that goal quickly, through a sequence of intermediate actions. Our formulation is motivated by the increasing ubiquity of large-scale surveillance and data collection infrastructures, which can be used to predict an agent’s intentions and future actions, despite the agent’s desire for privacy.
Our main result shows that with a carefully chosen agent strategy, the probability that the agent’s goal is correctly predicted by an adversary can be made inversely proportional to the time that the agent is willing to spend in reaching the goal, but cannot be made any smaller than that. Moreover, this characterization depends on the topology of the agent’s state space only through its diameter.
Recently, efforts were made to standardize Signal Phase and Timing (SPaT) messages. Such messages contain the current signal phase with a prediction of the corresponding residual time for all approaches of a signalized intersection. Hence, the information can be utilized for the motion planning of human-driven or autonomously operated individual or public transport vehicles. Consequently, this leads to a more homogeneous traffic flow and a smoother speed profile. Unfortunately, adaptive signal control systems make it difficult to predict the SPaT information accurately. In this paper, we propose a novel machine learning approach to forecast the time series of residual times. A prediction framework that utilizes a Random Survival Forest (RSF) and a Long Short-Term Memory (LSTM) neural network is implemented. The machine learning models are compared to a Linear Regression (LR) model. As a proof of concept, the models are applied to a case study in the city of Zurich. Results show that the machine learning models outperform the LR approach, and in particular, the LSTM neural network is a promising tool for the enhancement of SPaT messages.
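As a minimal illustration of the LSTM component, the sketch below fits a small LSTM regressor to a synthetic saw-tooth "residual time" signal; it is not the paper's prediction framework (which combines a Random Survival Forest and an LSTM on real SPaT data), and every modelling choice here is our own.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy periodic signal standing in for a residual-time series.
series = torch.tensor([float(t % 30) for t in range(600)])
window = 20
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

class ResidualTimeLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)             # out: (batch, window, hidden)
        return self.head(out[:, -1])      # predict from the last time step

model = ResidualTimeLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

print("training MSE:", float(loss))
```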
Novel advanced policy gradient (APG) methods, such as Trust Region Policy Optimization and Proximal Policy Optimization (PPO), have become the dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for notoriously difficult queueing network control problems is a Markov decision process (MDP) with three features: infinite state space, unbounded costs, and a long-run average cost objective. We extend the theoretical framework of these APG methods to such MDPs. The resulting PPO algorithm is tested on a parallel-server system and on large-size multiclass queueing networks. The algorithm consistently generates control policies that outperform state-of-the-art heuristics from the literature in a variety of load conditions, from light to heavy traffic. These policies are demonstrated to be near-optimal when the optimal policy can be computed.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function via sampling. First, we use a discounted relative value function as an approximation of the relative value function. Second, we propose regenerative simulation to estimate the discounted relative value function. Finally, we incorporate the approximating martingale-process method into the regenerative estimator.
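To illustrate the regenerative idea in the simplest possible setting, the sketch below estimates the (undiscounted) relative value function of a toy discrete-time single-server queue under a fixed policy, using cycles that regenerate whenever the queue empties; the paper's discounted variant and approximating-martingale correction are omitted, and all parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

p_arr, p_dep = 0.4, 0.6      # arrival / departure probabilities per slot

def step(x):
    """One slot of a toy single-server queue; cost per slot is the queue length."""
    x += rng.random() < p_arr
    if x > 0 and rng.random() < p_dep:
        x -= 1
    return x

def cycle_from(x0):
    """Simulate from x0 until the queue first empties; return the
    accumulated cost and the number of slots."""
    x, cost, length = x0, 0.0, 0
    while True:
        cost += x
        length += 1
        x = step(x)
        if x == 0:
            return cost, length

# Step 1: estimate the long-run average cost eta by renewal-reward over
# regeneration cycles started from the empty queue.
cycles = [cycle_from(0) for _ in range(20_000)]
eta = sum(c for c, _ in cycles) / sum(l for _, l in cycles)

# Step 2: estimate the relative value h(x) = E[sum_{t<tau} (cost_t - eta) | X_0 = x],
# where tau is the hitting time of the empty queue, so that h(0) = 0.
def relative_value(x0, n=10_000):
    vals = [c - eta * l for c, l in (cycle_from(x0) for _ in range(n))]
    return float(np.mean(vals))

print({x: round(relative_value(x), 2) for x in range(5)})
```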
Consider n random variables forming a Markov random field (MRF). The true model of the MRF is unknown, and it is assumed to belong to a binary set. The objective is to sequentially sample the random variables (one at a time) such that the true MRF model can be detected with the fewest number of samples, while, in parallel, the decision reliability is controlled. The core element of an optimal decision process is a rule for selecting and sampling the random variables over time. Such a process, at every time instant and adaptively to the collected data, selects the random variable that is expected to be most informative about the model, rendering an overall minimized number of samples required for reaching a reliable decision. The existing studies on detecting MRF structures generally sample the entire network at the same time and focus on designing optimal detection rules without regard to the data-acquisition process. This paper characterizes the sampling process for general MRFs, which, in conjunction with the sequential probability ratio test, is shown to be asymptotically optimal as n grows large. The critical insight in designing the sampling process is devising an information measure that captures the decisions' inherent statistical dependence over time. Furthermore, when the MRFs can be modeled by acyclic probabilistic graphical models, the sampling rule is shown to take a computationally simple form. Performance analysis for the general case is provided, and the results are interpreted in several special cases: Gaussian MRFs, non-asymptotic regimes, connection to Chernoff's rule for controlled (active) sensing, and the problem of cluster detection.
Graph learning is an inference problem of estimating the connectivity of a graph from a collection of epidemic cascades, with many useful applications in the areas of online/offline social networks, p2p networks, computer security, and epidemiology. We consider a practical scenario in which the information in cascade samples is only partially observed under the independent cascade (IC) model. For the graph learning problem, we propose an efficient algorithm that solves a localized version of computationally-intractable maximum likelihood estimation through approximations in both temporal and spatial aspects. Our algorithm iterates the operations of recovering missing time logs and inferring graph connectivity, and thereby progressively improves the inference quality. We study the sample complexity, which is the number of cascade samples required to meet a given inference quality, and show that it is asymptotically close to a lower bound, thus near-order-optimal in terms of the number of nodes. We evaluate the performance of our algorithm using five real-world social networks, whose sizes range from 20 to 900, and demonstrate that our algorithm performs better than other competing algorithms in terms of accuracy while maintaining fast running time.