Welcome to the webpage of the reading group "Stochastic Networks and Learning" organized by the Stochastic Operations Research group at the Eindhoven University of Technology!
Learning techniques are becoming increasingly relevant to the design and optimization of stochastic networks as the size and complexity of these systems increase. In this reading group, we propose to share our findings on this topic by organizing biweekly meetings where a group member presents a paper of their choice.
This page lists the past and future papers presented at the reading group.
Once or twice a month, a member of the reading group gives a lecture on a document (research paper, survey paper, book chapter...) of their choice. During this lecture, the rest of the group is encouraged to ask questions and initiate discussions. The average workload per member is very light, since each member presents a paper only about once every [number of participants]/2 months.
The objective is to make the presentation as interactive as possible despite the online format. Therefore, we propose that the speaker uses one of the following two media:
Organizers: Elene Anton, Jaap Storm, Céline Comte, and Sem Borst
Intelligent traffic light control is critical for an efficient transportation system. While existing traffic lights are mostly operated by hand-crafted rules, an intelligent traffic light control system should adjust dynamically to real-time traffic. There is an emerging trend of using deep reinforcement learning techniques for traffic light control, and recent studies have shown promising results. However, existing studies have not yet tested these methods on real-world traffic data, and they focus only on the rewards without interpreting the learned policies. In this paper, we propose a more effective deep reinforcement learning model for traffic light control. We test our method on a large-scale real traffic dataset obtained from surveillance cameras. We also show some interesting case studies of policies learned from the real data.
We first review recent challenges in reinforcement learning applied to stochastic networks.
We then focus on reinforcement learning control problems under the average-reward criterion in which non-zero rewards are both sparse and rare, that is, they occur in very few states and the steady-state probability of observing them is extremely small. For such problems, we propose a new approach that exploits prior knowledge of the sparse structure of the environment. The core idea is to use renewal theory and Fleming-Viot particle systems to construct estimators of quantities relevant to reinforcement learning, for which we provide theoretical guarantees on the estimation error.
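To give a concrete flavour of the Fleming-Viot idea, here is a small, self-contained simulation sketch (our illustration, not the construction from the talk): N particles evolve as copies of a toy birth-death chain, and any particle that hits the frequent "absorbing" state is relocated to the position of another particle, so that the empirical occupation measure approximates the distribution of the chain conditioned on staying in the rarely visited region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (our assumption, not the talk's environment): a random walk
# on {0, 1, ..., K} that drifts towards 0, so states near K are rare.
K = 20
p_up = 0.3        # probability of moving up; otherwise the walk moves down
N = 100           # number of Fleming-Viot particles
T = 5_000         # number of simulation steps

def walk_step(x):
    """One transition of the underlying chain, reflected at K."""
    return min(x + 1, K) if rng.random() < p_up else x - 1

# Fleming-Viot particle system: particles evolve independently, and any
# particle that hits the absorbing state 0 instantly jumps to the position
# of another, uniformly chosen, particle.
particles = rng.integers(1, K + 1, size=N)
occupation = np.zeros(K + 1)

for _ in range(T):
    for i in range(N):
        x = walk_step(particles[i])
        if x == 0:                       # absorbed: resample from the others
            j = rng.integers(N - 1)
            x = particles[j if j < i else j + 1]
        particles[i] = x
    np.add.at(occupation, particles, 1)  # accumulate the occupation measure

# The normalized occupation measure approximates the law of the walk
# conditioned on avoiding 0, the kind of rare-region quantity that
# renewal/Fleming-Viot estimators target.
print(np.round(occupation / occupation.sum(), 4))
```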
Paper 1: Power networks, responsible for transporting electricity across large geographical regions, are complex infrastructures on which modern life critically depends. Variations in demand and production profiles, with increasing renewable energy integration, as well as the high-voltage network technology, constitute a real challenge for human operators when optimizing electricity transportation while avoiding blackouts. Motivated to investigate the potential of AI methods in enabling adaptability in power network operation, we have designed an L2RPN challenge to encourage the development of reinforcement learning solutions to key problems present in next-generation power networks. The NeurIPS 2020 competition was well received by the international community, attracting over 300 participants worldwide.
The main contribution of this challenge is our proposed comprehensive 'Grid2Op' framework, and associated benchmark, which plays out realistic sequential network operation scenarios. The Grid2Op framework, which is open-source and easily re-usable, allows users to define new environments with its companion GridAlive ecosystem. Grid2Op relies on existing non-linear physical power network simulators and lets users create a series of perturbations and challenges that are representative of two important problems: a) the uncertainty resulting from the increased use of unpredictable renewable energy sources, and b) the robustness required with contingent line disconnections. In this paper, we give the competition highlights. We present the benchmark suite and analyse the winning solutions, including one super-human performance demonstration. We propose our organizational insights for a successful competition and conclude on open research avenues. Given the challenge success, we expect our work will foster research to create more sustainable solutions for power network operations.
Paper 2: Safe and reliable electricity transmission in power grids is crucial for modern society. It is thus quite natural that there has been a growing interest in the automatic management of power grids, exemplified by the Learning to Run a Power Network Challenge (L2RPN), modeling the problem as a reinforcement learning (RL) task. However, it is highly challenging to manage a real-world scale power grid, mostly due to the massive scale of its state and action spaces. In this paper, we present an off-policy actor-critic approach that effectively tackles the unique challenges in power grid management by RL, adopting a hierarchical policy together with an afterstate representation. Our agent ranked first in the latest challenge (L2RPN WCCI 2020), being able to avoid disastrous situations while maintaining the highest level of operational efficiency in every test scenario. This paper provides a formal description of the algorithmic aspect of our approach, as well as further experimental studies on diverse power grids.
Motivated by diverse application areas such as healthcare, call centers, and crowdsourcing, we consider the design and operation of service systems that process tasks with types that are ex ante unknown, and employ servers with different skill sets. Our benchmark model involves two types of tasks, Easy and Hard, and servers that are either Junior or Senior in their abilities. The service provider determines a resource allocation policy, i.e., how to assign tasks to servers over time, with the goal of maximizing the system’s long-term throughput. Information about a task’s type can only be obtained while serving it. In particular, the more time a Junior server spends on a task without service completion, the higher her belief that the task is Hard and thus needs to be rerouted to a Senior server. This interplay between service time and task-type uncertainty implies that the system’s resource allocation policy and staffing levels implicitly determine how the provider prioritizes between learning and actually serving. We show that the performance loss due to the uncertainty in task types can be significant and, interestingly, that the system’s stability region is largely dependent on the rate at which information about tasks’ types is generated. Furthermore, we consider endogenizing the servers’ capabilities: assuming that training is costly, we explore the problem of jointly optimizing over the training levels of the system’s server pools, the staffing levels, and the resource allocation policy. We find that among optimal designs there always exists one with a “hierarchical” structure, where all tasks are initially routed to the least skilled servers and then progressively move to more skilled ones, if necessary. Comparative statics indicate that uncertainty in task types leads to significantly higher staffing cost and less specialized server pools.
In this paper, we study learning-assisted multi-user scheduling for the wireless downlink. Many scheduling algorithms have been developed that optimize for a plethora of performance metrics; however, a systematic approach across diverse performance metrics and deployment scenarios is still lacking. We address this by developing a meta-scheduler: given a diverse collection of schedulers, we develop a learning-based overlay algorithm (meta-scheduler) that selects the "best" scheduler from amongst these for each deployment scenario. More formally, we develop a multi-armed bandit (MAB) framework for meta-scheduling that assigns and adapts a score for each scheduler to maximize reward (e.g., mean delay, timely throughput, etc.). The meta-scheduler is based on a variant of the Upper Confidence Bound (UCB) algorithm, but adapted to interrupt the queuing dynamics at the base station so as to filter out schedulers that might render the system unstable. We show that the algorithm has a poly-logarithmic regret in the expected reward with respect to a genie that chooses the optimal scheduler for each scenario. Finally, through simulation, we show that the meta-scheduler learns the choice of scheduler that best adapts to the deployment scenario (e.g., load conditions, performance metrics).
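As a toy illustration of the bandit view of meta-scheduling, the sketch below runs plain UCB1 over a set of candidate schedulers; the scheduler names, the reward function and all constants are placeholders, and the paper's queue-aware safeguards against unstable schedulers are omitted.

```python
import math
import random

# Illustrative sketch only: plain UCB1 over a set of candidate schedulers.
SCHEDULERS = ["max-weight", "proportional-fair", "round-robin"]

def run_episode(scheduler_name):
    """Placeholder for deploying a scheduler for one episode and measuring
    its reward (higher is better, e.g. negative mean delay); replace with
    a real simulator."""
    base = {"max-weight": 0.7, "proportional-fair": 0.8, "round-robin": 0.5}
    return base[scheduler_name] + random.gauss(0, 0.1)

counts = {s: 0 for s in SCHEDULERS}
totals = {s: 0.0 for s in SCHEDULERS}

for t in range(1, 2001):
    # Try each scheduler once, then pick the one with the largest UCB score.
    untried = [s for s in SCHEDULERS if counts[s] == 0]
    if untried:
        choice = untried[0]
    else:
        choice = max(
            SCHEDULERS,
            key=lambda s: totals[s] / counts[s]
            + math.sqrt(2 * math.log(t) / counts[s]),
        )
    r = run_episode(choice)
    counts[choice] += 1
    totals[choice] += r

print({s: round(totals[s] / counts[s], 3) for s in SCHEDULERS}, counts)
```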
Paper 1: The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as multi-armed restless bandits. In this paper, we develop QWI, an algorithm based on Q-learning that learns the Whittle indices. The key feature is the deployment of two time-scales, a relatively faster one to update the state-action Q-functions, and a relatively slower one to update the Whittle indices. In our main result, we show that the algorithm converges to the Whittle indices of the problem. Numerical computations show that our algorithm converges much faster than both the standard Q-learning algorithm and neural-network-based approximate Q-learning.
Paper 2: A novel reinforcement learning algorithm is introduced for multi-armed restless bandits with average reward, using the paradigms of Q-learning and the Whittle index. Specifically, we leverage the structure of the Whittle index policy to reduce the search space of Q-learning, resulting in major computational gains. Rigorous convergence analysis is provided, supported by numerical experiments that show excellent empirical performance of the proposed scheme.
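Both papers learn Whittle indices with Q-learning on two timescales. The sketch below illustrates that structure on a toy single restless arm with a discounted criterion (the papers treat average reward); it is our simplified illustration, not either paper's algorithm, and the arm dynamics and step sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy arm: states 0..S-1, action 1 ("active") tends to push the state up,
# action 0 ("passive") lets it decay; the reward equals the state when active.
S, gamma = 5, 0.9
alpha, beta = 0.1, 0.01          # fast (Q) and slow (index) step sizes

def step(x, a):
    if a == 1:
        nx = min(x + 1, S - 1) if rng.random() < 0.7 else max(x - 1, 0)
    else:
        nx = max(x - 1, 0) if rng.random() < 0.7 else x
    return nx, float(x) * a      # reward only when active

lam = np.zeros(S)                # Whittle index estimates, one per state
Q = np.zeros((S, S, 2))          # Q[s_ref]: Q-table of the lam[s_ref]-subsidized arm

for _ in range(100_000):
    s_ref = rng.integers(S)      # reference state whose index we refine
    x = rng.integers(S)
    a = rng.integers(2)          # exploratory (uniformly random) action
    nx, r = step(x, a)
    subsidy = lam[s_ref] if a == 0 else 0.0
    # Fast timescale: Q-learning for the subsidized arm.
    target = r + subsidy + gamma * Q[s_ref, nx].max()
    Q[s_ref, x, a] += alpha * (target - Q[s_ref, x, a])
    # Slow timescale: move lam[s_ref] towards the subsidy that makes the
    # reference state indifferent between active and passive.
    lam[s_ref] += beta * (Q[s_ref, s_ref, 1] - Q[s_ref, s_ref, 0])

print("estimated Whittle indices:", np.round(lam, 2))
```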
With the rapid advance of information technology, network systems have become increasingly complex and hence the underlying system dynamics are often unknown or difficult to characterize. Finding a good network control policy is of significant importance to achieve desirable network performance (e.g., high throughput or low delay). In this work, we consider using model-based reinforcement learning (RL) to learn the optimal control policy for queueing networks so that the average job delay (or equivalently the average queue backlog) is minimized. Traditional approaches in RL, however, cannot handle the unbounded state spaces of the network control problem. To overcome this difficulty, we propose a new algorithm, called Reinforcement Learning for Queueing Networks (RL-QN), which applies model-based RL methods over a finite subset of the state space, while applying a known stabilizing policy for the rest of the states. We establish that the average queue backlog under RL-QN with an appropriately constructed subset can be arbitrarily close to the optimal result. We evaluate RL-QN in dynamic server allocation, routing and switching problems. Simulation results show that RL-QN minimizes the average queue backlog effectively.
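The sketch below illustrates the dispatch rule at the heart of this idea, under our own simplifying assumptions: act with a learned (here, dummy) tabular policy whenever all queue lengths lie inside a finite box, and fall back to a known stabilizing rule, here serve-the-longest-queue, outside it. The training of the model-based agent is omitted.

```python
import numpy as np

B = 10                                   # truncation level per queue

def stabilizing_policy(q):
    """Known stabilizing fallback: serve the longest queue."""
    return int(np.argmax(q))

def learned_policy(q, Q_table):
    """Greedy action from a tabular model-based RL agent trained on the
    truncated state space (training loop omitted)."""
    return int(np.argmin(Q_table[tuple(q)]))   # minimize expected backlog

def act(q, Q_table):
    if all(x <= B for x in q):
        return learned_policy(q, Q_table)      # inside the finite subset
    return stabilizing_policy(q)               # outside: known stable policy

# Example: two queues, a dummy Q-table indexed by (q1, q2, action).
Q_table = np.zeros((B + 1, B + 1, 2))
print(act([3, 7], Q_table))    # inside the box: learned policy
print(act([3, 14], Q_table))   # outside the box: stabilizing fallback
```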
We propose a new diffusion-asymptotic analysis for sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with n time steps, we let the mean reward gaps between actions scale to the order 1/√n so as to preserve the difficulty of the learning task as n grows. In this regime, we show that the behavior of a class of sequentially randomized Markov experiments converges to a diffusion limit, given as the solution of a stochastic differential equation. The diffusion limit thus enables us to derive a refined, instance-specific characterization of the stochastic dynamics of adaptive experiments. As an application of this framework, we use the diffusion limit to obtain several new insights on the regret and belief evolution of Thompson sampling. We show that a version of Thompson sampling with an asymptotically uninformative prior variance achieves nearly-optimal instance-specific regret scaling when the reward gaps are relatively large. We also demonstrate that, in this regime, the posterior beliefs underlying Thompson sampling are highly unstable over time.
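The simulation sketch below, which is our illustration rather than the paper's analysis, runs Thompson sampling on a two-armed Gaussian bandit whose mean-reward gap shrinks like c/√n, the diffusion scaling discussed in the abstract; the prior variance, noise level and constant c are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def thompson_regret(n, c=2.0, prior_var=1.0, noise_sd=1.0):
    """Regret of Gaussian Thompson sampling over n steps when the gap
    between the two arms is c / sqrt(n)."""
    means = np.array([0.0, c / np.sqrt(n)])        # gap of order 1/sqrt(n)
    post_mean = np.zeros(2)
    post_prec = np.full(2, 1.0 / prior_var)        # posterior precisions
    regret = 0.0
    for _ in range(n):
        samples = post_mean + rng.standard_normal(2) / np.sqrt(post_prec)
        a = int(np.argmax(samples))                # Thompson draw
        r = means[a] + noise_sd * rng.standard_normal()
        # Conjugate Gaussian posterior update for the played arm.
        post_prec[a] += 1.0 / noise_sd**2
        post_mean[a] += (r - post_mean[a]) / (noise_sd**2 * post_prec[a])
        regret += means.max() - means[a]
    return regret

for n in (1_000, 4_000, 16_000):
    print(n, round(np.mean([thompson_regret(n) for _ in range(25)]), 2))
```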
Motivated by packet routing in computer networks, online queuing systems are composed of queues receiving packets at different rates. Repeatedly, they send packets to servers, each of which can treat at most one packet at a time. In the centralized case, the number of accumulated packets remains bounded (i.e., the system is stable) as long as the ratio between service rates and arrival rates is larger than 1. In the decentralized case, individual no-regret strategies ensure stability when this ratio is larger than 2. Yet, myopically minimizing regret disregards the long-term effects due to the carryover of packets to further rounds. On the other hand, minimizing long-term costs leads to stable Nash equilibria as soon as the ratio exceeds e/(e−1). Stability with decentralized learning strategies for ratios below 2 remained a major open question. We first argue that for ratios up to 2, cooperation is required for stability of learning strategies, as selfish minimization of policy regret, a patient notion of regret, might indeed still be unstable in this case. We therefore consider cooperative queues and propose the first decentralized learning algorithm guaranteeing stability of the system as long as the ratio of rates is larger than 1, thus reaching performance comparable to centralized strategies.
A matching in a two-sided market often incurs an externality: a matched resource may become unavailable to the other side of the market, at least for a while. This is especially an issue in online platforms involving human experts, as the expert resources are often scarce. The efficient utilization of experts in these platforms is made challenging by the fact that the information available about the parties involved is usually limited. To address this challenge, we develop a model of a task-expert matching system where a task is matched to an expert using not only the prior information about the task but also the feedback obtained from the past matches. In our model, the tasks arrive online while the experts are fixed and constrained by a finite service capacity. For this model, we characterize the maximum task resolution throughput a platform can achieve. We show that the natural greedy approach where each expert is assigned a task most suitable to his or her skill is suboptimal, as it does not internalize the aforementioned externality. We develop a throughput-optimal backpressure algorithm which does so by accounting for the “congestion” among different task types. Finally, we validate our model and confirm our theoretical findings with data-driven simulations via logs of Math.StackExchange.com, a Stack Exchange forum dedicated to mathematics.
Organizers: Céline Comte and Sem Borst
Reinforcement learning (RL) has appealed to many researchers in recent years because of its generality. It is an approach to machine intelligence that learns to achieve a given goal through trial-and-error interactions with its environment. This paper proposes a case-based reinforcement learning algorithm (CRL) for dynamic inventory control in a multi-agent supply-chain system. Traditional time-triggered and event-triggered ordering policies remain popular because they are easy to implement, but in a dynamic environment their results may become inaccurate, causing excessive inventory (cost) or shortages. Under the condition of nonstationary customer demand, the S value of the (T, S) and (Q, S) inventory review methods is learned using the proposed algorithm to satisfy a target service level. A multi-agent simulation of a simplified two-echelon supply chain, in which the proposed algorithm is implemented, is run several times. The results show the effectiveness of CRL for both review methods. We also consider a framework for a general learning method based on the proposed one, which may be helpful in all aspects of supply-chain management (SCM). Hence, it is suggested that well-designed "connections" need to be built between CRL, multi-agent systems (MAS), and SCM.
This paper studies strategic interaction in networks. We focus on games of strategic substitutes and strategic complements, and departing from previous literature, we do not assume particular functional forms on players' payoffs. By exploiting variational methods, we show that the uniqueness, the comparative statics, and the approximation of a Nash equilibrium are determined by a precise relationship between the lowest eigenvalue of the network, a measure of players' payoff concavity, and a parameter capturing the strength of the strategic interaction among players. We apply our framework to the study of aggregative network games, games of mixed interactions, and Bayesian network games.
A large number of stochastic networks including loss networks and certain queueing networks have product-form steady-state probabilities. However, for most practical networks, evaluating the system performance is a difficult task due to the presence of a normalization constant. We propose a new framework based on probabilistic graphical models to tackle this task. Specifically, we use factor graphs to model the stationary distribution of a network. For networks with arbitrary topology, we can apply efficient message-passing algorithms like the sum-product algorithm to compute the exact or approximate marginal distributions of all state variables and related performance measures such as blocking probabilities. Through extensive numerical experiments, we show that the sum-product algorithm returns very accurate blocking probabilities and greatly outperforms the reduced load approximation for loss networks with a variety of topologies. The factor graph model also provides a promising approach for analyzing product-form queueing networks.
The newsvendor problem is one of the most basic and widely applied inventory models. If the probability distribution of the demand is known, the problem can be solved analytically. However, approximating the probability distribution is not easy and is prone to error; therefore, the resulting solution to the newsvendor problem may not be optimal. To address this issue, we propose an algorithm based on deep learning that optimizes the order quantities for all products based on features of the demand data. Our algorithm integrates the forecasting and inventory-optimization steps, rather than solving them separately, as is typically done, and does not require knowledge of the probability distributions of the demand. One can view the optimal order quantities as the labels in the deep neural network. However, unlike most deep learning applications, our model does not know the true labels (order quantities), but rather learns them during the training. Numerical experiments on real-world data suggest that our algorithm outperforms other approaches, including data-driven and machine learning approaches, especially for demands with high volatility. Finally, in order to show how this approach can be used for other inventory optimization problems, we provide an extension for (r, Q) policies.
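A minimal sketch of this end-to-end idea, under our own assumptions (synthetic features and demands, and a small fully connected network rather than the paper's architecture): the network maps features directly to an order quantity and is trained on the newsvendor cost itself, so no demand distribution is ever estimated.

```python
import torch
from torch import nn

torch.manual_seed(0)

cp, ch = 4.0, 1.0                      # underage (lost-sale) and overage (holding) costs

def newsvendor_loss(order, demand):
    """Average newsvendor cost of ordering `order` when `demand` realizes."""
    underage = torch.clamp(demand - order, min=0.0)
    overage = torch.clamp(order - demand, min=0.0)
    return (cp * underage + ch * overage).mean()

# Synthetic features (e.g. day-of-week, promotion flag) and demands.
X = torch.rand(2048, 3)
demand = (20 + 30 * X[:, 0] + 10 * X[:, 1] + 5 * torch.randn(2048)).clamp(min=0).unsqueeze(1)

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Training on the newsvendor cost drives model(X) towards the
# cp / (cp + ch) quantile of demand given the features.
for epoch in range(300):
    opt.zero_grad()
    loss = newsvendor_loss(model(X), demand)
    loss.backward()
    opt.step()

print("final newsvendor cost:", float(loss))
```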
We design a simple reinforcement learning agent that, with a specification only of suitable internal state dynamics and a reward function, can operate with some degree of competence in any environment. The agent maintains visitation counts and value estimates for each state-action pair. The value function is updated incrementally in response to temporal differences and optimistic boosts that encourage exploration. The agent executes actions that are greedy with respect to this value function. We establish a regret bound demonstrating convergence to near-optimal per-period performance, where the time taken to achieve near-optimality is polynomial in the number of internal states and actions, as well as the reward averaging time of the best policy within the reference policy class, which is comprised of those that depend on history only through the agent's internal state. Notably, there is no further dependence on the number of environment states or mixing times associated with other policies or statistics of history. Our result sheds light on the potential benefits of (deep) representation learning, which has demonstrated the capability to extract compact and relevant features from high-dimensional interaction histories.
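The tabular sketch below illustrates the ingredients listed in the abstract (visitation counts, incremental temporal-difference updates, optimism, greedy action selection) on a toy chain environment. It is not the paper's agent: in particular, the optimism here enters as a count-based bonus added to the value estimates at action-selection time.

```python
import numpy as np

S, A, gamma, bonus = 8, 2, 0.95, 1.0

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; only the
    rightmost state pays a reward."""
    s_next = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return s_next, 1.0 if s_next == S - 1 else 0.0

Q = np.zeros((S, A))       # value estimates
N = np.zeros((S, A))       # visitation counts

s = 0
for _ in range(20_000):
    # Greedy with respect to the optimistically boosted value estimates;
    # rarely tried actions get a large bonus and hence get explored.
    a = int(np.argmax(Q[s] + bonus / np.sqrt(N[s] + 1)))
    s_next, r = step(s, a)
    N[s, a] += 1
    # Incremental temporal-difference update with step size 1 / count.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += (td_target - Q[s, a]) / N[s, a]
    s = s_next

print(np.round(Q, 2))
```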
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
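For readers who want to see the moving parts, here is a deliberately small DQN sketch (a neural Q-network, experience replay, a periodically refreshed target network, epsilon-greedy exploration) trained on a toy one-dimensional "reach the goal" task instead of Atari; the architecture and hyperparameters are our own simple choices, and most of the paper's refinements are omitted.

```python
import random
from collections import deque

import torch
from torch import nn

torch.manual_seed(0)
random.seed(0)

N_STATES, GOAL = 10, 9

def env_step(s, a):                       # a = 0: move left, a = 1: move right
    s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL

def encode(s):                            # one-hot encoding of the state
    x = torch.zeros(N_STATES)
    x[s] = 1.0
    return x

q_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=5000)
gamma, eps, total_steps = 0.95, 0.2, 0

for episode in range(150):
    s = 0
    for _ in range(60):                   # cap the episode length
        total_steps += 1
        if random.random() < eps:         # epsilon-greedy action selection
            a = random.randrange(2)
        else:
            with torch.no_grad():
                a = int(q_net(encode(s)).argmax())
        s_next, r, done = env_step(s, a)
        replay.append((s, a, r, s_next, done))
        s = s_next

        if len(replay) >= 64:             # one gradient step on a replayed batch
            batch = random.sample(replay, 64)
            states = torch.stack([encode(b[0]) for b in batch])
            actions = torch.tensor([b[1] for b in batch])
            rewards = torch.tensor([b[2] for b in batch])
            next_states = torch.stack([encode(b[3]) for b in batch])
            dones = torch.tensor([float(b[4]) for b in batch])
            with torch.no_grad():
                targets = rewards + gamma * (1 - dones) * target_net(next_states).max(1).values
            q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if total_steps % 200 == 0:        # refresh the target network
            target_net.load_state_dict(q_net.state_dict())
        if done:
            break

print("greedy action per state:", [int(q_net(encode(s)).argmax()) for s in range(N_STATES)])
```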
Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error.
In this work we develop a framework for augmenting online algorithms with a machine learned oracle to achieve competitive ratios that provably improve upon unconditional worst case lower bounds when the oracle has low error. Our approach treats the oracle as a complete black box, and is not dependent on its inner workings, or the exact distribution of its errors.
We apply this framework to the traditional caching problem -- creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead we show how to modify the Marker algorithm to take into account the oracle's predictions, and prove that this combined approach achieves a competitive ratio that both (i) decreases as the oracle's error decreases, and (ii) is always capped by O(log k), which can be achieved without any oracle input. We complement our results with an empirical evaluation of our algorithm on real-world datasets, and show that it performs well empirically even using simple off-the-shelf predictions.
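The sketch below gives a simplified, prediction-augmented Marker-style eviction rule in the spirit of the approach described above; it is not the paper's exact algorithm or analysis, and the oracle predict_next_use is a hypothetical stand-in (here a perfect look-ahead) for the machine-learned predictor.

```python
def predict_next_use(page, time, requests):
    """Stand-in oracle: here it simply looks ahead in the request sequence
    (a perfect predictor); a learned model would replace this."""
    for t in range(time + 1, len(requests)):
        if requests[t] == page:
            return t
    return float("inf")

def predictive_marker(requests, k):
    """Marker-style caching where, on eviction, the unmarked page predicted
    to be requested furthest in the future is evicted."""
    cache, marked = set(), set()
    misses = 0
    for t, page in enumerate(requests):
        if page in cache:
            marked.add(page)
            continue
        misses += 1
        if len(cache) >= k:
            if not (cache - marked):       # all pages marked: start a new phase
                marked.clear()
            unmarked = cache - marked
            victim = max(unmarked, key=lambda p: predict_next_use(p, t, requests))
            cache.remove(victim)
        cache.add(page)
        marked.add(page)
    return misses

requests = [1, 2, 3, 1, 4, 1, 2, 5, 1, 2, 3, 4]
print(predictive_marker(requests, k=3))
```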
The successful launch of electric vehicles (EVs) depends critically on the availability of convenient and economic charging facilities. The problem of scheduling large-scale charging of EVs by a service provider is considered. A Markov decision process model is introduced in which EVs arrive randomly at a charging facility with random demand and completion deadlines. The service provider faces random charging costs, convex non-completion penalties, and a peak power constraint that limits the maximum number of simultaneously active EV chargers.
Formulated as a restless multi-armed bandit problem, the EV charging problem is shown to be indexable. A closed-form expression for Whittle's index is obtained for the case when the charging costs are constant. The Whittle index policy, however, is not optimal in general. An enhancement of the Whittle index policy based on spatial interchange according to the less-laxity and longer-processing-time principle is presented. The proposed policy outperforms existing charging algorithms, especially when the charging costs are time-varying.
We formulate a model of sequential decision making, dubbed the Goal Prediction game, to study the extent to which an overseeing adversary can predict the final goal of an agent who tries to reach that goal quickly, through a sequence of intermediate actions. Our formulation is motivated by the increasing ubiquity of large-scale surveillance and data collection infrastructures, which can be used to predict an agent’s intentions and future actions, despite the agent’s desire for privacy.
Our main result shows that with a carefully chosen agent strategy, the probability that the agent’s goal is correctly predicted by an adversary can be made inversely proportional to the time that the agent is willing to spend in reaching the goal, but cannot be made any smaller than that. Moreover, this characterization depends on the topology of the agent’s state space only through its diameter.
Recently, efforts were made to standardize Signal Phase and Timing (SPaT) messages. Such messages contain the current signal phase with a prediction of the corresponding residual time for all approaches of a signalized intersection. Hence, the information can be utilized for the motion planning of human-driven or autonomously operated individual or public transport vehicles. Consequently, this leads to a more homogeneous traffic flow and a smoother speed profile. Unfortunately, adaptive signal control systems make it difficult to predict the SPaT information accurately. In this paper, we propose a novel machine learning approach to forecast the time series of residual times. A prediction framework that utilizes a Random Survival Forest (RSF) and a Long Short-Term Memory (LSTM) neural network is implemented. The machine learning models are compared to a Linear Regression (LR) model. As a proof of concept, the models are applied to a case study in the city of Zurich. Results show that the machine learning models outperform the LR approach, and in particular, the LSTM neural network is a promising tool for the enhancement of SPaT messages.
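As a minimal illustration of the LSTM component, the sketch below fits a small LSTM regressor to a synthetic saw-tooth "residual time" signal; it is not the paper's prediction framework (which combines a Random Survival Forest and an LSTM on real SPaT data), and every modelling choice here is our own.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy periodic signal standing in for a residual-time series.
series = torch.tensor([float(t % 30) for t in range(600)])
window = 20
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

class ResidualTimeLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)             # out: (batch, window, hidden)
        return self.head(out[:, -1])      # predict from the last time step

model = ResidualTimeLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

print("training MSE:", float(loss))
```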
Novel advanced policy gradient (APG) methods, such as Trust Region Policy Optimization and Proximal Policy Optimization (PPO), have become the dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for notoriously difficult queueing network control problems is a Markov decision process (MDP) with three features: infinite state space, unbounded costs, and a long-run average cost objective. We extend the theoretical framework of these APG methods to such MDPs. The resulting PPO algorithm is tested on a parallel-server system and on large-size multiclass queueing networks. The algorithm consistently generates control policies that outperform state-of-the-art heuristics from the literature in a variety of load conditions, from light to heavy traffic. These policies are demonstrated to be near-optimal when the optimal policy can be computed.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function via sampling. First, we use a discounted relative value function as an approximation of the relative value function. Second, we propose regenerative simulation to estimate the discounted relative value function. Finally, we incorporate the approximating martingale-process method into the regenerative estimator.
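To illustrate the regenerative idea in the simplest possible setting, the sketch below estimates the (undiscounted) relative value function of a toy discrete-time single-server queue under a fixed policy, using cycles that regenerate whenever the queue empties; the paper's discounted variant and approximating-martingale correction are omitted, and all parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

p_arr, p_dep = 0.4, 0.6      # arrival / departure probabilities per slot

def step(x):
    """One slot of a toy single-server queue; cost per slot is the queue length."""
    x += rng.random() < p_arr
    if x > 0 and rng.random() < p_dep:
        x -= 1
    return x

def cycle_from(x0):
    """Simulate from x0 until the queue first empties; return the
    accumulated cost and the number of slots."""
    x, cost, length = x0, 0.0, 0
    while True:
        cost += x
        length += 1
        x = step(x)
        if x == 0:
            return cost, length

# Step 1: estimate the long-run average cost eta by renewal-reward over
# regeneration cycles started from the empty queue.
cycles = [cycle_from(0) for _ in range(20_000)]
eta = sum(c for c, _ in cycles) / sum(l for _, l in cycles)

# Step 2: estimate the relative value h(x) = E[sum_{t<tau} (cost_t - eta) | X_0 = x],
# where tau is the hitting time of the empty queue, so that h(0) = 0.
def relative_value(x0, n=10_000):
    vals = [c - eta * l for c, l in (cycle_from(x0) for _ in range(n))]
    return float(np.mean(vals))

print({x: round(relative_value(x), 2) for x in range(5)})
```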
Consider n random variables forming a Markov random field (MRF). The true model of the MRF is unknown, and it is assumed to belong to a binary set. The objective is to sequentially sample the random variables (one at a time) such that the true MRF model can be detected with the fewest number of samples, while, in parallel, the decision reliability is controlled. The core element of an optimal decision process is a rule for selecting and sampling the random variables over time. Such a process, at every time instant and adaptively to the collected data, selects the random variable that is expected to be most informative about the model, rendering an overall minimized number of samples required for reaching a reliable decision. The existing studies on detecting MRF structures generally sample the entire network at the same time and focus on designing optimal detection rules without regard to the data-acquisition process. This paper characterizes the sampling process for general MRFs, which, in conjunction with the sequential probability ratio test, is shown to be asymptotically optimal as n grows large. The critical insight in designing the sampling process is devising an information measure that captures the decisions' inherent statistical dependence over time. Furthermore, when the MRFs can be modeled by acyclic probabilistic graphical models, the sampling rule is shown to take a computationally simple form. Performance analysis for the general case is provided, and the results are interpreted in several special cases: Gaussian MRFs, non-asymptotic regimes, connection to Chernoff's rule for controlled (active) sensing, and the problem of cluster detection.
Graph learning is an inference problem of estimating the connectivity of a graph from a collection of epidemic cascades, with many useful applications in the areas of online/offline social networks, p2p networks, computer security, and epidemiology. We consider a practical scenario in which the information in cascade samples is only partially observed under the independent cascade (IC) model. For the graph learning problem, we propose an efficient algorithm that solves a localized version of computationally-intractable maximum likelihood estimation through approximations in both temporal and spatial aspects. Our algorithm iterates the operations of recovering missing time logs and inferring graph connectivity, and thereby progressively improves the inference quality. We study the sample complexity, which is the number of cascade samples required to meet a given inference quality, and show that it is asymptotically close to a lower bound, thus near-order-optimal in terms of the number of nodes. We evaluate the performance of our algorithm using five real-world social networks, whose sizes range from 20 to 900, and demonstrate that our algorithm performs better than other competing algorithms in terms of accuracy while maintaining fast running time.