Welcome to the webpage of the reading group "Stochastic Networks and Learning" organized by the Stochastic Operations Research group at the Eindhoven University of Technology!
Learning techniques are becoming increasingly relevant to the design and optimization of stochastic networks as the size and complexity of these systems increase. In this reading group, we share our findings on this topic by organizing biweekly meetings where a group member presents a paper of their choice.
This page lists the past and future papers presented at the reading group.
Organizers: Céline Comte and Sem Borst
Reinforcement learning (RL) has appealed to many researchers in recent years because of its generality: it is an approach to machine intelligence that learns to achieve a given goal through trial-and-error interactions with its environment. This paper proposes a case-based reinforcement learning algorithm (CRL) for dynamic inventory control in a multi-agent supply-chain system. Traditional time-triggered and event-triggered ordering policies remain popular because they are easy to implement, but in a dynamic environment their results may become inaccurate, causing excessive inventory (cost) or shortages. Under nonstationary customer demand, the proposed algorithm learns the S value of the (T, S) and (Q, S) inventory review methods so as to satisfy a target service level. A multi-agent simulation of a simplified two-echelon supply chain, in which the proposed algorithm is implemented, is run several times, and the results show the effectiveness of CRL for both review methods. We also consider a more general learning framework based on the proposed algorithm, which may be helpful in all aspects of supply-chain management (SCM). Hence, we suggest that well-designed "connections" should be built between CRL, multi-agent systems (MAS), and SCM.
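As a concrete illustration of the periodic-review policies mentioned in this abstract, the sketch below simulates a toy (T, S) order-up-to policy and tunes S upward until a target service level is met. The demand distribution, parameter names, and tuning loop are illustrative assumptions, not the paper's CRL algorithm.

```python
import random

def simulate_ts_policy(S, T, demand_mean, horizon, seed=0):
    """Simulate a periodic-review (T, S) order-up-to policy.

    Every T periods the inventory is raised to S; per-period demand is a
    toy uniform draw. Returns the fraction of periods in which all demand
    was met (a proxy for the service level).
    """
    rng = random.Random(seed)
    inventory = S
    served = 0
    for t in range(horizon):
        if t % T == 0:  # review epoch: order up to S
            inventory = S
        demand = rng.randint(0, 2 * demand_mean)  # illustrative demand model
        if demand <= inventory:
            served += 1
        inventory = max(0, inventory - demand)
    return served / horizon

def tune_S(target, T, demand_mean, horizon=10_000):
    """Naive search: raise S until the estimated service level meets `target`."""
    S = demand_mean
    while simulate_ts_policy(S, T, demand_mean, horizon) < target:
        S += 1
    return S
```

A learned policy would replace the naive search in `tune_S` with an update rule driven by observed outcomes.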
This paper studies strategic interaction in networks. We focus on games of strategic substitutes and strategic complements, and, departing from the previous literature, we do not assume particular functional forms for players' payoffs. By exploiting variational methods, we show that the uniqueness, the comparative statics, and the approximation of a Nash equilibrium are determined by a precise relationship between the lowest eigenvalue of the network, a measure of players' payoff concavity, and a parameter capturing the strength of the strategic interaction among players. We apply our framework to the study of aggregative network games, games of mixed interactions, and Bayesian network games.
Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error.
In this work we develop a framework for augmenting online algorithms with a machine learned oracle to achieve competitive ratios that provably improve upon unconditional worst case lower bounds when the oracle has low error. Our approach treats the oracle as a complete black box, and is not dependent on its inner workings, or the exact distribution of its errors.
We apply this framework to the traditional caching problem -- creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead, we show how to modify the Marker algorithm to take the oracle's predictions into account, and prove that this combined approach achieves a competitive ratio that both (i) decreases as the oracle's error decreases, and (ii) is always capped by O(log k), which can be achieved without any oracle input. We complement our results with an empirical evaluation of our algorithm on real-world datasets, and show that it performs well empirically even using simple off-the-shelf predictions.
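For context, the sketch below implements the classic randomized Marker algorithm without oracle input, i.e., the baseline that the paper modifies. The function name and interface are illustrative.

```python
import random

def marker(requests, k, seed=0):
    """Classic randomized Marker algorithm for a cache of size k.

    On a hit, the page is marked. On a miss with a full cache, a uniformly
    random *unmarked* page is evicted; when every cached page is marked, a
    new phase begins and all marks are cleared. Returns the number of
    misses. The paper's contribution is a modification that biases
    evictions using a learned oracle; only the baseline is shown here.
    """
    rng = random.Random(seed)
    cache, marked = set(), set()
    misses = 0
    for page in requests:
        if page in cache:
            marked.add(page)
            continue
        misses += 1
        if len(cache) >= k:
            if not (cache - marked):  # all pages marked: start a new phase
                marked.clear()
            victim = rng.choice(sorted(cache - marked))
            cache.discard(victim)
        cache.add(page)
        marked.add(page)
    return misses
```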
The successful launch of electric vehicles (EVs) depends critically on the availability of convenient and economical charging facilities. The problem of scheduling large-scale charging of EVs by a service provider is considered. A Markov decision process model is introduced in which EVs arrive randomly at a charging facility with random demand and completion deadlines. The service provider faces random charging costs, convex non-completion penalties, and a peak power constraint that limits the maximum number of simultaneously active EV chargers.
Formulated as a restless multi-armed bandit problem, the EV charging problem is shown to be indexable. A closed-form expression of the Whittle index is obtained for the case when the charging costs are constant. The Whittle index policy, however, is not optimal in general. An enhancement of the Whittle index policy based on spatial interchange according to the less-laxity and longer-processing-time principle is presented. The proposed policy outperforms existing charging algorithms, especially when the charging costs are time varying.
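To illustrate the structure of an index policy under a peak power constraint, the sketch below activates at most M chargers per step, chosen greedily by descending priority score. The laxity-based score is an illustrative stand-in, not the paper's closed-form Whittle index.

```python
def index_policy(jobs, M, index):
    """One scheduling step of a priority-index policy under a peak-power cap.

    `jobs` is a list of (remaining_demand, remaining_time) pairs; `index`
    maps such a pair to a priority score. At most M chargers are activated,
    chosen greedily by descending index; the selected job positions are
    returned in sorted order. This mirrors the structure of a Whittle-index
    policy, with an illustrative scoring function.
    """
    scored = sorted(enumerate(jobs), key=lambda kv: index(*kv[1]), reverse=True)
    return sorted(i for i, _ in scored[:M])

def laxity_index(remaining_demand, remaining_time):
    # Less laxity (time slack) => more urgent => higher priority.
    return -(remaining_time - remaining_demand)
```

For example, `index_policy([(3, 5), (4, 4), (1, 9), (2, 3)], 2, laxity_index)` activates the two jobs with the least slack.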
We formulate a model of sequential decision making, dubbed the Goal Prediction game, to study the extent to which an overseeing adversary can predict the final goal of an agent who tries to reach that goal quickly, through a sequence of intermediate actions. Our formulation is motivated by the increasing ubiquity of large-scale surveillance and data collection infrastructures, which can be used to predict an agent’s intentions and future actions, despite the agent’s desire for privacy.
Our main result shows that with a carefully chosen agent strategy, the probability that the agent’s goal is correctly predicted by an adversary can be made inversely proportional to the time that the agent is willing to spend in reaching the goal, but cannot be made any smaller than that. Moreover, this characterization depends on the topology of the agent’s state space only through its diameter.
Recently, efforts were made to standardize Signal Phase and Timing (SPaT) messages. Such messages contain the current signal phase with a prediction of the corresponding residual time for all approaches of a signalized intersection. Hence, the information can be utilized for the motion planning of human-driven or autonomously operated individual or public transport vehicles, leading to a more homogeneous traffic flow and a smoother speed profile. Unfortunately, adaptive signal control systems make it difficult to predict the SPaT information accurately. In this paper, we propose a novel machine learning approach to forecast the time series of residual times. A prediction framework that utilizes a Random Survival Forest (RSF) and a Long Short-Term Memory (LSTM) neural network is implemented. The machine learning models are compared to a Linear Regression (LR) model. As a proof of concept, the models are applied to a case study in the city of Zurich. Results show that the machine learning models outperform the LR approach, and in particular, the LSTM neural network is a promising tool for the enhancement of SPaT messages.
Novel advanced policy gradient (APG) methods, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), have become dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for notoriously difficult queueing network control problems is a Markov decision problem (MDP) with three features: infinite state space, unbounded costs, and a long-run average cost objective. We extend the theoretical framework of these APG methods to such MDPs. The resulting PPO algorithm is tested on a parallel-server system and on large multiclass queueing networks. The algorithm consistently generates control policies that outperform state-of-the-art heuristics from the literature under a variety of load conditions, from light to heavy traffic. These policies are demonstrated to be near-optimal when the optimal policy can be computed.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function via sampling. First, we use a discounted relative value function as an approximation of the relative value function. Second, we propose regenerative simulation to estimate the discounted relative value function. Finally, we incorporate the approximating martingale-process method into the regenerative estimator.
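The regenerative-simulation idea can be illustrated on a toy single-server queue: average the discounted cost accrued from a starting state until the chain first returns to the empty (regeneration) state. The queue model and parameter names below are illustrative assumptions, not the paper's multiclass-network setup or full PPO critic.

```python
import random

def discounted_cost_to_regeneration(x0, gamma, p_arrival, cost,
                                    n_runs=2000, seed=0):
    """Regenerative estimate of a discounted value at state x0.

    Toy discrete-time single-server queue: the queue length increases by 1
    with probability p_arrival and otherwise decreases by 1 (when nonempty).
    Each run accumulates gamma-discounted per-state costs from x0 until the
    chain hits the empty state; the estimate averages over independent runs.
    Requires p_arrival < 0.5 so that regeneration occurs almost surely.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        x, disc, acc = x0, 1.0, 0.0
        while x > 0:
            acc += disc * cost(x)
            x += 1 if rng.random() < p_arrival else -1
            disc *= gamma
        total += acc
    return total / n_runs
```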
Consider n random variables forming a Markov random field (MRF). The true model of the MRF is unknown, and it is assumed to belong to a binary set. The objective is to sequentially sample the random variables (one-at-a-time) such that the true MRF model can be detected with the fewest number of samples, while in parallel, the decision reliability is controlled. The core element of an optimal decision process is a rule for selecting and sampling the random variables over time. Such a process, at every time instant and adaptively to the collected data, selects the random variable that is expected to be most informative about the model, rendering an overall minimized number of samples required for reaching a reliable decision. The existing studies on detecting MRF structures generally sample the entire network at the same time and focus on designing optimal detection rules without regard to the data-acquisition process. This paper characterizes the sampling process for general MRFs, which, in conjunction with the sequential probability ratio test, is shown to be optimal in the asymptote of large n. The critical insight in designing the sampling process is devising an information measure that captures the decisions' inherent statistical dependence over time. Furthermore, when the MRFs can be modeled by acyclic probabilistic graphical models, the sampling rule is shown to take a computationally simple form. Performance analysis for the general case is provided, and the results are interpreted in several special cases: Gaussian MRFs, non-asymptotic regimes, connection to Chernoff's rule to controlled (active) sensing, and the problem of cluster detection.
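The sequential probability ratio test with which the sampling rule is paired can be sketched as follows, using Wald's classical stopping thresholds. The adaptive rule for choosing *which* variable to sample next, which is the paper's contribution, is not shown.

```python
import math

def sprt(log_lrs, alpha=0.01, beta=0.01):
    """Sequential probability ratio test on a stream of log-likelihood ratios.

    Accumulates log-LR terms and stops at the classical Wald thresholds
    A = log((1 - beta) / alpha) and B = log(beta / (1 - alpha)). Returns
    ("H1", n) or ("H0", n), with n the number of samples used, or
    ("undecided", n) if the stream is exhausted first.
    """
    A = math.log((1 - beta) / alpha)
    B = math.log(beta / (1 - alpha))
    s, n = 0.0, 0
    for n, term in enumerate(log_lrs, start=1):
        s += term
        if s >= A:
            return ("H1", n)
        if s <= B:
            return ("H0", n)
    return ("undecided", n)
```

With alpha = beta = 0.01 the thresholds are roughly ±4.6, so a stream of unit log-LR terms stops after five samples.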
Graph learning is the inference problem of estimating the connectivity of a graph from a collection of epidemic cascades, with many useful applications in online/offline social networks, p2p networks, computer security, and epidemiology. We consider a practical scenario in which cascade samples are only partially observed under the independent cascade (IC) model. For the graph learning problem, we propose an efficient algorithm that solves a localized version of the computationally intractable maximum likelihood estimation through approximations in both temporal and spatial aspects. Our algorithm iterates between recovering missing time logs and inferring graph connectivity, thereby progressively improving the inference quality. We study the sample complexity, i.e., the number of cascade samples required to meet a given inference quality, and show that it is asymptotically close to a lower bound, and thus near-order-optimal in terms of the number of nodes. We evaluate the performance of our algorithm on five real-world social networks, whose sizes range from 20 to 900 nodes, and demonstrate that it outperforms competing algorithms in accuracy while maintaining a fast running time.
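For background, a single cascade in the independent cascade (IC) model can be simulated as below; the paper's algorithm works in the opposite direction, inferring the edges from many partially observed cascades. The function name and interface are illustrative.

```python
import random

def simulate_ic_cascade(edges, probs, seeds, rng=None):
    """Simulate one cascade in the independent cascade (IC) model.

    Each newly infected node u gets a single chance to infect each
    uninfected out-neighbor v, succeeding with probability probs[(u, v)].
    Returns a dict mapping infected nodes to their infection times; a
    partially observed sample would hide some of these time logs.
    """
    rng = rng or random.Random(0)
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    infected = {s: 0 for s in seeds}
    frontier = list(seeds)
    t = 0
    while frontier:
        t += 1
        nxt = []
        for u in frontier:
            for v in out.get(u, []):
                if v not in infected and rng.random() < probs[(u, v)]:
                    infected[v] = t
                    nxt.append(v)
        frontier = nxt
    return infected
```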
Once or twice a month, a different group member gives a lecture on a document (research paper, survey paper, book chapter...) of their choice. The other group members are encouraged to ask questions and initiate discussions during the lecture. The average workload per member is very light, since each member presents a paper only about once every [number of participants]/2 months.
The objective is to make the presentation as interactive as possible despite the online format. Therefore, we propose that the speaker uses one of the following two media: