This paper presents Neural Combinatorial Optimization, a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP): given a graph, find an optimal sequence of nodes with minimal total edge weight (tour length). We train a recurrent neural network that, given a set of city coordinates, predicts a distribution over different city permutations, using a pointing mechanism to produce a distribution over the next city to visit in the tour. Using negative tour length as the reward signal, we optimize the parameters of the network with policy gradients, and we compare learning the parameters on a set of training graphs against learning them on individual test instances, under different learning configurations. Despite the computational expense, and without much engineering or heuristic design, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes, and it recovers the optimal solution in a significant number of our test cases. As an example of the flexibility of Neural Combinatorial Optimization, we also apply the method to the KnapSack problem, another NP-hard problem, and obtain optimal solutions for instances with up to 200 items. These results give insights into how neural networks can be used as a general tool for tackling combinatorial optimization problems, applicable across many optimization tasks by automatically discovering their own heuristics. We report the average tour lengths of our approaches on TSP20, TSP50, and TSP100.

An alternative is supervised learning: the TSP can also be addressed by training a pointer network with a loss function comprising the conditional log-likelihood, which factors into cross-entropy terms, as done by (Vinyals et al., 2015b). We suspect, however, that learning from optimal tours is undesirable for NP-hard problems, since the optimal tour π∗ for a difficult graph is itself expensive to obtain. Instead, we consider two approaches based on policy gradients (Williams, 1992). The first, RL pretraining, uses a training set of graphs to optimize the stochastic policy pθ, with a parametric baseline to estimate the expected tour length from the actual tour lengths sampled by the most recent policy. The second, Active Search, refines the parameters of the stochastic policy pθ during inference to minimize Eπ∼pθ(⋅∣s)L(π∣s) on a single test input s; for each test instance, we can initialize the model parameters from a pretrained RL model, and this approach proves especially competitive. Both can be combined with sampling, i.e., drawing B i.i.d. candidate tours per graph and selecting the best.
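To make the reward signal concrete, here is a minimal NumPy sketch of the tour-length computation; the function name, the toy instance, and the random tour are our own illustration, not code from the paper.

```python
import numpy as np

def tour_length(coords, tour):
    """L(pi|s): total Euclidean length of the closed tour pi over cities s."""
    steps = coords[np.roll(tour, -1)] - coords[tour]  # edge from each city to the next
    return np.linalg.norm(steps, axis=1).sum()

rng = np.random.default_rng(0)
coords = rng.random((20, 2))         # a TSP20 instance in the unit square
tour = rng.permutation(20)           # an arbitrary permutation of the cities
reward = -tour_length(coords, tour)  # negative tour length as the reward signal
print(reward)
```

Because evaluating a tour length is inexpensive, the agent can score every sampled tour exactly, which is what makes the reinforcement learning formulation practical here.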
A canonical example of combinatorial optimization is the traveling salesman problem (TSP). Given an input graph, represented as a sequence of n cities in a two-dimensional space, the task is to find a permutation of the points π, termed a tour, that visits each city once and has minimum total length. We define the length of a tour defined by a permutation π as L(π∣s) = ∥xπ(n) − xπ(1)∥ + ∑_{i=1}^{n−1} ∥xπ(i) − xπ(i+1)∥. Even the Euclidean Travelling Salesman Problem is NP-complete.

In practice, TSP solvers rely on handcrafted heuristics that guide their search procedures to find competitive tours efficiently. Christofides (1976) proposes a heuristic that constructs a tour from a minimum-spanning tree and a minimum-weight perfect matching; its solutions are obtained in polynomial time and guaranteed to be within a factor of 1.5× to optimality in the metric case. Local search algorithms apply a specified set of local move operators, including 2-opt (Johnson, 1990) and a version of the Lin-Kernighan heuristic (Lin & Kernighan, 1973), stopping when they reach a local minimum; metaheuristics then propose uphill moves to escape local optima, a popular choice for the TSP and its variants being guided local search (Voudouris & Tsang, 1999), which moves out of a local minimum by penalizing particular solution features (see (Burke et al., 2013) for a survey). Nevertheless, state-of-the-art TSP solvers, thanks to such carefully crafted heuristics, can handle instances with thousands of nodes: Concorde, which implements the Dantzig-Fulkerson-Johnson formulation with branch-and-cut for large-scale symmetric traveling salesman problems and prunes parts of the search space, provably solves instances to optimality, but comes at the cost of longer running times. OR-Tools, a generic toolbox for combinatorial optimization, offers a vehicle routing solver that can tackle a superset of the TSP; while not state-of-the-art for the TSP, it is a common choice for general routing problems and operates at a higher level of generality than solvers that are highly specific to the TSP. The difficulty in applying existing search heuristics to newly encountered problems - or even new instances of a similar problem - is a well-known challenge that stems from the No Free Lunch theorem (Wolpert & Macready, 1997).

The application of neural networks to combinatorial optimization has a distinguished history, where the majority of research focuses on the TSP. One of the earliest proposals is the use of Hopfield networks (Hopfield & Tank, 1985): the authors modify the network's energy function to make it equivalent to a TSP objective and use Lagrange multipliers to penalize the violations of the problem's constraints, although this approach is sensitive to parameter initialization, as analyzed by (Wilson & Pawley, 1988). Perhaps most prominent among later proposals is the invention of Elastic Nets as a means to solve TSP (Durbin, 1987), together with self-organizing feature maps, which solve the combinatorial problem via a self-organizing process as an application of the Kohonen algorithm to the traveling salesman problem; both are based on using deformable template models to solve TSP, and there is further work in this area (Burke, 1994; Favata & Walker, 1991; Vakhutinsky & Golden, 1995). More recently, (Vinyals et al., 2015b) introduce the pointer network, a sequence-to-sequence model in which a recurrent network with non-parametric softmaxes points to parts of the input sequence, very much like (Bahdanau et al., 2015), and train it with supervision on the TSP. Despite architectural improvements, models of this kind were trained using supervised signals given by an approximate solver. Related efforts include the paper "Learning Combinatorial Optimization Algorithms over Graphs"; NeuRewriter, which combines a region-picking and a rule-picking component, each parameterized by a neural network trained with actor-critic methods in reinforcement learning, captures the general structure of combinatorial problems, and shows strong performance in three versatile tasks (expression simplification, online job scheduling, and vehicle routing problems); exploratory combinatorial optimization with reinforcement learning (Barrett et al., AAAI-20); and a two-phase neural combinatorial optimization method with reinforcement learning for the AEOS scheduling problem, whose first phase selects a set of possible acquisitions and provides a permutation of them. There is also a broader "learning to optimize" view: existing continuous optimization algorithms generally work in an iterative fashion and maintain some iterate, a point in the domain of the objective function, so by learning the weights of a neural net we can learn an optimization algorithm itself, as in learning to learn for global optimization of black-box functions. Neural approaches aspire to circumvent the worst-case complexity of NP-hard problems by only focusing on instances that appear in the data distribution.
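As a point of reference for the local search baselines above, here is a compact 2-opt sketch; the function names, the naive stopping rule, and the toy instance are our own illustration under the description given in the text, not the paper's code.

```python
import numpy as np

def tour_length(coords, tour):
    steps = coords[np.roll(tour, -1)] - coords[tour]
    return np.linalg.norm(steps, axis=1).sum()

def two_opt(coords, tour):
    """Apply 2-opt moves (segment reversals) until no move shortens the tour."""
    tour = np.asarray(tour)
    improved = True
    while improved:                                   # stop at a local minimum
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour.copy()
                cand[i:j + 1] = cand[i:j + 1][::-1]   # the 2-opt move
                if tour_length(coords, cand) < tour_length(coords, tour):
                    tour, improved = cand, True
    return tour

rng = np.random.default_rng(0)
coords = rng.random((20, 2))
print(tour_length(coords, two_opt(coords, rng.permutation(20))))
```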
In Neural Combinatorial Optimization, the model architecture is tied to the given combinatorial optimization problem. For the TSP, our neural network architecture uses the chain rule to factorize the probability of a tour as pθ(π∣s) = ∏_{j=1}^{n} p(π(j)∣π(<j), s); we are inspired by previous work (Sutskever et al., 2014) that makes use of the same factorization based on the chain rule to address sequence-to-sequence problems. The network thus parameterizes a stochastic policy over solutions.

One way to train such a policy, as in (Vinyals et al., 2015b), is supervised learning from example tours. Learning from examples in such a way is undesirable for NP-hard problems because (1) the performance of the model is tied to the quality of the supervised labels, (2) getting high-quality labeled data can be a challenge in itself, as one does not generally have access to optimal labels, and (3) one cares more about finding a competitive solution than about replicating the results of another algorithm. Moreover, when optimizing a supervised mapping on a fixed dataset, the generalization is rather poor compared to an RL agent that explores different tours and observes their corresponding rewards, and we suspect that learning from optimal tours is harder for supervised pointer networks due to subtle features that the model cannot pick up from the targets alone; indeed, our supervised learning results are not as good as those reported in (Vinyals et al., 2015b).

By contrast, we believe Reinforcement Learning (RL) provides an appropriate paradigm for training neural networks for combinatorial optimization, especially because these problems have relatively simple reward mechanisms that could even be used at test time: one can evaluate a candidate solution and provide some reward feedback to a learning algorithm. We hence propose to use model-free, policy-based Reinforcement Learning to optimize the parameters of a pointer network denoted θ. Value-based methods are less appealing here, since finding the best next action given a value function of arbitrary complexity is nontrivial when the action space is too large for enumeration.

Our training objective is the expected tour length, Eπ∼pθ(⋅∣s)L(π∣s). We resort to policy gradient methods and stochastic gradient descent, with the gradient formulated using the well-known REINFORCE algorithm (Williams, 1992), where b(s) denotes a baseline function that does not depend on π and estimates the expected tour length to reduce the variance of the gradients. Using a parametric baseline to estimate the expected tour length typically improves learning. Therefore, we introduce an auxiliary critic network that learns the expected tour length of an input graph, trained with a mean squared error objective between its predictions bθv(s) and the actual tour lengths sampled by the most recent policy. The total training objective involves sampling from a distribution of graphs S: by drawing B i.i.d. sample graphs s1, s2, …, sB ∼ S and sampling a single tour per graph, i.e. πi ∼ pθ(⋅∣si), the gradient in (2) is approximated with Monte Carlo sampling, as written on the RHS of (2).
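Written out, the quantities referenced above are the following; this is a reconstruction of the standard REINFORCE formulation the text cites, with the equation numbering chosen to match the text's references to Eq. (2).

```latex
% Expected tour length and its REINFORCE gradient (Williams, 1992)
J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\, L(\pi \mid s)

\nabla_\theta J(\theta \mid s) =
  \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}
  \left[ \big(L(\pi \mid s) - b(s)\big)\,
         \nabla_\theta \log p_\theta(\pi \mid s) \right]          \tag{1}

% Monte Carlo approximation over B i.i.d. graphs, one sampled tour each
\nabla_\theta J(\theta) \approx
  \frac{1}{B} \sum_{i=1}^{B} \big(L(\pi_i \mid s_i) - b_{\theta_v}(s_i)\big)\,
  \nabla_\theta \log p_\theta(\pi_i \mid s_i)                      \tag{2}

% Critic trained by mean squared error against sampled tour lengths
\mathcal{L}(\theta_v) = \frac{1}{B} \sum_{i=1}^{B}
  \left\| b_{\theta_v}(s_i) - L(\pi_i \mid s_i) \right\|_2^2
```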
In the Neural Combinatorial Optimization (NCO) framework, a heuristic is parameterized using a neural network to obtain solutions for many different combinatorial optimization problems without hand-engineering. Concretely, the policy is implemented by a recurrent neural network (RNN): an encoder reads the input graph as a sequence of city coordinates, each embedded into a d-dimensional representation that is obtained via a linear transformation of xi shared across all input steps, and a decoder produces the tour one city at a time. A vanilla sequence-to-sequence model would address the TSP with an output vocabulary of {1, 2, …, n}. However, there are two major issues with this approach: (1) networks trained in this fashion cannot generalize to inputs with more than n cities, and (2) one needs to have access to ground-truth output permutations to optimize the parameters with conditional log-likelihood.

For generalization beyond a pre-specified graph size, we follow the approach of (Vinyals et al., 2015b) and use a set of non-parametric softmax modules. This approach, named pointer network, allows the model to effectively point to a specific position in the input sequence rather than predicting an index in a fixed vocabulary, so the same parameters apply to any number of cities. Our attention function, formally defined in Appendix A.1, takes as input a query vector and a set of reference vectors and predicts a distribution over the references; its computations are parameterized by two attention matrices. We clip the resulting logits with C tanh(⋅), where C is a hyperparameter that controls the range of the logits and hence the entropy of the pointing distribution. Once the next city is selected, it is passed as the input to the next decoder step.

Our training algorithm is closely related to the asynchronous advantage actor-critic (A3C), as the difference between the sampled tour length and the critic's prediction is an unbiased estimate of the advantage function. We perform updates asynchronously across multiple workers, but each worker also handles a mini-batch of graphs for better gradient estimates.
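The sketch below illustrates the pointing computation with logit clipping; the parameter names (W_ref, W_q, v, C) follow the notation above, but the concrete shapes and toy values are our own assumptions.

```python
import numpy as np

def pointing_distribution(ref, q, W_ref, W_q, v, C=10.0):
    """u_i = v^T tanh(W_ref r_i + W_q q); logits clipped to [-C, C] before softmax."""
    u = np.tanh(ref @ W_ref.T + q @ W_q.T) @ v  # one raw logit per reference vector
    u = C * np.tanh(u)                          # C bounds the logits, hence the entropy
    e = np.exp(u - u.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, k = 8, 5                                     # hidden size, number of references
ref, q = rng.normal(size=(k, d)), rng.normal(size=d)
W_ref, W_q, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
for C in (1.0, 10.0):
    p = pointing_distribution(ref, q, W_ref, W_q, v, C)
    print(C, round(-np.sum(p * np.log(p)), 3))  # smaller C -> flatter, higher entropy
```

A smaller C squashes the logits into a narrower range, yielding a flatter pointing distribution and a less overconfident model.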
We also augment the pointing mechanism with a glimpse, following "Order matters: sequence to sequence for sets" (Vinyals et al., 2015a). Specifically, our glimpse function G(ref, q) takes the same inputs as the attention function A and is parameterized by Wgref, Wgq ∈ Rd×d and vg ∈ Rd. It performs the following computations: it first computes attention probabilities over the reference vectors; the glimpse function G then essentially computes a linear combination of the reference vectors weighted by those attention probabilities, and the result is fed back as the new query. Utilizing one glimpse in the pointing mechanism yields performance gains at an insignificant cost, but we observed empirically that glimpsing more than once with the same parameters made the model less likely to learn and barely improved the results.

Once a model is trained, it can be used in several ways at inference time, via the search strategies detailed below, which we refer to as sampling and active search. This inference process resembles how solvers search over a large set of feasible solutions. Greedy decoding simply follows the most probable city at each step. Sampling draws multiple candidate tours from our stochastic policy pθ(⋅∣s) and selects the shortest one; we do not enforce the model to sample different tours during the process. Active Search refines the parameters of the stochastic policy pθ on a single test instance, again using the expected reward objective, while keeping track of the best solution sampled during the search; we call it Active Search because the model actively updates its parameters while searching for candidate solutions. Active Search applies policy gradients similarly to the training procedure and can start either from a pretrained model or from an untrained one. We refer to those approaches as RL pretraining-Greedy, RL pretraining-Sampling, RL pretraining-Active Search, and (plain) Active Search.
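A minimal sketch of the glimpse described above, under the same assumed shapes as the pointing sketch; the variable names mirror the notation Wgref, Wgq, vg, and the toy values are ours.

```python
import numpy as np

def glimpse(ref, q, Wg_ref, Wg_q, vg):
    """G(ref, q): attention over the references, then their weighted combination."""
    u = np.tanh(ref @ Wg_ref.T + q @ Wg_q.T) @ vg  # attention logits
    p = np.exp(u - u.max())
    p /= p.sum()                                   # attention probabilities
    return p @ ref                                 # linear combination of references

rng = np.random.default_rng(0)
d, k = 8, 5
ref, q = rng.normal(size=(k, d)), rng.normal(size=d)
Wg_ref, Wg_q, vg = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
q_new = glimpse(ref, q, Wg_ref, Wg_q, vg)          # fed back as the next query
print(q_new.shape)
```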
We conduct experiments to investigate the behavior of the proposed methods on three benchmark tasks: Euclidean TSP20, 50 and 100, for which we generate test sets of 1,000 instances with points drawn uniformly at random in the unit square [0,1]². Table 1 summarizes the configurations and the different instances used for hyper-parameter tuning. The critic network encodes the input sequence s into a baseline prediction bθv(s): an LSTM encoder is followed by 3 processing steps, which perform P steps of computation over the hidden state h similarly to (Vinyals et al., 2015a), and 2 fully connected layers with respectively d and 1 unit(s).

We compare against several baselines: Christofides' algorithm; OR-Tools' local search improved with metaheuristics such as simulated annealing, tabu search, and guided local search; an effective implementation of the Lin-Kernighan heuristic; and Concorde, which provably solves instances to optimality. In addition to the described baselines, we implement and train a pointer network with supervised learning, similarly to (Vinyals et al., 2015b). Running times are reported relative to the aforementioned baselines, with our methods running on a single Nvidia GPU.

Our RL-trained model significantly outperforms the supervised learning approach to the problem. While RL pretraining-Greedy improves upon the Christofides algorithm, it suffers from not being able to reconsider earlier choices, and searching at inference time proves crucial to get closer to optimality. Table 3 compares the running times of our greedy methods with those baselines, and Figure 3 in Appendix A.3 presents the performance of the metaheuristics as they consider more solutions, together with the corresponding running times. In our experiments, Neural Combinatorial Optimization proves superior to simulated annealing. We find that both greedy approaches are time-efficient and just a few percent worse than optimality, while sampling and Active Search produce solutions that, on average, are within about 1% of optimal. When sampling, we use a temperature hyperparameter in the softmax, which makes the distribution less steep, hence preventing the model from being overconfident and increasing the stochasticity of the search; this is related to noisy parallel approximate decoding for conditional recurrent language models.
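The effect of the temperature hyperparameter is easy to see in isolation; the toy logits below are our own example.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """T > 1 makes the distribution less steep (less overconfident sampling)."""
    z = logits / T
    p = np.exp(z - z.max())
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.1])
for T in (1.0, 2.0, 5.0):
    print(T, softmax_with_temperature(logits, T).round(3))
```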
At each decoder step, the model assigns the probability of visiting the next point π(j) of the tour as follows: the attention logits are computed over all cities, and the logits of cities that already appeared in the tour are set to −∞, as described in Appendix A.1. This ensures that the model only points at cities that have yet to be visited and hence outputs valid TSP tours. The sampling process then yields significant improvements over greedy decoding, which always selects the index with the largest probability. For each test instance, we sample 1,280,000 candidate solutions from a pretrained model; as the sampling procedures are entirely parallelizable, this remains practical. For each graph, the tour found by each individual configuration is collected and the shortest tour is chosen.

We find that for small solution spaces, RL pretraining-Sampling proves superior, both when controlling for the number of sampled solutions and when controlling for running time. However, for larger solution spaces, RL pretraining-Active Search works best in practice, and it can be stopped early with a small performance penalty. Plain Active Search, started from an untrained model, also produces competitive tours but requires a considerable amount of time (respectively 7 and 25 hours per instance of TSP50/TSP100).
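A single masked decoding step can be sketched as follows; the function name and toy values are our own, but the −∞ masking is exactly the rule described above.

```python
import numpy as np

def next_city(logits, visited, greedy=True, rng=None):
    """One decoding step: mask visited cities, then pick the next city."""
    masked = np.where(visited, -np.inf, logits)  # visited cities get logit -inf
    if greedy:
        return int(np.argmax(masked))            # greedy decoding
    p = np.exp(masked - masked.max())
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))     # stochastic sampling

rng = np.random.default_rng(0)
logits = rng.normal(size=5)
visited = np.array([True, False, True, False, False])
print(next_city(logits, visited))                # never returns city 0 or 2
```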
Training details are as follows. We initialize our parameters uniformly at random within [−0.08, 0.08] and clip the L2 norm of our gradients to 1.0. We train with the Adam optimizer and decay the initial learning rate every 5000 steps by a factor of 0.96. For RL pretraining-Active Search, we initialize the model from a pretrained RL model and run Active Search for up to 10,000 training steps with a batch size of 128, keeping track of the best solution sampled during the search. Plain Active Search starts from a random policy and iteratively optimizes the RNN parameters on a single test instance; we allow the model to train much longer to account for the fact that it starts from scratch, running Active Search for 100,000 training steps on TSP20/TSP50 and 200,000 training steps on TSP100. In this setting, the mini-batches either consist of replications of the test sequence or of its permutations, and the parametric critic is replaced by an exponential moving average of the rewards obtained by the network over time, with the decay hyperparameter set to α=0.99 in Active Search.
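The sketch below ties these pieces together at toy scale: a tabular "pointer" policy (a logits matrix standing in for the RNN), REINFORCE with the moving-average baseline, and tracking of the best tour. The tabular policy, learning rate, and step count are our own simplifications, not the paper's architecture.

```python
import numpy as np

def sample_tour(theta, rng):
    """Sample a tour from a toy pointer policy; theta[i, j] is the logit for i -> j.

    Returns the tour and the gradient of log p(tour) with respect to theta.
    """
    n = theta.shape[0]
    grad = np.zeros_like(theta)
    visited = np.zeros(n, dtype=bool)
    cur, tour = 0, [0]
    visited[0] = True
    for _ in range(n - 1):
        logits = np.where(visited, -np.inf, theta[cur])   # mask visited cities
        p = np.exp(logits - logits.max())
        p /= p.sum()
        nxt = rng.choice(n, p=p)
        g = -p                       # d log softmax / d logits = onehot - probs
        g[nxt] += 1.0
        g[visited] = 0.0
        grad[cur] += g
        visited[nxt] = True
        tour.append(nxt)
        cur = nxt
    return np.array(tour), grad

def tour_length(coords, tour):
    steps = coords[np.roll(tour, -1)] - coords[tour]
    return np.linalg.norm(steps, axis=1).sum()

rng = np.random.default_rng(0)
coords = rng.random((10, 2))                      # the single test instance
theta = rng.uniform(-0.08, 0.08, (10, 10))        # init range as in the text
baseline, alpha, lr = None, 0.99, 0.1             # moving-average baseline, alpha=0.99
best_len, best_tour = np.inf, None
for step in range(2000):
    tour, grad = sample_tour(theta, rng)
    length = tour_length(coords, tour)
    if length < best_len:                         # keep the best solution seen
        best_len, best_tour = length, tour
    baseline = length if baseline is None else alpha * baseline + (1 - alpha) * length
    theta -= lr * (length - baseline) * grad      # REINFORCE step (minimize length)
print(best_len, best_tour)
```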
As a further application, consider a Bin Packing problem: given a sequence of packets (for example, services to be placed on machines), the goal of this optimization problem is to determine the bin in which each packet must be placed in order to minimize the number of total bins used. In computational complexity theory, it is a combinatorial NP-hard problem. For that purpose, an agent must be able to match each sequence of packets (e.g. services) with a placement decision: the essence of the problem is to find, for each state (service sequence), the corresponding action (placement sequence) that maximizes the reward. The action space is a high-dimensional discrete space where each dimension can take discrete values representing the bin (e.g. the placement [0,0,1,1,1] assigns the first two packets to bin 0 and the remaining three to bin 1), and the number of possible placement permutations for a service grows exponentially with the sequence length.

The approach of Neural Combinatorial Optimization is to build an agent that embeds the information of the environment in such a way that even states (service sequences) for which the agent has not been trained can point to near-optimal placements. The model picks each element of the service sequence in turn and places it, remembering only the items already located in the environment; the decoding is thus performed to behave like a first-fit-style construction. At first, the placement sequences computed are going to be random, and the agent improves as training progresses. Since the learned state embeddings belong to a high-dimensional space, a dimensionality reduction technique such as t-SNE can be used to visualize them. We have implemented the basic RL pretraining and Active Search for this setting; the model and training code in TensorFlow (Abadi et al., 2016) will be made available soon.
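For reference, the classical first-fit heuristic that this constructive decoding mimics fits in a few lines, with the episode reward taken as the negative of the bin count; the function name and toy call are our own illustration.

```python
def first_fit(items, capacity):
    """Place each item into the first bin with enough remaining capacity."""
    bins, placement = [], []
    for w in items:
        for i, used in enumerate(bins):
            if used + w <= capacity:
                bins[i] += w
                placement.append(i)
                break
        else:                      # no open bin fits: open a new one
            bins.append(w)
            placement.append(len(bins) - 1)
    return placement, len(bins)

placement, n_bins = first_fit([0.5, 0.7, 0.5, 0.2, 0.4], capacity=1.0)
print(placement, n_bins, "reward:", -n_bins)   # [0, 1, 0, 1, 2] -> 3 bins
```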
As an example of the flexibility of Neural Combinatorial Optimization, we apply the framework to the KnapSack problem, another intensively studied NP-hard problem in computer science (Kellerer et al.). Given a set of n items i = 1…n, each with weight wi and value vi, and a maximum weight capacity W, the task is to select a subset of items that maximizes the total value so that the sum of the weights is less than or equal to the knapsack capacity; with wi, vi and W taking real values, the problem is NP-hard. We encode each instance as a sequence of 2D vectors (wi, vi) and apply the pointer network: at decoding time, it points at the items to include in the knapsack and stops when the total weight of the items collected so far exceeds the weight capacity. The training procedures described in Section 4 can then be applied unchanged. We evaluate on instances with items' weights and values drawn uniformly at random in [0,1], against two simple baselines: the first baseline is the greedy weight-to-value ratio heuristic, which takes items until they fill up the weight capacity; the second baseline is random search, where we sample as many feasible solutions as seen by Active Search. Our methods obtain optimal solutions for instances with up to 200 items.

More generally, the framework adapts to other problems by choosing an appropriate network: examples of useful networks include the pointer network, when the output is a permutation or a truncated permutation or a subset of the input, and the classical sequence-to-sequence model for other structured outputs; for problems such as graph coloring, it is also possible to combine a pointer module and a softmax module. Problems with hard feasibility constraints, such as the TSP with time windows, require more care: coming up with a feasible solution can be a challenge in itself, it is not straightforward to know exactly which branches do not lead to any solution that respects all time windows, and it might be that most branches do not lead to any feasible solutions at decoding time. A simple approach, to be verified experimentally in future work, consists in augmenting the reward objective with penalties for violated constraints, assigning zero probabilities to branches that are easily identifiable as infeasible while still penalizing infeasible solutions.

In summary, Neural Combinatorial Optimization significantly outperforms the supervised learning approach and achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes, while operating at a higher level of generality than solvers that are optimized for one task only. The more profound motivation of using deep learning for combinatorial optimization is not to outperform classical approaches on well-studied problems, but to serve as a general tool that automatically discovers its own heuristics across many optimization tasks. We note that soon after our paper appeared, (Andrychowicz et al., 2016) also independently proposed a similar idea. The authors would like to thank Vincent Furnon, Oriol Vinyals, Barret Zoph, and others.
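The greedy baseline mentioned above also fits in a few lines; we interpret the "weight-to-value ratio heuristic" as taking items in order of value per unit weight, the usual convention, and the function and toy numbers are our own.

```python
def greedy_knapsack(weights, values, capacity):
    """Greedy baseline: take items by value/weight ratio while capacity allows."""
    order = sorted(range(len(weights)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    chosen, total_w, total_v = [], 0.0, 0.0
    for i in order:
        if total_w + weights[i] <= capacity:   # keep the sum of weights <= capacity
            chosen.append(i)
            total_w += weights[i]
            total_v += values[i]
    return chosen, total_v

print(greedy_knapsack([0.3, 0.5, 0.2], [0.6, 0.5, 0.4], 0.6))  # items 0 and 2
```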