User Tools

Site Tools


COST Action 15140: ImAppNIO - Working Group 3: Benchmarks - Open Discussion Wiki


Working Group 3 of the COST Action 15140: ImAppNIO is tasked with the development of practice-driven theory-affine benchmarks. Bringing together the perspectives which respectively dominate the efforts of WG1 and WG2 its task is to develop useful benchmark for nature-inspired search and optimisation heuristics with a strong focus on discrete search spaces and discrete optimisation problems making sure that the developed benchmarks are relevant from a practical perspective and accessible from a theoretical perspective.

It will pursue the same goals for discrete nature-inspired search and optimisation heuristics as the COmparing Continuous Optimisers (COCO) initiative for continuous optimisers. It will establish a workshop and the main international conferences in the field such as Genetic and Evolutionary Computation Conference - GECCO and International Conference on Parallel Problem Solving from Nature – PPSN. The focus will be on competitions based on these workshops. This will serve as a test bed for the results of WG1, a source of inspiration for WG2 and an important and highly visible focus point for anyone interest in the application of nature-inspired search and optimisation heuristics. The benchmarks that are developed will be made publicly available as part of the COST Action's Internet website and WG3 is responsible for providing and maintaining this.

Working Group 3 uses this Wiki to develop, discuss, and present contents in an open way.

PART A: Identifying the benchmark functions

Characteristics of the Sought Benchmark Problems

Following the spirit of COCO, the challenge is that the Benchmark functions should capture the difficulties of combinatorial problems in practice but at the same time be comprehensible such that algorithm behaviours can be understood or interpreted according to the performance on a given benchmark problem.

Ideally (not necessarily for all), we would like the benchmark functions to be:

1) scalable with the problem size;

2) to be non-trivial in the black box optimisation sense (the function may be shifted such that the global optimum may be any point).

While the challenge may be significant, especially for classical combinatorial optimisation problems (not so much for toy problems), achieving this goal would help greatly in bridging the gap between theoreticians and experimentalists.

Suggestions by Thomas Weise

I think there can be several approaches.

1. one could use some of the hardest instances of several classical problems and try to somehow standardize them.

2. one could try to develop new real-world problems (such as the Traveling Thief Problem by Markus Wagner).

3. one could try to develop a problem that can represent different hard aspects (such as NK or NKp landscapes).

Concerning Point 3, I have also done some work in this direction. Characterising what features make optimization problems hard and developing a model where these features can be tuned separately. I would suggest to include some settings of this function in the final test bed, as they are easy to understand, easy to implement, and fast to compute.

Suggestions by Markus Wagner

We in Adelaide have a bit of experience with interconnected problems, in particular, with an academic problem called the Travelling Thief Problem (TTP). This is a combination of two well-known combinatorial problems: the Knapsack Problem (KP) and the Travelling Salesperson Problem (TSP). The two are combined (among other) using a so-called rent rate R that needs to be paid for the time travelled - also, the speed reduces with increasing load, so packing maximally results in great slowdown and thus reduced overall objective score:

objective score = profit of items - R * travel time

Advantages regarding scalability and adjustability:

1. By changing R, we can shift the influence from one component to the other component: R=0 leaves us with the KP and R=large leaves us with the TSP.

2. If the tour is fixed and one is left with the knapsack problem, then Neumann and Polyakovskiy called this “Packing while Travelling” (PWT), for which a DP and a FPTAS exist. Thus, the search space is just the KP space.

3. There are 9720 instance online from with 51 to 85k cities and with 50 to 850k items ( For the tiniest instances, exact approaches are computationally feasible.

4. The way the instances (including R) are defined gives us at least 1 solution with objective score 0 (by considering “optimal TSP tour” and “maximum KP profit”). There are no guarantees beyond this.

5. Multi-objective considerations are straight-forward if necessary in the future (and are already being investigated), e.g., objective

6. A relaxed version has been presented at GECCO 2016 (Multiple Traveling Thieves Problem), where it is no longer necessary to visit all cities and it is possible to have multiple thieves.

The overall setting is most likely sitting at the “quite complex” end of the spectrum of functions that the WG3 might be after. Markus' perceived order of difficulty (increasing f.l.t.r.): PWT < TTP < MTTP.

A current article with an overview and comparison of relevant approaches: (Springer:

Suggestions by Thomas Stuetzle

I think SAT and TSP could be some problems but this very likely at least for SAT leads to duplication, as in the SAT community there is already since long times regular efforts to evaluate complete SAT solvers and local search solvers in the SAT competitions (I think they are annual or every two years).

For TSP there was the TSP challenge by David Johnson et al. I think for TSP it would make sense to try with new benchmarks.

For both concerns I would like to draw your attention to the following:

“These benchmark functions should capture the difficulties of combinatorial optimization problems in practice, but at the same time be comprehensible so that the resulting algorithm behaviors can be understood or interpreted.”


“The sought benchmark functions should ideally be scalable with the problem size and non-trivial in the black-box optimization sense, i.e., allow for shifting the optimum to any point.”

If we do not get instances at least as hard as in these competitions [SAT,TSP] and algorithms that are competitive with those already used in these competitions, I don’t see what particular type of impact one would make. Obviously, it is interesting to study algorithm behavior and so on but I think the most interesting is to study that of the high-performing ones (as is also typically done in the BBOB case!).

I think for each of the benchmarks one should give some baseline algorithms that are publicly available and that people can use to see whether they could compete (if this is a goal).

By the way, when doing such things seriously on combinatorial optimization, one notices that BBOB is so much easier, as there is only one problem with many instances, while for covering combinatorial optimization one would have to cover many problems each with many instances and often specialized algorithms.

The added difficulty is that in combinatorial optimization usually what counts is the computation time not function evaluations. So one needs a very good surrogate for computation time, conversion of computation times or evaluate all algorithms on a same platform.

As a side thing: I’d exclude polynomially solvable problems (at least those where n log n or also n^2 algorithms exist) except one goes to very large instances.

Quite a few people would like also to have easy problems to show when some algorithms or parameters are really bad.

I think it is fine to do so when this is well motivated (which would be the case in what you mention). However, I would then avoid to make a challenge on solving well polynomial problems if this has now positive effect on the original NP-hard ones as this can easily de-generate in a pseudo-competition (at least if one does it only experimentally—in occasional situations I would now exclude that something interesting results ..)

Suggestions by Hernan Aguirre (On Twitter)

Kauffman's NK-landscapes for single-objective optimization and its extensions to multi- and many-objectives (MNK-landacapes and rhoMNK-landscapes) offer the possibility of testing on sub-classes of problems that require binary representation. Are a must for serious benchmarking.

Suggestions by Dirk Sudholt

There are a lot of well-studied problems with a property called bit-flip symmetry: flipping all bits gives a solution of the same fitness. This means that there are always at least two global optima. Such problems have been popular as search algorithms need to break the symmetry between good solutions [1].

Well-known examples include the function TwoMax := \max\{\sum_{i=1}^n x_i, \sum_{i=1}^n (1-x_i)\} [1], which has been used as a challenging test bed in theoretical studies of diversity-preserving mechanisms [2-4]. The function H-Iff (Hierarchical If and only If) [5] consists of hierarchical building blocks that need to attain equal values in order to contribute to the fitness. It was studied theoretically in [6,7] and is frequently used in empirical studies, see, e.\,g. [8,9].

In terms of classical combinatorial problems, the Graph Colouring problem asks for an assignment of colours to vertices such that no two adjacent vertices share the same colour. For two colours, a natural setting is to use a binary encoding for the colours of all vertices and to maximise the number of bichromatic edges (edges with differently coloured end points). A closely related setting is that of simple Ising models, where the goal is to minimise the number of bichromatic edges. For bipartite (that is, 2-colourable) graphs, this is identical to maximising the number of bichromatic edges as inverting one set of the bipartition turns all monochromatic edges into bichromatic ones and vice versa. An advantage of the Ising model setting is that good quality solutions are easily recognisable: subgraphs with equal colours contribute well to the fitness. So I suggest to add the Ising model to the benchmark suite, but to keep in mind that it's essentially equivalent to the classical Graph Coloring problem. The following graph classes are of particular interest; all of them are scalable, subject to mild constraints (square numbers, n being one less than a power of 2):

  1. Ring/cycle graphs [10] can be optimised by the (1+1)EA in expected time O(n^3) and by a GA with crossover and fitness sharing in expected time O(n^2) [10]. The landscape contains plateaus that are easy to overcome.
  2. Toroids with side lengths sqrt(n)*sqrt(n) [11] are more difficult to optimise as they contain difficult local optima. Nevertheless, a Metropolis algorithm succeeds expected polynomial time [11].
  3. Complete binary trees [12] contain many local optima, some of which are extremely difficult to escape from (e.g. a root with differently coloured subtrees). While mutation-only EAs need exponential time in expectation, a (2+2)GA with fitness sharing succeeds in expected time O(n^3) [12].

[1] Goldberg, D. E., Van Hoyweghen, C., and Naudts, B. (2002). From TwoMax to the Ising model: Easy and hard symmetrical problems. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ’02), pages 626–633. Morgan Kaufmann.
[2] Oliveto, P. S., Sudholt, D., and Zarges, C. (2018+). On the benefits and risks of using fitness sharing for multimodal optimisation. Theoretical Computer Science. To appear.
[3] Covantes Osuna, E. and Sudholt, D. (2018+a). On the runtime analysis of the clearing diversity-preserving mechanism. Evolutionary Computation. To appear.
[4] Covantes Osuna, E. and Sudholt, D. (2018b). Runtime analysis of probabilistic crowding and restricted tournament selection for bimodal optimisation. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2018), pages 929–936. ACM.
[5] Watson, R. A., Hornby, G. S., and Pollack, J. B. (1998). Modeling building-block interdependency. In Parallel Problem Solving from Nature — PPSN V, pages 97–106. Springer Berlin Heidelberg.
[6] Dietzfelbinger, M., Naudts, B., Hoyweghen, C. V., and Wegener, I. (2003). The analysis of a recombinative hill-climber on H-IFF. IEEE Transactions on Evolutionary Computation, 7(5):417–423.
[7] Goldman, B. W. and Sudholt, D. (2016). Runtime analysis for the parameter-less population pyramid. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016), pages 669–676. ACM.
[8] Thierens, D. and Bosman, P. A. N. (2013). Hierarchical problem solving with the linkage tree genetic algorithm. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO 2013), pages 877–884. ACM.
[9] Goldman, B. W. and Punch, W. F. (2015). Fast and efficient black box optimization using the parameter-less population pyramid. Evolutionary Computation, 23(3):451–479.
[10] Fischer, S. and Wegener, I. (2005). The one-dimensional Ising model: Mutation versus recombination. Theoretical Computer Science, 344(2–3):208–225.
[11] Fischer, S. (2004). A polynomial upper bound for a mutation-based algorithm on the two-dimensional Ising model. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ’04), pages 1100–1112. Springer.
[12] Sudholt, D. (2005). Crossover is provably essential for the Ising model on trees. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ’05), pages 1161–1167. ACM Press.

PART B: Evaluating Algorithm performance

Apart from defining an initial set of benchmark functions appropriate ways to compare algorithm performance will have to be agreed upon.

Suggestions by Thomas Weise

Traditionally one would let the algorithms run for a fixed amount of time (which may be the number of fitness function evaluations) and then look at the end result. It is well understood that this may not be the best method. Instead, several measurements should be collected to document the progress of each run. In our work on the TSP Suite, we built a software along the lines of COCO, but for the TSP.

A “complete” benchmark, should also prescribe how the measurements should be stored (e.g., csv text file in a specific directory structure). If that can be done, than all researchers will immediately have comparable results. COCO/BBOB/TSP Suite do this implicitly by providing the experiment execution and evaluation as a single software package. As a result we avoid ending up with essentially incomparable experimental results.

APPENDIX: Wiki material of previous WG3 Leaders

Call for Benchmark Problems - How to Contribute

In our initial phase we are seeking suggestions for benchmark problems. You are invited to contribute by copying this Template Benchmark Problem page and linking it in the index below. Furthermore, feel free to augment or comment any ideas provided so far. Do not forget to subscribe to at least your own pages (under menu item ``manage subscriptions'') in order to receive email-notifications whenever colleagues augment, annotate, or change something! We hope that a fruitful discussion will emerge!

List of Suggested Benchmark Problems

Further Suggestions, Issues, or Aspects to Discuss

Please create a new page for any other ideas, suggestions, issues etc. that you want to discuss here. Add new entries at the top!

* First Suggestion (to be replaced)


WG3 Leader and Vice-Leaders: Pietro S. Oliveto, Borys Wrobel,Aleš Zamuda

Recent New Members (Since October 2017)

Jörg Lässig, Thomas Weise, Thomas Stuetzle


wg3.txt · Last modified: 2018/09/26 15:28 by sudholt