Optimization is a collection of methods designed to maximize or minimize an objective function over a given domain, possibly subject to constraints. Classic optimization methods are built for relatively low-dimensional spaces and impose a specific functional form on the objective function and/or the constraints. Integer programming assumes that the objective function maps a discrete set of combinations into the real numbers; as a consequence, methods of combinatorial search are utilized. Linear programming assumes a linear objective function and a set of linear constraints; here the successful methods are a combination of calculus and combinatorial search.
Convex optimization studies convex objective functions over convex feasible sets, borrowing results from multivariate calculus and differential geometry. Convex optimization theory is very well developed, containing many closed-form solutions and robust algorithms.
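As a minimal sketch of this setting, the snippet below runs plain gradient descent on a convex quadratic whose unique minimum is also available in closed form; the matrix, vector and step size are illustrative choices, not part of any particular theory.

    # A minimal sketch of convex optimization: gradient descent on a convex
    # quadratic f(x) = 0.5*x'Ax - b'x with A symmetric positive definite,
    # whose unique minimum solves Ax = b. The numbers are illustrative only.
    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
    b = np.array([1.0, -1.0])

    x = np.zeros(2)
    step = 0.1                                # fixed step size, small enough for this A
    for _ in range(500):
        grad = A @ x - b                      # gradient of the quadratic
        x = x - step * grad

    print(x, np.linalg.solve(A, b))           # gradient descent vs. closed-form solution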
Still, many real-world problems are more interesting than fine-tuning just a few scalar parameters. You may be required to enter only one input, but that input may be a large, multi-dimensional (sometimes continuous) strategy that you will have to follow. For example, you may be required to rebalance an investment portfolio daily (or as frequently as you can) to achieve the best return-to-risk ratio at the end of the investment horizon. This task is known as Merton's problem and is a simple example of a Markov decision process (MDP). Many life situations can be formalized in terms of MDPs. In short, an MDP means that
1) at each step the system can be in one of the states,
2) in each state one of the given actions has to be taken,
3) at each step a reward is offered, which is a deterministic function of the state and the action taken,
4) the distribution of the next state is completely determined by the current state and the action taken.
If the dimensionality of the state space is low and the transition probabilities between the states are known, an MDP can be solved by methods of dynamic programming. Notable examples of dynamic programming are the value iteration and policy iteration algorithms. Essentially, dynamic programming amounts to applying the Bellman equation repeatedly to different steps and different states. If the dimensionality of the state space is high (games, marketing, manufacturing chain optimization), dynamic programming may suffer from slow execution and memory problems. Also, the transition probabilities and the reward function may be unknown and may have to be inferred from historical data. The historical data may contain complete or incomplete paths of the MDP. Or it may contain just the current, unfolding path, so estimation and strategy calculation may have to be done "on the fly". Reinforcement learning is the field that covers such situations. Reinforcement learning methods are iterative. Oftentimes they are based on basis-function approximations of the relevant value functions and/or (non-)parametric estimates of the transition probabilities. As such, reinforcement learning is a marriage between dynamic optimization and the statistical theory of nonlinear modeling.
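As a minimal illustration of value iteration, the sketch below repeatedly applies the Bellman equation to a toy two-state MDP; the states, actions, rewards, transition probabilities and discount factor are all made up for illustration, not taken from any particular problem.

    # A minimal value-iteration sketch on a toy two-state MDP.
    import numpy as np

    states  = [0, 1]
    actions = [0, 1]
    # P[s][a] = list of (next_state, probability); R[s][a] = immediate reward
    P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
         1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
    R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 0.5}}
    gamma = 0.95                               # discount factor

    V = np.zeros(len(states))
    for _ in range(1000):                      # repeated application of the Bellman equation
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions)
            for s in states
        ])
        if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the values have converged
            break
        V = V_new

    # greedy policy with respect to the converged value function
    policy = [max(actions,
                  key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
              for s in states]
    print(V, policy)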
Stepping aside from Markov decision processes, there is another interesting line of thought - evolutionary algorithms. In a way, evolutionary algorithms go one step further. They say: your input to the system does not have to be a strategy. It can be a computer program solving for this strategy! As you can see, evolutionary algorithms are pushing the boundaries of artificial intelligence a bit. In the most researched version, however, they still work with traditional input parameters, i.e. vectors and functions. The main idea of evolutionary algorithms is to perform a repeated stochastic search over possible solutions in a manner resembling natural selection. At every stage there is a population of candidate solutions, referred to as a "generation". The "fitness" of each solution equals the value of the objective function. Each solution is referred to as an "individual" and its numeric characterization is viewed as its "genes". "Reproduction", as a process of gene enhancement and enrichment, is achieved through the following sequence of steps.
1. Generate a population of size N at random. This is generation 1.
2. For each generation t repeat the following.
2.1 Sample two "parents" at random from the population, with probabilities monotonically increasing with their fitness. Create an offspring from parents' genes via the mechanism of "crossover": one or more gene segments are chosen at random and the segments are exchanged between the parents; one of the two created gene sequences is chosen as the offspring. Repeat the process C times.
2.2 Sample one "parent" at random from the population, with probability monotonically increasing with its fitness. Create an offspring from the parent's genes via the mechanism of "mutation": choose a random segment of the parent's genes and substitute it with freshly simulated values. Repeat the process M times.
2.3 Kill the Kt offspring with the lowest fitness.
2.4 Set aside the top N - C - M + Kt members of the current generation. This is the "elite" that lives into the next generation. Kill the remaining members of the current generation and substitute them with the C + M - Kt surviving offspring. A new generation has been born.
2.5 Stop if there is no improvement in the fitness of the best individuals for the last S generations.
3. Let T be the final generation. Select the fittest individual from one of the recent generations T - S + 1, ..., T. For example, use generation T - S + 1 as the last generation before the population started degenerating due to inbreeding.
There may be slight variations to the stated template in terms of generation size, elitism, stopping criterion, etc. Evolutionary algorithms allow the system to completely reinvent itself and jump in new directions. For that reason they are important tools in the search for a global maximum, whereas many other optimization methods tend to settle in local maxima. Another important property of evolutionary algorithms is that they make no assumptions about the objective function (fitness), so discrete and highly irregular functions can be studied... Evolutionary algorithms are sometimes called genetic algorithms. In the narrow sense of the term, a genetic algorithm is an evolutionary algorithm which assumes that the solutions can be represented as vectors.
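The sketch below follows the template above in its genetic-algorithm form, with real-valued genes; the toy fitness function, the sizes N, C, M, S, and the choice Kt = 0 are illustrative assumptions rather than part of the template.

    # A minimal genetic-algorithm sketch: rank-based parent selection,
    # single-point crossover, mutation, elitism and a patience-based stop.
    import random

    GENES = 10
    N, C, M, S = 50, 20, 10, 25       # population, crossover offspring, mutants, patience

    def fitness(ind):
        # toy objective: maximized when every gene equals 1
        return -sum((g - 1.0) ** 2 for g in ind)

    def sample_parent(pop):
        # selection probability increases monotonically with fitness (rank-based)
        ranked = sorted(pop, key=fitness)
        return random.choices(ranked, weights=range(1, len(ranked) + 1), k=1)[0]

    def crossover(p1, p2):
        cut = random.randrange(1, GENES)               # single-point crossover
        return p1[:cut] + p2[cut:]

    def mutate(parent):
        child = parent[:]
        child[random.randrange(GENES)] = random.uniform(-5, 5)   # resample one gene
        return child

    pop = [[random.uniform(-5, 5) for _ in range(GENES)] for _ in range(N)]
    best, stall = max(map(fitness, pop)), 0
    while stall < S:                                   # stop after S generations without improvement
        offspring  = [crossover(sample_parent(pop), sample_parent(pop)) for _ in range(C)]
        offspring += [mutate(sample_parent(pop)) for _ in range(M)]
        elite = sorted(pop, key=fitness, reverse=True)[:N - C - M]   # elitism (Kt = 0 here)
        pop = elite + offspring
        current = max(map(fitness, pop))
        stall = 0 if current > best + 1e-12 else stall + 1
        best = max(best, current)

    print(best, max(pop, key=fitness))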
While evolutionary algorithms mimic reproduction among many biological organisms, simulated annealing mimics the life path of a single biological organism. It says: while you are young, you are allowed to do crazy things, you are allowed to experiment. The purpose of that is finding some interesting, unexpected solutions to your life goals. When you grow older, however, it is time to settle. It is time to embrace the lifestyle that has proven to work well... There are also parallels with metallurgy, where controlled heating and cooling is used to make metals stronger. Simulated annealing is an iterative algorithm where the search for the global optimum is stochastic and governed by a so-called temperature schedule. At the beginning the temperature is set high and may be allowed to stay high for a while (an equivalent of heating). Sooner or later, however, the temperature starts going down and converges to 0 (cooling). Suppose our goal is maximizing a function V(x) and x_i is the current solution at the i-th iteration. We are allowed to perturb x_i slightly to produce a candidate solution x_i'. At this point simulated annealing is quite open-minded, unlike many other methods. It says: if V(x_i') > V(x_i) we will surely accept x_i' as the next iterate. But even if that does not happen, we will give x_i' the benefit of the doubt and will accept it with some probability below 1. The acceptance probability is set to be a function decreasing in the gap V(x_i) - V(x_i') and increasing in temperature. So in high-temperature environments it is relatively easy to try new things, while in low-temperature environments it is hard.
Different versions of the simulated annealing algorithm have been proven to converge to the global optimum in different problems. Of course, these are just convergence results. We do not know how long the convergence will take, with all the heating and cooling involved. Note that the space of possible solutions does not have to be continuous. It can be discrete, since perturbation is defined for algebraic objects as well, e.g. discrete numbers, graphs, groups, permutations, etc. Say, with permutations the perturbation operation is defined naturally as the exchange of any two elements.
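A minimal simulated-annealing sketch for maximizing a one-dimensional V(x) is given below; the toy objective, the Gaussian perturbation and the geometric cooling schedule are illustrative assumptions, and the acceptance rule exp((V(x') - V(x)) / T) is one common choice consistent with the description above.

    # A minimal simulated-annealing sketch for maximizing V(x).
    import math
    import random

    def V(x):
        # toy multimodal objective with many local maxima
        return -x * x + 3.0 * math.cos(5.0 * x)

    x = random.uniform(-10, 10)
    temperature = 10.0
    while temperature > 1e-3:
        x_new = x + random.gauss(0, 0.5)               # slight perturbation of the current solution
        delta = V(x_new) - V(x)
        # always accept improvements; otherwise accept with probability
        # exp(delta / T), which shrinks as the loss grows and as T cools
        if delta > 0 or random.random() < math.exp(delta / temperature):
            x = x_new
        temperature *= 0.999                           # geometric cooling step

    print(x, V(x))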
Robust optimization deals with minimax problems, where not all the specifications of the system are known and one has to find the best solution for the worst specification. Robust optimization borrows methods from deterministic nonlinear optimization, stochastic search and stochastic approximation.
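As a minimal sketch of the minimax idea, the snippet below picks the decision whose worst-case cost over a small set of scenarios is best; both the decisions and the scenario costs are made up for illustration.

    # A minimal robust (minimax) decision sketch over discrete scenarios.
    decisions = ["a", "b", "c"]
    scenarios = ["s1", "s2", "s3"]
    cost = {("a", "s1"): 4, ("a", "s2"): 7, ("a", "s3"): 3,
            ("b", "s1"): 5, ("b", "s2"): 5, ("b", "s3"): 6,
            ("c", "s1"): 2, ("c", "s2"): 9, ("c", "s3"): 4}

    # for each decision, assume the worst specification; then pick the best of those
    worst = {d: max(cost[d, s] for s in scenarios) for d in decisions}
    best_robust = min(decisions, key=lambda d: worst[d])
    print(best_robust, worst[best_robust])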
OPTIMIZATION REFERENCES
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Spall, J. C. (2003). Introduction to Stochastic Search and Optimization. Wiley-Interscience, Hoboken, New Jersey. - Standard reference on stochastic optimization.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. - The book is longer than it could be, but if you read it you will see the power that reinforcement learning techniques can give you. The estimation algorithm may not converge for days, or it may not converge at all, but if it does, very complex, multi-dimensional problems are solved and optimal strategies are determined.
Prékopa, A. (2010). Stochastic Programming (Mathematics and Its Applications). Kluwer Academic Publishers.
Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust Optimization. Princeton University Press.
Kouvelis, P., & Yu, G. (2010). Robust Discrete Optimization and Its Applications (Nonconvex Optimization and Its Applications). Kluwer Academic Publishers.
OPTIMIZATION RESOURCES