Chapter 7: Optimization Algorithms
Introduction: The Parameter Tuning Problem
Every quant trader faces the same frustrating question: “What parameters should I use?”
- Your moving average crossover strategy—should it be 10/50 or 15/75?
- Your portfolio weights—how much BTC vs ETH?
- Your stop loss—5% or 10%?
- Your mean-reversion threshold—2 standard deviations or 2.5?
Trial and error is expensive. Each backtest takes time, and testing all combinations creates data-mining bias (overfitting). You need a systematic way to find optimal parameters.
This chapter teaches you optimization algorithms—mathematical methods for finding the best solution to a problem. We’ll start with the hiking analogy that makes gradient descent intuitive, build up to convex optimization’s guaranteed solutions, and finish with genetic algorithms for when gradients fail.
What you’ll learn:
- Gradient Descent: Follow the slope downhill—fast when you can compute derivatives
- Convex Optimization: Portfolio problems with provably optimal solutions
- Genetic Algorithms: Evolution-inspired search for complex parameter spaces
- Simulated Annealing: Escape local optima by accepting worse solutions probabilistically
- Decision Framework: How to choose the right optimizer for your problem
Pedagogical approach: Start with intuition (hiking downhill), formalize the mathematics, implement working code, then apply to real trading problems. Every algorithm includes a “when to use” decision guide.
7.1 Gradient Descent: Following the Slope
7.1.1 The Hiking Analogy
Imagine you’re hiking on a foggy mountain. You can’t see the valley below, but you can feel the slope beneath your feet. How do you reach the bottom?
Naive approach: Walk in a random direction. Sometimes you descend, sometimes you climb. Inefficient.
Smart approach: Feel the ground, determine which direction slopes down most steeply, take a step in that direction. Repeat until the ground is flat (you’ve reached the bottom).
This is gradient descent. The “slope” is the gradient (derivative), and “walking downhill” is updating parameters in the direction that decreases your objective function.
7.1.2 Mathematical Formulation
Goal: Minimize a function $f(\theta)$ where $\theta$ represents parameters (e.g., portfolio weights, strategy thresholds).
Gradient descent update rule:
$$\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$$
Components:
- $\theta_t$: current parameter value at iteration $t$
- $\nabla f(\theta_t)$: gradient (slope) of $f$ at $\theta_t$
- $\alpha$: learning rate (step size)—how far to move in each iteration
- $\theta_{t+1}$: updated parameter value
Intuition:
- If $\nabla f > 0$ (slope is positive), decrease $\theta$ (move left)
- If $\nabla f < 0$ (slope is negative), increase $\theta$ (move right)
- If $\nabla f = 0$ (flat ground), stop—you’ve found a minimum
Example: Minimize $f(x) = (x - 3)^2$
- Derivative: $f'(x) = 2(x - 3)$
- At $x = 0$: gradient $= 2(0-3) = -6$ (negative → increase $x$)
- At $x = 5$: gradient $= 2(5-3) = +4$ (positive → decrease $x$)
- At $x = 3$: gradient $= 0$ (minimum found!)
;; 1D Gradient Descent
;; Parameters:
;; f: objective function to minimize
;; df: derivative of f (gradient)
;; x0: initial guess for parameter
;; alpha: learning rate (step size)
;; max-iters: maximum iterations
;; tolerance: stop when change < tolerance
;; Returns: {:optimum x*, :history [x_0, x_1, ..., x_n]}
(define (gradient-descent-1d f df x0 alpha max-iters tolerance)
(let ((x x0)
(history [x0]))
(for (i (range 0 max-iters))
;; Compute gradient at current position
(let ((grad (df x)))
;; Update: x_new = x - alpha * gradient
(let ((x-new (- x (* alpha grad))))
(set! history (append history x-new))
;; Check convergence: stop if change < tolerance
(if (< (abs (- x-new x)) tolerance)
(do
(log :message "Converged" :iteration i :value x-new)
(set! x x-new)
(break)) ;; Conceptual break—would need loop control
(set! x x-new)))))
{:optimum x :history history}))
;; Example: Minimize f(x) = (x - 3)²
;; This is a quadratic with minimum at x = 3
(define (f x)
(let ((diff (- x 3)))
(* diff diff)))
;; Derivative: f'(x) = 2(x - 3)
(define (df x)
(* 2 (- x 3)))
;; Run gradient descent
(define result
(gradient-descent-1d
f ;; Function to minimize
df ;; Its derivative
0.0 ;; Start at x = 0
0.1 ;; Learning rate = 0.1
100 ;; Max 100 iterations
0.0001)) ;; Stop when change < 0.0001
;; Result: {:optimum 3.0000, :history [0, 0.6, 1.08, 1.464, ...]}
;; Converges to x = 3 (exact solution!)
Learning Rate Selection:
The learning rate $\alpha$ controls step size:
| Learning Rate | Behavior | Iterations | Risk |
|---|---|---|---|
| Too small ($\alpha = 0.001$) | Tiny steps | 1000+ | Slow convergence |
| Optimal ($\alpha = 0.1$) | Steady progress | ~20 | None |
| Too large ($\alpha = 0.5$) | Overshoots | Oscillates | Divergence |
| Way too large ($\alpha = 2.0$) | Explodes | Infinite | Guaranteed failure |
Rule of thumb: Start with $\alpha = 0.01$. If it diverges, halve it. If it’s too slow, double it.
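To make these regimes concrete, the sketch below reruns the quadratic example with three learning rates, reusing the gradient-descent-1d, f, and df definitions above; the iteration counts are approximate and depend on the tolerance.
;; Compare learning rates on f(x) = (x - 3)², starting from x = 0
;; alpha = 0.01: converges, but needs several hundred iterations
(define slow-result (gradient-descent-1d f df 0.0 0.01 1000 0.0001))
;; alpha = 0.1: converges in a few dozen iterations (as in the example above)
(define good-result (gradient-descent-1d f df 0.0 0.1 100 0.0001))
;; alpha = 1.1: each step overshoots x = 3 by more than the last, so the iterates diverge
(define bad-result (gradient-descent-1d f df 0.0 1.1 100 0.0001))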
7.1.3 Multi-Dimensional Gradient Descent
Most trading problems have multiple parameters. Portfolio optimization requires weights for each asset. Strategy tuning needs multiple thresholds.
Extension to vectors:
$$\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$$
where $\theta = [\theta_1, \theta_2, …, \theta_n]$ and $\nabla f = [\frac{\partial f}{\partial \theta_1}, \frac{\partial f}{\partial \theta_2}, …, \frac{\partial f}{\partial \theta_n}]$.
;; N-dimensional gradient descent
;; Parameters:
;; f: objective function (takes vector, returns scalar)
;; grad: gradient function (takes vector, returns vector of partial derivatives)
;; theta0: initial parameter vector [θ₁, θ₂, ..., θₙ]
;; alpha: learning rate
;; max-iters: maximum iterations
;; tolerance: L2 norm convergence threshold
(define (gradient-descent-nd f grad theta0 alpha max-iters tolerance)
(let ((theta theta0)
(history [theta0]))
(for (i (range 0 max-iters))
;; Compute gradient vector
(let ((g (grad theta)))
;; Update each parameter: θⱼ_new = θⱼ - α * ∂f/∂θⱼ
(let ((theta-new
(map (range 0 (length theta))
(lambda (j)
(- (nth theta j) (* alpha (nth g j)))))))
(set! history (append history theta-new))
;; Check convergence using L2 norm of change
(if (< (l2-norm (vec-sub theta-new theta)) tolerance)
(do
(log :message "Converged" :iteration i)
(set! theta theta-new)
(break))
(set! theta theta-new)))))
{:optimum theta :history history}))
;; Helper: L2 norm (Euclidean length of vector)
(define (l2-norm v)
(sqrt (sum (map v (lambda (x) (* x x))))))
;; Helper: Vector subtraction
(define (vec-sub a b)
(map (range 0 (length a))
(lambda (i) (- (nth a i) (nth b i)))))
;; Example: Minimize f(x,y) = x² + y²
;; This is a paraboloid with minimum at (0, 0)
(define (f-2d params)
(let ((x (nth params 0))
(y (nth params 1)))
(+ (* x x) (* y y))))
;; Gradient: ∇f = [2x, 2y]
(define (grad-2d params)
(let ((x (nth params 0))
(y (nth params 1)))
[(* 2 x) (* 2 y)]))
;; Run from initial point (5, 5)
(define result-2d
(gradient-descent-nd
f-2d
grad-2d
[5.0 5.0] ;; Start at (5, 5)
0.1 ;; Learning rate
100
0.0001))
;; Result: {:optimum [0.0, 0.0], ...}
;; Converges to (0, 0) as expected
Visualizing convergence:
If you plot the history, you’ll see the path move steadily along the diagonal toward (0, 0) (a quick way to print it is sketched after this list):
- Iteration 0: (5.0, 5.0)
- Iteration 1: (4.0, 4.0)
- Iteration 2: (3.2, 3.2)
- …
- Iteration 20: ≈ (0.06, 0.06)
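A minimal way to inspect the path, assuming the slice and log helpers used elsewhere in this chapter and keyword access on the result map:
;; Print the first few points of the optimization path
(define path (result-2d :history))
(for (point (slice path 0 5))
  (log :message "Path point" :theta point))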
7.1.4 Momentum: Accelerating Convergence
Problem: Vanilla gradient descent oscillates in narrow valleys—takes tiny steps along the valley, wastes time bouncing side-to-side.
Solution: Momentum accumulates gradients over time, smoothing out oscillations.
Physical analogy: A ball rolling downhill gains momentum—it doesn’t stop and restart at each bump.
Update rule:
$$\begin{aligned} v_{t+1} &= \beta v_t + (1-\beta) \nabla f(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha v_{t+1} \end{aligned}$$
where:
- $v_t$ = velocity (accumulated gradient)
- $\beta$ = momentum coefficient (typically 0.9)
Interpretation:
- $\beta = 0$: No momentum (vanilla gradient descent)
- $\beta = 0.9$: 90% of previous velocity + 10% new gradient
- $\beta = 0.99$: Very high momentum—slow to change direction
;; Gradient descent with momentum
(define (gradient-descent-momentum f grad theta0 alpha beta max-iters tolerance)
(let ((theta theta0)
(velocity (map theta0 (lambda (x) 0))) ;; Initialize v = 0
(history [theta0]))
(for (i (range 0 max-iters))
(let ((g (grad theta)))
;; Update velocity: v = β*v + (1-β)*gradient
(let ((velocity-new
(map (range 0 (length velocity))
(lambda (j)
(+ (* beta (nth velocity j))
(* (- 1 beta) (nth g j)))))))
(set! velocity velocity-new)
;; Update parameters: θ = θ - α*v
(let ((theta-new
(map (range 0 (length theta))
(lambda (j)
(- (nth theta j) (* alpha (nth velocity j)))))))
(set! history (append history theta-new))
(if (< (l2-norm (vec-sub theta-new theta)) tolerance)
(do
(set! theta theta-new)
(break))
(set! theta theta-new))))))
{:optimum theta :history history}))
;; Example: Same function f(x,y) = x² + y², but with momentum
(define result-momentum
(gradient-descent-momentum
f-2d
grad-2d
[5.0 5.0]
0.1 ;; Learning rate
0.9 ;; Momentum (beta = 0.9)
100
0.0001))
;; Typically converges 2-3x faster than vanilla GD
Convergence Comparison (illustrative figures for an elongated, ill-conditioned valley):
| Method | Iterations to Converge | Oscillations |
|---|---|---|
| Vanilla GD | 150 | High (zigzags) |
| Momentum (β=0.9) | 45 | Low (smooth) |
| Speedup | 3.3x faster | - |
7.1.5 Adam: Adaptive Learning Rates
Problem: Different parameters may need different learning rates. Portfolio weights for volatile assets need smaller steps than stable assets.
Adam (Adaptive Moment Estimation) adapts learning rates per parameter using:
- First moment (momentum): Exponential moving average of gradients
- Second moment: Exponential moving average of squared gradients (variance)
Update rules:
$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1) \nabla f(\theta_t) \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) (\nabla f(\theta_t))^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \theta_{t+1} &= \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned}$$
Default hyperparameters: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$
Interpretation:
- $m_t$: First moment (mean gradient direction)
- $v_t$: Second moment (variance of gradients)
- Division by $\sqrt{\hat{v}_t}$: smaller steps for parameters whose gradients are large or noisy
;; Adam optimizer
(define (adam-optimizer f grad theta0 alpha max-iters tolerance)
(let ((theta theta0)
(m (map theta0 (lambda (x) 0))) ;; First moment
(v (map theta0 (lambda (x) 0))) ;; Second moment
(beta1 0.9)
(beta2 0.999)
(epsilon 1e-8)
(history [theta0]))
(for (t (range 1 (+ max-iters 1)))
(let ((g (grad theta)))
;; Update biased first moment: m = β₁*m + (1-β₁)*g
(let ((m-new
(map (range 0 (length m))
(lambda (j)
(+ (* beta1 (nth m j))
(* (- 1 beta1) (nth g j)))))))
;; Update biased second moment: v = β₂*v + (1-β₂)*g²
(let ((v-new
(map (range 0 (length v))
(lambda (j)
(+ (* beta2 (nth v j))
(* (- 1 beta2) (nth g j) (nth g j)))))))
(set! m m-new)
(set! v v-new)
;; Bias correction
(let ((m-hat (map m (lambda (mj) (/ mj (- 1 (pow beta1 t))))))
(v-hat (map v (lambda (vj) (/ vj (- 1 (pow beta2 t)))))))
;; Update parameters: θ = θ - α * m̂ / (√v̂ + ε)
(let ((theta-new
(map (range 0 (length theta))
(lambda (j)
(- (nth theta j)
(/ (* alpha (nth m-hat j))
(+ (sqrt (nth v-hat j)) epsilon)))))))
(set! history (append history theta-new))
(if (< (l2-norm (vec-sub theta-new theta)) tolerance)
(do
(set! theta theta-new)
(break))
(set! theta theta-new))))))))
{:optimum theta :history history}))
Adam Advantages:
- Adaptive learning rates per parameter
- Robust to hyperparameter choice (works with default values)
- Fast convergence on non-convex landscapes
- De facto standard for deep learning
When to use Adam:
- Complex, non-convex optimization
- Many parameters with different scales
- Noisy gradients
When to skip it: for simple convex problems, vanilla gradient descent is usually sufficient.
7.2 Convex Optimization: Guaranteed Global Optima
7.2.1 Why Convexity Matters
Convex function: A bowl-shaped function where every local minimum is a global minimum.
Formally: $f$ is convex if for any two points $x$, $y$ and any $\lambda \in [0,1]$:
$$f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y)$$
Intuition: The line segment connecting any two points on the graph lies above the graph.
Examples:
- Convex: $f(x) = x^2$, $f(x) = e^x$, $f(x) = |x|$
- Not convex: $f(x) = x^3$, $f(x) = \sin(x)$
Why this matters:
For convex functions, gradient descent always finds the global minimum. No local optima traps!
Portfolio optimization is convex: Minimizing variance subject to return constraints is a convex problem (quadratic programming).
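As a quick numeric illustration of the definition, the sketch below checks the convexity inequality at a single $(x, y, \lambda)$ triple for $x^2$ (holds) and $\sin(x)$ (fails); a real check would sample many pairs and many $\lambda$ values, and the sin helper is assumed to be available.
;; Check f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y) at one point
(define (convex-at? f x y lam)
  (<= (f (+ (* lam x) (* (- 1 lam) y)))
      (+ (* lam (f x)) (* (- 1 lam) (f y)))))
(convex-at? (lambda (x) (* x x)) -2.0 3.0 0.5)  ;; → true: x² satisfies the inequality here
(convex-at? (lambda (x) (sin x)) 0.5 2.5 0.5)   ;; → false: sin violates it on (0, π)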
7.2.2 Linear Programming (LP)
Linear Program: Optimize a linear objective subject to linear constraints.
Standard form:
$$\begin{aligned} \min_x \quad & c^T x \\ \text{s.t.} \quad & Ax \leq b \\ & x \geq 0 \end{aligned}$$
Example: Portfolio with transaction costs
Maximize expected return:
- Objective: $\max \mu^T w$ (return = weighted average of asset returns)
- Constraints:
- $\sum w_i = 1$ (fully invested)
- $w_i \geq 0$ (long-only)
- $\sum |w_i - w_i^{\text{old}}| \leq 0.1$ (max 10% turnover)
;; Simplified LP for portfolio optimization
;; Real implementations use specialized solvers (simplex, interior-point)
;; This is a greedy heuristic for pedagogy
(define (lp-portfolio-simple expected-returns max-turnover)
(let ((n (length expected-returns)))
;; Greedy heuristic: allocate everything to the highest expected return
;; (Not optimal, and max-turnover is ignored here; this only illustrates the LP setup)
(let ((sorted-indices (argsort expected-returns >))) ;; Descending order
;; Allocate 100% to best asset (within turnover limit)
(let ((weights (make-array n 0)))
(set-nth! weights (nth sorted-indices 0) 1.0)
weights))))
;; Example
(define expected-returns [0.08 0.12 0.10 0.15]) ;; 8%, 12%, 10%, 15%
(define optimal-weights (lp-portfolio-simple expected-returns 0.1))
;; → [0, 0, 0, 1.0] (100% in highest-return asset)
Real-world LP applications:
- Trade execution: Minimize transaction costs subject to volume constraints
- Arbitrage detection: Find profitable cycles in exchange rates
- Portfolio rebalancing: Minimize turnover while achieving target exposure
LP solvers: Use specialized libraries (CPLEX, Gurobi, GLPK) for production—they’re orders of magnitude faster.
7.2.3 Quadratic Programming (QP): Markowitz Portfolio
Quadratic Program: Quadratic objective with linear constraints.
Markowitz mean-variance optimization:
$$\begin{aligned} \min_w \quad & \frac{\lambda}{2} w^T \Sigma w - \mu^T w \\ \text{s.t.} \quad & \mathbf{1}^T w = 1 \\ & w \geq 0 \end{aligned}$$
where:
- $w$ = portfolio weights
- $\Sigma$ = covariance matrix (risk)
- $\mu$ = expected returns
- $\lambda$ = risk aversion (larger $\lambda$ → risk is penalized more heavily → more conservative portfolio)
Interpretation:
- Objective = $\lambda \cdot$ Risk $-$ Return
- Trade off return against risk, with $\lambda$ controlling how heavily risk is penalized
Analytical solution (unconstrained):
$$w^* = \frac{1}{\lambda} \Sigma^{-1} \mu$$
;; Markowitz mean-variance optimization (unconstrained)
;; Parameters:
;; expected-returns: vector of expected returns [μ₁, μ₂, ..., μₙ]
;; covariance-matrix: nxn covariance matrix Σ
;; risk-aversion: λ (larger = more risk-averse)
;; Returns: optimal weights (before normalization)
(define (markowitz-portfolio expected-returns covariance-matrix risk-aversion)
(let ((n (length expected-returns)))
;; Analytical solution: w* = (1/λ) * Σ⁻¹ * μ
(let ((sigma-inv (matrix-inverse covariance-matrix)))
(let ((w-unconstrained
(matrix-vec-mult sigma-inv expected-returns)))
(let ((scaled-w
(map w-unconstrained
(lambda (wi) (/ wi risk-aversion)))))
;; Normalize to sum to 1
(let ((total (sum scaled-w)))
(map scaled-w (lambda (wi) (/ wi total)))))))))
;; Example
(define mu [0.10 0.12 0.08]) ;; Expected returns: 10%, 12%, 8%
(define sigma
[[0.04 0.01 0.02] ;; Covariance matrix
[0.01 0.09 0.01]
[0.02 0.01 0.16]])
(define optimal-weights (markowitz-portfolio mu sigma 2.0))
;; Typical result: [0.45, 0.35, 0.20] (diversified across assets)
Efficient Frontier:
Plot all optimal portfolios for varying risk aversion:
;; Generate efficient frontier (risk-return trade-off curve)
(define (efficient-frontier mu sigma risk-aversions)
(let ((frontier []))
(for (lambda-val risk-aversions)
(let ((weights (markowitz-portfolio mu sigma lambda-val)))
;; Calculate portfolio return and risk
(let ((port-return (dot-product weights mu))
(port-variance (quadratic-form weights sigma)))
(set! frontier (append frontier
{:risk (sqrt port-variance)
:return port-return
:weights weights
:lambda lambda-val})))))
frontier))
;; Helper: Quadratic form w^T Σ w
(define (quadratic-form x A)
(sum (map (range 0 (length x))
(lambda (i)
(sum (map (range 0 (length x))
(lambda (j)
(* (nth x i) (nth x j) (nth (nth A i) j)))))))))
;; Generate 20 portfolios along the efficient frontier
(define frontier
(efficient-frontier mu sigma (linspace 0.5 10.0 20)))
;; Plot: frontier points show (risk, return) pairs
;; Higher risk → higher return (as expected)
Maximum Sharpe Ratio:
The Sharpe ratio is return per unit risk:
$$\text{Sharpe} = \frac{w^T (\mu - r_f \mathbf{1})}{\sqrt{w^T \Sigma w}}$$
Maximizing Sharpe is non-linear, but can be transformed to QP:
Let $y = w / (w^T (\mu - r_f \mathbf{1}))$, then solve:
$$\begin{aligned} \min_y \quad & y^T \Sigma y \\ \text{s.t.} \quad & y^T (\mu - r_f \mathbf{1}) = 1, \quad y \geq 0 \end{aligned}$$
Recover $w = y / (\mathbf{1}^T y)$.
;; Maximum Sharpe ratio portfolio
(define (max-sharpe-portfolio mu sigma rf)
(let ((excess-returns (map mu (lambda (r) (- r rf)))))
;; Heuristic: approximate with unconstrained solution
;; (Real QP solver needed for constraints)
(let ((sigma-inv (matrix-inverse sigma)))
(let ((y (matrix-vec-mult sigma-inv excess-returns)))
;; Normalize: w = y / sum(y)
(let ((total (sum y)))
(map y (lambda (yi) (/ yi total))))))))
;; Example
(define max-sharpe-w (max-sharpe-portfolio mu sigma 0.02)) ;; rf = 2%
;; Maximizes return/risk ratio
7.3 Genetic Algorithms: Evolution-Inspired Optimization
7.3.1 When Gradients Fail
Gradient-based methods require:
- Differentiable objective function
- Continuous parameters
But many trading problems violate these:
- Discrete parameters: Moving average period must be an integer (can’t use 15.7 days)
- Non-differentiable: Profit/loss from backtest (discontinuous at stop loss)
- Black-box: Objective function is a simulation—no closed-form derivative
Solution: Genetic Algorithms (GA) search without gradients, inspired by biological evolution.
7.3.2 The Evolution Analogy
Nature’s optimizer: Evolution optimizes organisms through:
- Selection: Fittest individuals survive
- Crossover: Combine traits from two parents
- Mutation: Random changes introduce novelty
- Iteration: Repeat for many generations
GA applies this to optimization:
- Population: Collection of candidate solutions (e.g., 100 parameter sets)
- Fitness: Evaluate each candidate (e.g., backtest Sharpe ratio)
- Selection: Choose best candidates as parents
- Crossover: Combine parents to create offspring
- Mutation: Randomly perturb offspring
- Replacement: New generation replaces old
Repeat until convergence (fitness plateaus or max generations reached).
7.3.3 Implementation
;; Genetic algorithm for parameter optimization
;; Parameters:
;; fitness-fn: function that evaluates a candidate (higher = better)
;; param-ranges: array of {:min, :max} for each parameter
;; pop-size: population size (e.g., 50-200)
;; generations: number of generations to evolve
;; Returns: best individual found
(define (genetic-algorithm fitness-fn param-ranges pop-size generations)
(let ((population (initialize-population param-ranges pop-size))
(best-ever null)
(best-fitness -999999))
(for (gen (range 0 generations))
;; Evaluate fitness for all individuals
(let ((fitness-scores
(map population (lambda (individual) (fitness-fn individual)))))
;; Track best individual ever seen
(let ((gen-best-idx (argmax fitness-scores))
(gen-best-fitness (nth fitness-scores gen-best-idx)))
(if (> gen-best-fitness best-fitness)
(do
(set! best-fitness gen-best-fitness)
(set! best-ever (nth population gen-best-idx)))
null)
;; Log progress every 10 generations
(if (= (% gen 10) 0)
(log :message "Generation" :gen gen
:best-fitness best-fitness)
null))
;; Selection: tournament selection
(let ((parents (tournament-selection population fitness-scores pop-size)))
;; Crossover and mutation: create next generation
(let ((offspring (crossover-and-mutate parents param-ranges)))
(set! population offspring)))))
;; Return best individual found
best-ever))
;; Initialize random population
;; Each individual is a vector of parameter values
(define (initialize-population param-ranges pop-size)
(let ((population []))
(for (i (range 0 pop-size))
(let ((individual
(map param-ranges
(lambda (range)
;; Random value in [min, max]
(+ (range :min)
(* (random) (- (range :max) (range :min))))))))
(set! population (append population individual))))
population))
7.3.4 Selection Methods
Tournament Selection: Randomly sample $k$ individuals, choose the best.
;; Tournament selection
;; Randomly pick tournament-size individuals, select best
;; Repeat pop-size times to create parent pool
(define (tournament-selection population fitness-scores pop-size)
(let ((parents [])
(tournament-size 3)) ;; Typically 3-5
(for (i (range 0 pop-size))
;; Random tournament
(let ((tournament-indices
(map (range 0 tournament-size)
(lambda (_) (floor (* (random) (length population)))))))
;; Find best in tournament
(let ((best-idx
(reduce tournament-indices
(nth tournament-indices 0)
(lambda (best idx)
(if (> (nth fitness-scores idx)
(nth fitness-scores best))
idx
best)))))
(set! parents (append parents (nth population best-idx))))))
parents))
Roulette Wheel Selection: Probability proportional to fitness.
;; Roulette wheel selection (fitness-proportionate)
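;; Note: fitness-proportionate selection assumes non-negative scores;
;; shift or rescale fitness first if it can be negative (e.g., negative Sharpe ratios)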
(define (roulette-selection population fitness-scores n)
(let ((total-fitness (sum fitness-scores))
(selected []))
(for (i (range 0 n))
(let ((spin (* (random) total-fitness))
(cumulative 0)
(selected-idx 0))
;; Find individual where cumulative fitness exceeds spin
(for (j (range 0 (length fitness-scores)))
(set! cumulative (+ cumulative (nth fitness-scores j)))
(if (>= cumulative spin)
(do
(set! selected-idx j)
(break))
null))
(set! selected (append selected (nth population selected-idx)))))
selected))
7.3.5 Crossover and Mutation
Uniform Crossover: Each gene (parameter) has 50% chance from each parent.
Mutation: Small random perturbation with low probability.
;; Crossover and mutation
(define (crossover-and-mutate parents param-ranges)
(let ((offspring [])
(crossover-rate 0.8) ;; 80% of pairs undergo crossover
(mutation-rate 0.1)) ;; 10% mutation probability per gene
;; Pair up parents (assumes even population size)
(for (i (range 0 (/ (length parents) 2)))
(let ((parent1 (nth parents (* i 2)))
(parent2 (nth parents (+ (* i 2) 1))))
;; Initialize children as copies of parents
(let ((child1 parent1)
(child2 parent2))
;; Crossover (uniform)
(if (< (random) crossover-rate)
(do
;; Swap genes with 50% probability
(set! child1 (map (range 0 (length parent1))
(lambda (j)
(if (< (random) 0.5)
(nth parent1 j)
(nth parent2 j)))))
(set! child2 (map (range 0 (length parent2))
(lambda (j)
(if (< (random) 0.5)
(nth parent2 j)
(nth parent1 j))))))
null)
;; Mutation
(set! child1 (mutate-individual child1 param-ranges mutation-rate))
(set! child2 (mutate-individual child2 param-ranges mutation-rate))
(set! offspring (append offspring child1))
(set! offspring (append offspring child2)))))
offspring))
;; Mutate individual: randomly perturb genes
(define (mutate-individual individual param-ranges mutation-rate)
(map (range 0 (length individual))
(lambda (j)
(if (< (random) mutation-rate)
;; Mutate: random value in parameter range
(let ((range (nth param-ranges j)))
(+ (range :min)
(* (random) (- (range :max) (range :min)))))
;; No mutation
(nth individual j)))))
7.3.6 Example: SMA Crossover Strategy Optimization
Problem: Find optimal moving average periods for a crossover strategy.
Parameters:
- Fast MA period: 5-50 days
- Slow MA period: 20-200 days
Fitness: Backtest Sharpe ratio
;; Fitness function: backtest SMA crossover strategy
;; params = [fast_period, slow_period]
(define (sma-strategy-fitness params prices)
(let ((fast-period (floor (nth params 0))) ;; Round to integer
(slow-period (floor (nth params 1))))
;; Constraint: fast must be < slow
(if (>= fast-period slow-period)
-9999 ;; Invalid: return terrible fitness
;; Valid: backtest strategy
(let ((strategy (sma-crossover-strategy fast-period slow-period)))
(let ((backtest-result (backtest-strategy strategy prices 10000)))
(backtest-result :sharpe-ratio)))))) ;; Fitness = Sharpe ratio
;; Run genetic algorithm
(define param-ranges
[{:min 5 :max 50} ;; Fast MA: 5-50 days
{:min 20 :max 200}]) ;; Slow MA: 20-200 days
(define best-params
(genetic-algorithm
(lambda (params) (sma-strategy-fitness params historical-prices))
param-ranges
50 ;; Population size
100)) ;; 100 generations
;; Result: e.g., [12, 45] (fast=12, slow=45)
;; These are the parameters with highest backtest Sharpe ratio
GA vs Gradient Descent:
| Aspect | Genetic Algorithm | Gradient Descent |
|---|---|---|
| Requires derivatives | No | Yes |
| Handles discrete params | Yes | No |
| Global optimum | Maybe (stochastic) | Only if convex |
| Computational cost | High (5000+ evaluations) | Low (100 evaluations) |
| Best for | Complex, black-box | Smooth, differentiable |
7.4 Simulated Annealing: Escaping Local Optima
7.4.1 The Metallurgy Analogy
Annealing is a metallurgical process:
- Heat metal to high temperature (atoms move freely)
- Slowly cool (atoms settle into low-energy state)
- Result: Stronger, more stable structure
Key insight: At high temperature, atoms can escape local energy wells, avoiding suboptimal configurations.
Simulated Annealing (SA) applies this to optimization:
- “Temperature” = willingness to accept worse solutions
- High temperature = explore widely (accept many worse moves)
- Low temperature = exploit locally (accept few worse moves)
- Gradual cooling = transition from exploration to exploitation
7.4.2 The Algorithm
Accept probability for worse solutions:
$$P(\text{accept worse}) = \exp\left(-\frac{\Delta E}{T}\right)$$
where:
- $\Delta E$ = increase in objective (energy)
- $T$ = temperature (decreases over time)
Intuition:
- If $T$ is high, $e^{-\Delta E/T} \approx 1$ → accept almost anything
- If $T$ is low, $e^{-\Delta E/T} \approx 0$ → accept only improvements
- If $\Delta E$ is small, more likely to accept (small worsening)
;; Simulated annealing
;; Parameters:
;; objective: function to minimize (energy)
;; initial-solution: starting point
;; neighbor-fn: function that generates a neighbor solution
;; initial-temp: starting temperature
;; cooling-rate: multiply temperature by this each iteration (e.g., 0.95)
;; max-iters: maximum iterations
(define (simulated-annealing objective initial-solution neighbor-fn
initial-temp cooling-rate max-iters)
(let ((current-solution initial-solution)
(current-energy (objective initial-solution))
(best-solution initial-solution)
(best-energy current-energy)
(temperature initial-temp))
(for (iter (range 0 max-iters))
;; Generate neighbor
(let ((neighbor (neighbor-fn current-solution)))
(let ((neighbor-energy (objective neighbor)))
;; Always accept better solutions
(if (< neighbor-energy current-energy)
(do
(set! current-solution neighbor)
(set! current-energy neighbor-energy)
;; Update best
(if (< neighbor-energy best-energy)
(do
(set! best-solution neighbor)
(set! best-energy neighbor-energy))
null))
;; Accept worse solutions probabilistically
(let ((delta-E (- neighbor-energy current-energy))
(acceptance-prob (exp (- (/ delta-E temperature)))))
(if (< (random) acceptance-prob)
(do
(set! current-solution neighbor)
(set! current-energy neighbor-energy))
null)))))
;; Cool down: T ← T * cooling_rate
(set! temperature (* temperature cooling-rate)))
{:solution best-solution :energy best-energy}))
7.4.3 Neighbor Generation
Key design choice: How to generate neighbors?
For strategy parameters: Perturb by small random amount (±10% of range).
;; Neighbor function: perturb one random parameter
(define (neighbor-solution-strategy params param-ranges)
(let ((idx (floor (* (random) (length params)))) ;; Random parameter index
(range (nth param-ranges idx)))
;; Perturbation: ±10% of parameter range
(let ((perturbation (* 0.1 (- (range :max) (range :min))
(- (* 2 (random)) 1)))) ;; Uniform [-1, 1]
(let ((new-val (+ (nth params idx) perturbation)))
;; Clamp to valid range
(let ((clamped (max (range :min) (min (range :max) new-val))))
;; Create new parameter vector with mutated value
(let ((new-params (copy-array params)))
(set-nth! new-params idx clamped)
new-params))))))
;; Example: Optimize SMA parameters with SA
(define sa-result
(simulated-annealing
(lambda (params) (- (sma-strategy-fitness params prices))) ;; Minimize -Sharpe
[10 30] ;; Initial guess
(lambda (p) (neighbor-solution-strategy p param-ranges))
100.0 ;; Initial temperature
0.95 ;; Cooling rate (T ← 0.95*T)
1000)) ;; 1000 iterations
;; Result: {:solution [12, 45], :energy -1.8}
;; Sharpe ratio = 1.8 (minimized negative = maximized positive)
7.4.4 Cooling Schedules
The cooling schedule controls exploration vs exploitation trade-off:
| Schedule | Formula | Characteristics |
|---|---|---|
| Exponential | $T_k = T_0 \alpha^k$ | Fast, may miss global optimum |
| Linear | $T_k = T_0 - k \beta$ | Slow, thorough exploration |
| Logarithmic | $T_k = T_0 / \log(k+2)$ | Very slow, theoretical guarantee |
| Adaptive | Increase $T$ if stuck | Best empirical performance |
Adaptive cooling: Reheat if no improvement for many iterations.
;; Adaptive simulated annealing (reheat if stuck)
(define (adaptive-annealing objective initial-solution neighbor-fn
initial-temp max-iters)
(let ((temperature initial-temp)
(no-improvement-count 0)
(current-solution initial-solution)
(current-energy (objective initial-solution))
(best-solution initial-solution)
(best-energy current-energy))
(for (iter (range 0 max-iters))
;; (Standard SA accept/reject logic here)
;; ...
;; Track stagnation
(if (= best-energy current-energy)
(set! no-improvement-count (+ no-improvement-count 1))
(set! no-improvement-count 0))
;; Adaptive cooling
(if (> no-improvement-count 50)
(do
;; Stuck: reheat to escape
(set! temperature (* temperature 1.2))
(set! no-improvement-count 0))
;; Normal cooling
(set! temperature (* temperature 0.95))))
{:solution best-solution :energy best-energy}))
7.5 Grid Search and Bayesian Optimization
7.5.1 Grid Search: Exhaustive Exploration
Grid search: Evaluate objective at every point on a grid.
Example: SMA parameters
- Fast: [5, 10, 15, 20, 25, …, 50] (10 values)
- Slow: [20, 30, 40, …, 200] (19 values)
- Total evaluations: 10 × 19 = 190
;; Grid search for strategy parameters
(define (grid-search objective param-grids)
(let ((best-params null)
(best-score -9999))
;; Nested loops over parameter grids (2D example)
(for (p1 (nth param-grids 0))
(for (p2 (nth param-grids 1))
(let ((params [p1 p2])
(score (objective params)))
(if (> score best-score)
(do
(set! best-score score)
(set! best-params params))
null))))
{:params best-params :score best-score}))
;; Example: SMA crossover
(define grid-result
(grid-search
(lambda (p) (sma-strategy-fitness p prices))
[(range 5 51 5) ;; Fast: 5, 10, 15, ..., 50
(range 20 201 10)])) ;; Slow: 20, 30, 40, ..., 200
;; Evaluates 10 × 19 = 190 parameter combinations
Curse of Dimensionality:
Grid search scales exponentially with dimensions:
| Parameters | Values per Param | Total Evaluations |
|---|---|---|
| 2 | 10 | 100 |
| 3 | 10 | 1,000 |
| 5 | 10 | 100,000 |
| 10 | 10 | 10,000,000,000 |
Grid search is infeasible beyond 3-4 parameters.
7.5.2 Random Search: Surprisingly Effective
Random search: Sample parameter values randomly.
Bergstra & Bengio (2012) showed that random search is more efficient than grid search when only a few of the parameters actually matter.
Intuition: If only 2 out of 10 parameters affect the objective, random search explores those 2 dimensions thoroughly, while grid search wastes evaluations on irrelevant dimensions.
;; Random search
(define (random-search objective param-ranges n-samples)
(let ((best-params null)
(best-score -9999))
(for (i (range 0 n-samples))
;; Random sample from each parameter range
(let ((params
(map param-ranges
(lambda (range)
(+ (range :min)
(* (random) (- (range :max) (range :min))))))))
(let ((score (objective params)))
(if (> score best-score)
(do
(set! best-score score)
(set! best-params params))
null))))
{:params best-params :score best-score}))
;; Example: 1000 random samples often beats 8000+ grid points
(define random-result
(random-search
(lambda (p) (sma-strategy-fitness p prices))
param-ranges
1000))
Grid vs Random:
- Grid search: 10×10 = 100 evaluations (fixed grid)
- Random search: 100 evaluations (random points)
If only 1 parameter matters:
- Grid: Explores 10 distinct values of important param
- Random: Explores ~100 distinct values of important param (much better!)
7.5.3 Bayesian Optimization: Smart Sampling
Problem: Random search wastes evaluations exploring bad regions.
Bayesian Optimization: Build a probabilistic model of the objective, use it to choose where to sample next.
Process:
- Start with a few random samples
- Fit a Gaussian Process (GP) to the observed points
- Use acquisition function to choose next point (balance exploration/exploitation)
- Evaluate objective at that point
- Update GP, repeat
Acquisition function (Upper Confidence Bound):
$$\text{UCB}(x) = \mu(x) + \kappa \sigma(x)$$
where:
- $\mu(x)$ = GP mean prediction at $x$ (expected value)
- $\sigma(x)$ = GP standard deviation at $x$ (uncertainty)
- $\kappa$ = exploration parameter (typically 2)
Intuition: Choose points with high predicted value ($\mu$) OR high uncertainty ($\sigma$).
;; Simplified Bayesian optimization (conceptual)
;; Real implementation requires GP library (scikit-optimize, GPyOpt)
(define (bayesian-optimization objective param-ranges n-iters)
(let ((observations []) ;; List of {:params, :score}
(best-params null)
(best-score -9999))
;; Phase 1: Random initialization (5 samples)
(for (i (range 0 5))
(let ((params (random-sample param-ranges)))
(let ((score (objective params)))
(set! observations (append observations {:params params :score score}))
(if (> score best-score)
(do
(set! best-score score)
(set! best-params params))
null))))
;; Phase 2: Bayesian optimization (remaining iterations)
(for (iter (range 0 (- n-iters 5)))
;; Fit GP to observations (conceptual—requires GP library)
;; gp = GaussianProcessRegressor.fit(X, y)
;; Choose next point via acquisition function (e.g., UCB)
(let ((next-params (maximize-acquisition-function observations param-ranges)))
(let ((score (objective next-params)))
(set! observations (append observations
{:params next-params :score score}))
(if (> score best-score)
(do
(set! best-score score)
(set! best-params next-params))
null))))
{:params best-params :score best-score}))
;; Acquisition: UCB (upper confidence bound)
;; In practice, use library to compute GP predictions
;; UCB(x) = μ(x) + κ*σ(x) (high mean or high uncertainty)
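As with the other optimizers in this chapter, the conceptual routine above can be called the same way; this usage sketch assumes the placeholder helpers (random-sample, maximize-acquisition-function) are backed by a real GP library.
;; Example: optimize the SMA crossover with ~50 evaluations instead of thousands
(define bo-result
  (bayesian-optimization
    (lambda (p) (sma-strategy-fitness p historical-prices))
    param-ranges
    50))
;; bo-result: {:params [...], :score ...} (the best parameters found so far)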
Efficiency Comparison:
| Method | Evaluations | Global Optimum | Parallelizable |
|---|---|---|---|
| Grid Search | 10,000+ | No | Yes |
| Random Search | 1,000 | Unlikely | Yes |
| Genetic Algorithm | 5,000 | Maybe | Partially |
| Simulated Annealing | 2,000 | Maybe | No |
| Bayesian Optimization | 100-200 | Likely | Limited |
Bayesian optimization is 10-50x more sample-efficient than random search.
7.6 Constrained Optimization
7.6.1 Penalty Methods
Problem: Constraints complicate optimization.
Example: Portfolio weights must sum to 1 and be non-negative.
Penalty method: Convert constrained problem to unconstrained by adding penalties for constraint violations.
Unconstrained form:
$$\min_x f(x) + \rho \sum_i \max(0, g_i(x))^2$$
where:
- $f(x)$ = original objective
- $g_i(x) \leq 0$ = constraints
- $\rho$ = penalty coefficient (large $\rho$ → strong enforcement); written $\rho$ here to avoid clashing with the expected-return vector $\mu$
;; Penalty method for portfolio optimization
(define (penalized-objective weights mu sigma risk-aversion penalty-coef)
(let ((port-return (dot-product weights mu))
(port-variance (quadratic-form weights sigma)))
;; Original objective: minimize risk - return
(let ((objective (- port-variance (* risk-aversion port-return))))
;; Penalty 1: Weights must sum to 1
(let ((sum-penalty (* penalty-coef (pow (- (sum weights) 1) 2)))
;; Penalty 2: Non-negative weights
(neg-penalty (* penalty-coef
(sum (map weights
(lambda (w)
(pow (min 0 w) 2))))))) ;; Penalty if w < 0
(+ objective sum-penalty neg-penalty)))))
;; Optimize with increasing penalties
(define (optimize-portfolio-penalty mu sigma initial-weights)
(let ((penalty-coef 1.0)
(current-weights initial-weights))
;; Phase 1-5: Increase penalty gradually
(for (phase (range 0 5))
(set! penalty-coef (* penalty-coef 10)) ;; 10, 100, 1000, 10000, 100000
(let ((result
(gradient-descent-nd
(lambda (w) (penalized-objective w mu sigma 1.0 penalty-coef))
(lambda (w) (numerical-gradient
(lambda (w2) (penalized-objective w2 mu sigma 1.0 penalty-coef))
w))
current-weights
0.01
100
0.001)))
(set! current-weights (result :optimum))))
current-weights))
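The loop above calls numerical-gradient, which is not defined elsewhere in this chapter; a minimal central-difference sketch, assuming the copy-array and set-nth! helpers used earlier:
;; Central-difference gradient: ∂f/∂xⱼ ≈ (f(x + h·eⱼ) - f(x - h·eⱼ)) / (2h)
(define (numerical-gradient f x)
  (let ((h 1e-5))
    (map (range 0 (length x))
         (lambda (j)
           (let ((x-plus (copy-array x))
                 (x-minus (copy-array x)))
             (set-nth! x-plus j (+ (nth x j) h))
             (set-nth! x-minus j (- (nth x j) h))
             (/ (- (f x-plus) (f x-minus)) (* 2 h)))))))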
7.6.2 Barrier Methods
Barrier method: Use logarithmic barriers to enforce constraints.
Form:
$$\min_x f(x) - \rho \sum_i \log(-g_i(x))$$
The barrier term blows up as $g_i(x) \to 0^-$, so iterates stay strictly inside the feasible region, away from the constraint boundary.
;; Log barrier for non-negativity
(define (barrier-objective weights mu sigma barrier-coef)
(let ((port-return (dot-product weights mu))
(port-variance (quadratic-form weights sigma)))
(let ((sharpe (/ port-return (sqrt port-variance))))
;; Barrier for w > 0: -log(w)
;; As w → 0, -log(w) → ∞ (infinite penalty)
(let ((barrier-penalty
(- (* barrier-coef
(sum (map weights (lambda (w) (log w))))))))
(+ (- sharpe) barrier-penalty)))))
;; Optimize with decreasing barrier coefficient
;; Start with large barrier, gradually reduce
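A sketch of that outer loop, mirroring the penalty version above: solve with a large barrier coefficient, shrink it each phase so the solution can approach the boundary, and normalize at the end. It reuses gradient-descent-nd and the numerical-gradient sketch from the previous subsection; a production barrier (interior-point) method would also use a line search to keep every iterate strictly inside w > 0, which plain gradient descent does not guarantee.
;; Barrier method outer loop: shrink the barrier coefficient each phase
(define (optimize-portfolio-barrier mu sigma initial-weights)
  (let ((barrier-coef 1.0)
        (current-weights initial-weights))
    (for (phase (range 0 5))
      (let ((result
             (gradient-descent-nd
              (lambda (w) (barrier-objective w mu sigma barrier-coef))
              (lambda (w) (numerical-gradient
                           (lambda (w2) (barrier-objective w2 mu sigma barrier-coef))
                           w))
              current-weights
              0.01
              100
              0.001)))
        (set! current-weights (result :optimum)))
      ;; Reduce the barrier so the solution may move closer to the w = 0 boundary
      (set! barrier-coef (* barrier-coef 0.1)))
    ;; Normalize so weights sum to 1 (the barrier only enforces w > 0)
    (let ((total (sum current-weights)))
      (map current-weights (lambda (w) (/ w total))))))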
7.6.3 Lagrange Multipliers: Analytical Solution
For equality constraints, Lagrange multipliers provide analytical solutions.
Example: Minimum variance portfolio
$$\begin{aligned} \min_w \quad & w^T \Sigma w \\ \text{s.t.} \quad & \mathbf{1}^T w = 1 \end{aligned}$$
Lagrangian:
$$\mathcal{L}(w, \lambda) = w^T \Sigma w + \lambda(\mathbf{1}^T w - 1)$$
Optimality conditions:
$$\nabla_w \mathcal{L} = 2\Sigma w + \lambda \mathbf{1} = 0$$
Solution:
$$w^* = -\frac{\lambda}{2} \Sigma^{-1} \mathbf{1}$$
Apply constraint $\mathbf{1}^T w = 1$:
$$\lambda = -\frac{2}{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}$$
Final formula:
$$w^* = \frac{\Sigma^{-1} \mathbf{1}}{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}$$
;; Minimum variance portfolio (analytical solution)
(define (minimum-variance-portfolio sigma)
(let ((sigma-inv (matrix-inverse sigma))
(ones (make-array (length sigma) 1)))
(let ((sigma-inv-ones (matrix-vec-mult sigma-inv ones)))
(let ((denom (dot-product ones sigma-inv-ones)))
;; w* = Σ⁻¹ 1 / (1^T Σ⁻¹ 1)
(map sigma-inv-ones (lambda (x) (/ x denom)))))))
;; Example
(define min-var-w (minimum-variance-portfolio sigma))
;; Minimizes portfolio variance subject to full investment
7.7 Practical Applications
7.7.1 Walk-Forward Optimization
Problem: In-sample optimization overfits.
Solution: Walk-forward analysis simulates realistic deployment:
- Train: Optimize parameters on historical window (e.g., 1 year)
- Test: Apply optimized parameters to out-of-sample period (e.g., 1 month)
- Roll forward: Shift window, repeat
;; Walk-forward optimization
;; Parameters:
;; prices: full price history
;; train-period: number of periods for training
;; test-period: number of periods for testing
;; param-ranges: parameter search space
(define (walk-forward-optimize prices train-period test-period param-ranges)
(let ((n (length prices))
(results []))
;; Rolling window: train, then test
(for (i (range train-period (- n test-period) test-period))
;; Train on [i - train_period, i]
(let ((train-prices (slice prices (- i train-period) i)))
(let ((optimal-params
(genetic-algorithm
(lambda (p) (sma-strategy-fitness p train-prices))
param-ranges
30 ;; Small population (fast)
50))) ;; Few generations
;; Test on [i, i + test_period]
(let ((test-prices (slice prices i (+ i test-period))))
(let ((test-sharpe (sma-strategy-fitness optimal-params test-prices)))
(set! results (append results
{:train-end i
:params optimal-params
:test-sharpe test-sharpe})))))))
results))
;; Aggregate out-of-sample performance
(define (walk-forward-summary results)
{:avg-sharpe (average (map results (lambda (r) (r :test-sharpe))))
:periods (length results)})
;; Example
(define wf-results
(walk-forward-optimize historical-prices 252 21 param-ranges))
(define wf-summary (walk-forward-summary wf-results))
;; avg-sharpe: realistic estimate (not overfitted)
Walk-Forward vs In-Sample:
| Metric | In-Sample Optimized | Walk-Forward |
|---|---|---|
| Sharpe Ratio | 2.5 (optimistic) | 1.2 (realistic) |
| Max Drawdown | 15% | 28% |
| Win Rate | 65% | 52% |
In-sample results are always better—that’s overfitting!
7.7.2 Kelly Criterion: Optimal Position Sizing
Kelly Criterion: Maximize geometric growth rate by betting a fraction of capital proportional to edge.
Formula:
$$f^* = \frac{p \cdot b - q}{b}$$
where:
- $p$ = win probability
- $q = 1 - p$ = loss probability
- $b$ = win/loss ratio (payoff when win / loss when lose)
- $f^*$ = fraction of capital to risk
;; Kelly fraction calculation
(define (kelly-fraction win-prob win-loss-ratio)
(let ((lose-prob (- 1 win-prob)))
(/ (- (* win-prob win-loss-ratio) lose-prob)
win-loss-ratio)))
;; Example: 55% win rate, 2:1 win/loss ratio
(define kelly-f (kelly-fraction 0.55 2.0))
;; → 0.325 (risk 32.5% of capital per trade)
;; Fractional Kelly (reduce risk)
(define (fractional-kelly win-prob win-loss-ratio fraction)
(* fraction (kelly-fraction win-prob win-loss-ratio)))
;; Half-Kelly: more conservative
(define half-kelly (fractional-kelly 0.55 2.0 0.5))
;; → 0.1625 (risk 16.25%)
Why fractional Kelly?
Full Kelly maximizes growth but has high volatility. Half-Kelly sacrifices some growth for much lower drawdowns.
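To quantify that trade-off, the expected log growth per trade is $g(f) = p \ln(1 + f b) + q \ln(1 - f)$, which full Kelly maximizes. The sketch below evaluates it for the example above; the numeric results are approximate.
;; Expected log growth per trade when risking fraction f of capital
(define (kelly-growth-rate f win-prob win-loss-ratio)
  (let ((lose-prob (- 1 win-prob)))
    (+ (* win-prob (log (+ 1 (* f win-loss-ratio))))
       (* lose-prob (log (- 1 f))))))
(kelly-growth-rate 0.325 0.55 2.0)   ;; ≈ 0.099 per trade (full Kelly: maximum growth)
(kelly-growth-rate 0.1625 0.55 2.0)  ;; ≈ 0.075 per trade (half Kelly: ~3/4 of the growth at half the bet size)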
7.7.3 Transaction Cost Optimization
Problem: Optimal theoretical portfolio may be unprofitable after transaction costs.
Solution: Explicitly model costs in objective.
;; Portfolio optimization with transaction costs
(define (portfolio-with-costs mu sigma current-weights transaction-cost-rate)
;; Optimal weights (ignoring costs)
(let ((optimal-weights (markowitz-portfolio mu sigma 2.0)))
;; Calculate turnover (sum of absolute changes)
(let ((turnover
(sum (map (range 0 (length current-weights))
(lambda (i)
(abs (- (nth optimal-weights i)
(nth current-weights i))))))))
;; Expected return after costs
(let ((gross-return (dot-product optimal-weights mu))
(costs (* transaction-cost-rate turnover)))
;; Only rebalance if net return > current return
(if (> (- gross-return costs)
(dot-product current-weights mu))
optimal-weights ;; Rebalance
current-weights))))) ;; Don't trade
;; Example
(define current-w [0.3 0.4 0.3])
(define new-w (portfolio-with-costs mu sigma current-w 0.001))
;; If turnover*0.1% < gain, rebalance; else hold
7.8 Key Takeaways
Algorithm Selection Framework:
| Problem Type | Recommended Method | Rationale |
|---|---|---|
| Smooth, differentiable | Adam optimizer | Fast, robust |
| Convex (portfolio) | QP solver (CVXPY) | Guaranteed global optimum |
| Discrete parameters | Genetic algorithm | Handles integers naturally |
| Expensive objective | Bayesian optimization | Sample-efficient (100 evals) |
| Multi-objective | NSGA-II | Finds Pareto frontier |
| Non-smooth | Simulated annealing | Escapes local minima |
Common Pitfalls:
- Overfitting: In-sample optimization ≠ future performance → use walk-forward
- Ignoring costs: Theoretical optimum may be unprofitable after fees
- Parameter instability: Optimal parameters change over time → re-optimize periodically
- Curse of dimensionality: Grid search fails beyond 3 parameters → use Bayesian optimization
- Local optima: Gradient descent gets stuck → use multiple random starts or GA
Performance Benchmarks (2-parameter problem):
| Method | Time | Evaluations |
|---|---|---|
| Grid Search | 1 min | 10,000 |
| Random Search | 5 sec | 1,000 |
| Genetic Algorithm | 30 sec | 5,000 |
| Simulated Annealing | 20 sec | 2,000 |
| Bayesian Optimization | 20 sec | 100 |
| Gradient Descent | 1 sec | 50 |
For 10-parameter problem:
| Method | Time |
|---|---|
| Grid Search | Infeasible (10^10 evals) |
| Random Search | 50 sec |
| Genetic Algorithm | 5 min |
| Bayesian Optimization | 3 min |
Next Steps:
Chapter 8 (Risk Management) uses these optimization techniques to:
- Optimize position sizes (Kelly criterion)
- Minimize portfolio drawdown (constrained optimization)
- Calibrate risk models (maximum likelihood estimation)
The optimization skills you’ve learned here are foundational—every quantitative trading decision involves optimization.
Further Reading
- Nocedal, J., & Wright, S. (2006). Numerical Optimization (2nd ed.). Springer.
  - The definitive reference for gradient-based methods—comprehensive and rigorous.
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  - Free online: https://web.stanford.edu/~boyd/cvxbook/
  - Beautiful treatment of convex optimization with applications.
- Deb, K. (2001). Multi-Objective Optimization using Evolutionary Algorithms. Wiley.
  - Deep dive into genetic algorithms and multi-objective optimization.
- Bergstra, J., & Bengio, Y. (2012). “Random Search for Hyper-Parameter Optimization”. Journal of Machine Learning Research, 13, 281-305.
  - Empirical evidence that random search beats grid search.
- Pardo, R. (2008). The Evaluation and Optimization of Trading Strategies (2nd ed.). Wiley.
  - Practical guide to walk-forward optimization and robustness testing.
- Mockus, J. (2012). Bayesian Approach to Global Optimization. Springer.
  - Theoretical foundations of Bayesian optimization.
Navigation:
- ← Chapter 6: Stochastic Processes
- → Chapter 8: Risk Management (when available)