SVM

For details about SVM, please read Andrew Ng's notes or Hang Li's statistical learning book (in Chinese).

Why do we solve the dual problem instead of the primal problem in SVM?

  • The complexity of the optimization algorithm is related to the feature dimensionality $D$ in the primal problem; in the dual problem, it is related to the number of samples in the training set, $N$. For a linear classification problem with $D \leq N$, we could solve the primal problem directly (e.g., with SGD).
  • The dual problem is easier to optimize when the data is not linearly separable, because we can apply the kernel trick.
  • When predicting, if we optimize the dual problem and obtain the $a_i$'s, the prediction for a test point $x_j$ can be written as: $w^T x_j + b = (\sum_i a_i y_i x_i)^T x_j + b = \sum_i a_i y_i \langle x_i, x_j \rangle + b \quad (1)$. That means we only need to calculate the inner products between the test data and the training data. Moreover, the $a_i$'s will be zero for every training point that is not a support vector (see the sketch after this list).
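As a concrete illustration (not part of the notes above), here is a minimal sketch of dual-form prediction following equation (1), assuming the dual coefficients `alpha` and bias `b` have already been learned and that `K` is any kernel (a plain dot product in the linear case):

```python
import numpy as np

def predict(x_test, X_train, y_train, alpha, b, K=np.dot):
    """Sign of the SVM decision function computed from the dual solution.

    alpha : dual coefficients a_i (zero for every non-support vector)
    b     : bias term
    K     : kernel / inner-product function; np.dot gives the linear case
    """
    # Only support vectors (a_i > 0) contribute, so the sum is usually short.
    sv = alpha > 1e-8
    decision = sum(a * y * K(x_i, x_test)
                   for a, y, x_i in zip(alpha[sv], y_train[sv], X_train[sv])) + b
    return np.sign(decision)
```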

[1] Andrew Ng's notes

Why use a kernel function in SVM?

The kernel trick avoids the explicit mapping that would be needed for a linear learning algorithm to learn a nonlinear function or decision boundary. In SVM, if we want to map our features into a higher-dimensional space, we need a mapping function $\phi$ to do that; in that higher-dimensional feature space we then find a linear boundary to classify the data. Since our dual problem only involves inner products, as in equation (1), we define the kernel as $K(x,z) = \phi(x)^T \phi(z)$. The interesting thing is that $K(x,z)$ is often very inexpensive to calculate, even though $\phi(x)$ itself may be very expensive to calculate (perhaps because it is an extremely high-dimensional vector). Given a kernel function $K(x,z)$, we can get the SVM to learn in the high-dimensional feature space given by $\phi$ without ever having to explicitly find or represent the vectors $\phi(x)$.
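For example (a standard illustration; the specific feature map below is chosen for clarity rather than taken from the notes): in two dimensions, the kernel $K(x,z) = (x^T z)^2$ corresponds to the feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, so the cheap kernel computation and the expensive explicit mapping give the same number:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2-D quadratic kernel K(x, z) = (x^T z)^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

cheap = np.dot(x, z) ** 2           # kernel: one inner product, then a square
expensive = np.dot(phi(x), phi(z))  # map into the higher-dimensional space first

print(cheap, expensive)  # both print 121.0
```

In $n$ dimensions the same kernel corresponds to a feature map with $O(n^2)$ coordinates, yet evaluating $K(x,z)$ still costs only $O(n)$.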

[1] Andrew Ng's notes, pages 14-15

Advantage and disadvantage of SVM

Advantage

  • Good performance on small datasets.
  • Good generalization ability.
  • No local minima (the dual is a convex optimization problem, so a local optimum is also the global optimum).
  • Handles high-dimensional data well.
  • Backed by a solid theoretical foundation.

Disadvantage:

  • Kernel selection and parameter tuning.
  • Originally designed for binary classification; multi-class problems require extensions (e.g., one-vs-rest).
  • High algorithmic complexity and extensive memory requirements of quadratic programming in large-scale tasks.

What's SMO (sequential minimal optimization)?

SMO is an iterative algorithm for solving the SVM dual problem. SMO breaks this problem into a series of smallest possible sub-problems, which are then solved analytically. Because of the linear equality constraint involving the Lagrange multipliers $\alpha_i$, the smallest possible problem involves two such multipliers. For any two multipliers $\alpha_1$ and $\alpha_2$, the constraints are reduced to:

$0 \leq \alpha_1, \alpha_2 \leq C$

$y_1\alpha_1 + y_2\alpha_2 = k$

and this reduced problem can be solved analytically: one needs to find the minimum of a one-dimensional quadratic function. Here $k$ is the negative of the sum over the remaining terms in the equality constraint of the dual problem, which is constant within each sub-problem.
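For reference (the exact formulas are not spelled out above; this follows the standard SMO derivation in Platt's paper and the CS229 SMO handout), the minimizer of that one-dimensional quadratic has a closed form:

$\eta = 2K(x_1, x_2) - K(x_1, x_1) - K(x_2, x_2), \qquad \alpha_2^{\text{new}} = \alpha_2 - \dfrac{y_2 (E_1 - E_2)}{\eta}$

where $E_i = f(x_i) - y_i$ is the prediction error on example $i$. The new $\alpha_2$ is then clipped to the interval $[L, H]$ implied by the two constraints above, and $\alpha_1$ is recovered from $y_1\alpha_1 + y_2\alpha_2 = k$.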

Steps:

  1. Find a Lagrange multiplier $\alpha_1$ that violates the KKT conditions for the optimization problem.
  2. Pick a second multiplier $\alpha_2$; ideally pick the pair that allows us to make the biggest progress towards the global maximum.
  3. Optimize $W(\alpha)$ with respect to $\alpha_1$ and $\alpha_2$, while holding all the other $\alpha_k$'s $(k \neq 1, 2)$ fixed.
  4. Repeat steps 1-3 until convergence (a compressed sketch follows these steps).
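Below is a compressed sketch of these steps, modeled on the simplified SMO variant from the CS229 materials; the second multiplier is chosen at random here instead of by the maximal-progress heuristic, and the variable names are mine:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Solve the SVM dual with a simplified SMO loop (linear kernel).

    X: (N, D) training data, y: (N,) labels in {-1, +1}.
    Returns the dual coefficients alpha and the bias b.
    """
    N = X.shape[0]
    K = X @ X.T                      # linear kernel matrix; swap in any kernel here
    alpha, b = np.zeros(N), 0.0
    f = lambda i: np.sum(alpha * y * K[:, i]) + b   # decision value for example i

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            E_i = f(i) - y[i]
            # Step 1: alpha_i violates the KKT conditions (within tolerance tol)
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                # Step 2: pick a second multiplier (randomly here, not max-progress)
                j = np.random.choice([k for k in range(N) if k != i])
                E_j = f(j) - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Box [L, H] implied by 0 <= alpha <= C and the equality constraint
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                # Step 3: analytic minimum of the 1-D quadratic, then clip to [L, H]
                alpha[j] = np.clip(aj_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # Update the threshold b (standard SMO rule)
                b1 = b - E_i - y[i] * (alpha[i] - ai_old) * K[i, i] - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - ai_old) * K[i, j] - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        # Step 4: repeat until no multipliers change for several consecutive passes
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

Platt's full algorithm replaces the random choice of the second multiplier with a heuristic that roughly maximizes $|E_1 - E_2|$ and maintains an error cache, which is what makes it fast in practice.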

[1] http://cs229.stanford.edu/notes/cs229-notes3.pdf

[2] https://en.wikipedia.org/wiki/Support_vector_machine
