SVM
For details about SVM, please read Andrew Ng's notes or Hang Li's book *Statistical Learning Methods* (in Chinese).
Why do we solve the dual problem instead of the primal problem in SVM?
- In the primal problem, the complexity of the optimization algorithm is related to the feature dimensionality $d$; in the dual problem, it is related to the number of samples $n$ in the training set. If it is a linear classification problem and $d$ is small relative to $n$, we can solve the primal problem directly, or use SGD.
- The dual problem is easier to work with when the data is not linearly separable, because we can apply the kernel trick.
- When predicting, if by optimizing the dual problem we obtain the multipliers $\alpha$, then the prediction can be written as $f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b\right)$. That means we only need to calculate the inner products between the test point and the training data. Moreover, the $\alpha_i$'s of samples that are not support vectors are zero, so those terms drop out of the sum.
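For concreteness, here is a minimal sketch of that prediction rule (the names `alphas`, `X_train`, `y_train`, and `b` are illustrative assumptions, not notation from the notes):

```python
import numpy as np

def svm_predict(x, X_train, y_train, alphas, b):
    """Dual-form prediction: sign(sum_i alpha_i * y_i * <x_i, x> + b)."""
    # Only support vectors (alpha_i > 0) contribute to the sum,
    # so all other training points can be dropped at prediction time.
    sv = alphas > 1e-8
    score = np.sum(alphas[sv] * y_train[sv] * (X_train[sv] @ x)) + b
    return np.sign(score)
```

With a kernel, the inner products `X_train[sv] @ x` would simply be replaced by $K(x_i, x)$.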
[1] Andrew Ng's notes
Why use a kernel function in SVM?
The kernel trick avoids the explicit mapping that is needed to let linear learning algorithms learn a nonlinear function or decision boundary. In SVM, if we want to map our features into a higher-dimensional space, we need a mapping function $\phi$ to do that; in that higher-dimensional feature space, we then find a linear boundary to classify the data. In the dual problem, the objective only involves inner products $\langle x, z \rangle$, so we define the kernel as $K(x, z) = \langle \phi(x), \phi(z) \rangle$. The interesting point is that $K(x, z)$ is often very inexpensive to calculate, even though $\phi(x)$ itself may be very expensive to calculate (perhaps because it is an extremely high-dimensional vector). Given a kernel function $K(x, z)$, the SVM can learn in the high-dimensional feature space given by $\phi$, without ever having to explicitly compute or represent the vectors $\phi(x)$.
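As a small illustration (the quadratic kernel and explicit map used here are standard textbook examples, not taken from this text): for $K(x, z) = (x^\top z)^2$, the explicit map $\phi(x)$ consists of all $d^2$ pairwise products $x_i x_j$, so computing $\langle \phi(x), \phi(z) \rangle$ costs $O(d^2)$, while $K(x, z)$ costs only $O(d)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.standard_normal(1000), rng.standard_normal(1000)

def phi(v):
    """Explicit feature map for K(x, z) = (x . z)^2: all d^2 pairwise products."""
    return np.outer(v, v).ravel()           # O(d^2) time and memory

explicit = phi(x) @ phi(z)                  # inner product in 1,000,000 dimensions
implicit = np.inner(x, z) ** 2              # same value, computed in O(d)
assert np.isclose(explicit, implicit)
```

Both computations give exactly the same number; the kernel just never materializes the $d^2$-dimensional vectors.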
Advantages and disadvantages of SVM
Advantages:
- Performs well on small data sets.
- Good generalization ability.
- No local minima: training is a convex optimization problem, so any local optimum is also the global optimum.
- Handles high-dimensional data well.
- Supported by a solid theoretical foundation.
Disadvantages:
- Requires careful kernel selection and parameter tuning (see the sketch after this list).
- Originally designed for binary classification; multi-class problems require extensions such as one-vs-one or one-vs-rest.
- High algorithmic complexity and extensive memory requirements of quadratic programming in large-scale tasks.
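Kernel selection and hyperparameter tuning are usually handled by cross-validated grid search in practice; a minimal scikit-learn sketch (the synthetic data and parameter grid are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)  # placeholder data
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],            # soft-margin penalty
    "gamma": ["scale", 0.1, 1],   # RBF kernel width (ignored by the linear kernel)
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_)
```

Note that scikit-learn's `SVC` already wraps the binary SVM in a one-vs-one scheme for multi-class problems.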
What is SMO (sequential minimal optimization)?
SMO is an iterative algorithm for solving the SVM dual problem. It breaks this problem into a series of smallest possible sub-problems, which are then solved analytically. Because of the linear equality constraint involving the Lagrange multipliers $\alpha_i$, the smallest possible problem involves two such multipliers. Then, for any two multipliers $\alpha_1$ and $\alpha_2$, the constraints are reduced to:

$$0 \le \alpha_1, \alpha_2 \le C, \qquad y_1 \alpha_1 + y_2 \alpha_2 = k,$$

and this reduced problem can be solved analytically: one needs to find the minimum of a one-dimensional quadratic function. Here $k$ is the negative of the sum over the rest of the terms in the equality constraint, and is constant within each iteration.
Steps:
- Find a Lagrange multiplier $\alpha_1$ that violates the KKT conditions for the optimization problem.
- Pick a second multiplier $\alpha_2$ and optimize the pair (heuristically, pick the two that allow the biggest progress towards the global maximum).
- Jointly optimize the objective with respect to $\alpha_1$ and $\alpha_2$, while holding all the other $\alpha_i$'s fixed.
- Repeat these steps until convergence.
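Below is a sketch of these steps following a *simplified* SMO variant, which picks the second multiplier at random instead of using the maximal-progress heuristic above; the function name, the linear kernel, and all parameters are illustrative assumptions, not the full algorithm from Platt's paper:

```python
import numpy as np

def smo_simplified(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Simplified SMO for the SVM dual problem with a linear kernel."""
    n = len(y)
    K = X @ X.T                       # precomputed kernel (Gram) matrix
    alpha, b = np.zeros(n), 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]   # prediction error on x_i
            # Step 1: does alpha_i violate the KKT conditions (within tol)?
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                # Step 2: pick a second multiplier (at random in this variant).
                j = np.random.choice([m for m in range(n) if m != i])
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                # Box bounds keeping 0 <= alpha <= C and the equality constraint.
                if y[i] != y[j]:
                    L, H = max(0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]  # curvature of the 1-D objective
                if L == H or eta >= 0:
                    continue
                # Step 3: analytic minimum of the one-dimensional quadratic, clipped.
                alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])
                # Update the threshold b from whichever multiplier is unbounded.
                b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] \
                     - y[j] * (alpha[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] \
                     - y[j] * (alpha[j] - a_j_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

A full implementation would add the second-choice heuristic and the error cache described by Platt, which are what make SMO fast in practice.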