Logistic Regression
Logistic function or Sigmoid function:
$g(x) = \frac{1}{1 + e^{-x}}$, which takes any $x \in \mathbb{R}$ and outputs a value between 0 and 1.
Derivative of sigmoid function:
$g'(x) = g(x)\,(1 - g(x))$
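As a quick illustration, here is a minimal NumPy sketch of the sigmoid and its derivative (the function names are my own, not from any particular library):

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # g'(x) = g(x) * (1 - g(x))
    g = sigmoid(x)
    return g * (1.0 - g)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, the maximum of the derivative
```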
If we let $t = \beta_0 + \beta_1 x$, then the logistic function can be written as $F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$. Note that $F(x)$ is interpreted as the probability of the dependent variable being a "success" or "case" rather than a failure or non-case.
[1] Reference In Chinese
Why is logistic regression a linear model?
The logit of a number $p$ between 0 and 1 is given by $\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$. Therefore the logit of $F(x)$ can be written as
$$\operatorname{logit}(F(x)) = \ln\left(\frac{F(x)}{1 - F(x)}\right) = \beta_0 + \beta_1 x$$
It's called a generalized linear model not because the estimated probability of the response event is a linear function of the inputs, but because the logit of the estimated response probability is a linear function of the parameters.
More generally, a Generalized Linear Model is of the form
$$g(\mu_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$$
where $\mu_i$ is the expected value of the response given the covariates.
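As a hedged illustration of this view, here is a minimal sketch that fits logistic regression as a GLM with a binomial family (logit link), assuming statsmodels is available; the synthetic data and variable names are made up for the example:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
# assumed "true" model for the toy data: logit(p) = -0.5 + 2.0 * x
p = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * X[:, 0])))
y = rng.binomial(1, p)

# Binomial family with its default logit link reproduces logistic regression
model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
result = model.fit()
print(result.params)  # estimates of (beta_0, beta_1)
```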
[1]https://en.wikipedia.org/wiki/Logit
[2]https://en.wikipedia.org/wiki/Logistic_regression
[3]https://stats.stackexchange.com/questions/88603/why-is-logistic-regression-a-linear-model
Why isn't logistic regression called logistic classification?
Logistic regression was called regression long before terms like supervised learning came along, and regression does not, in general, imply continuous outcomes.
Logistic regression is not a classification algorithm on its own. It becomes a classifier only in combination with a decision rule (e.g., predict $y=1$ if $g(x) > 0.5$, else $y=0$).
In addition, logistic regression technically operates on a continuous quantity: it is a regression on the logit of the event, that is $\log\left(\frac{p}{1-p}\right)$, where $p$ is the probability of the event. Logistic regression is a regression model because it estimates the probability of class membership as a (transformation of a) multilinear function of the features.
[1] https://www.quora.com/Why-is-logistic-regression-called-regression-if-it-doesnt-model-continuous-outcomes
[2] https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification
Why does logistic regression use the sigmoid function? Can we use others?
Yes, we can use others: any smooth function that maps the real line into $(0, 1)$ could serve (e.g., the probit function). But the sigmoid has some beautiful properties:
- The sigmoid is bounded between 0 and 1, and its derivative is easy to calculate.
- It's a simple way to introduce non-linearity into the model.
Other motivations:
The sigmoid outputs the conditional probability of the prediction. The odds are defined as $\frac{p}{1-p}$, the ratio between the probability that an event occurs and the probability that it doesn't. Taking the natural log of the odds, we get the logit function:
$$\operatorname{logit}(p(y=1 \mid X)) = \log\left(\frac{p}{1-p}\right)$$
We can use this log transformation to model the relationship between our predictor variables and the target variable:
$$\operatorname{logit}(p(y=1 \mid X)) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots$$
Since we're interested in $p(y=1 \mid X)$, we take the inverse of the logit function and get the logistic sigmoid:
$$p(y=1 \mid x) = \operatorname{logit}^{-1}(\beta_0 + \beta_1 x_1 + \dots) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots)}}$$
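A quick numeric sanity check that the sigmoid is the inverse of the logit (a sketch; function names are my own):

```python
import numpy as np

def logit(p):
    # log-odds of a probability p in (0, 1)
    return np.log(p / (1.0 - p))

def sigmoid(t):
    # inverse of the logit: maps log-odds back to a probability
    return 1.0 / (1.0 + np.exp(-t))

p = np.array([0.1, 0.5, 0.9])
print(sigmoid(logit(p)))  # recovers [0.1, 0.5, 0.9]
```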
[1] https://www.quora.com/Logistic-Regression-Why-sigmoid-function
How to update (estimate) logistic regression parameters? [THIS IS LONG]
Short answer: MLE, optimized with SGD, Newton's method, (L-)BFGS, ADMM, ...
Long answer:
Suppose the samples are generated by a Bernoulli process, so the labels (0/1) follow a Bernoulli distribution. Let the probability of 1 be $h_\theta(x)$; then the probability of 0 is $1 - h_\theta(x)$.
For the i-th sample, the probability can be written as:
$$P(y^{(i)}=1 \mid x^{(i)};\theta) = h_\theta(x^{(i)}), \qquad P(y^{(i)}=0 \mid x^{(i)};\theta) = 1 - h_\theta(x^{(i)})$$
Combining the two cases, the probability of the observed label for the $i$-th sample is:
$$P(y^{(i)} \mid x^{(i)};\theta) = h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
Since we assume all the samples are generated i.i.d., the likelihood over all $N$ samples can be expressed as:
$$L(\theta) = P(Y \mid X; \theta) = \prod_{i=1}^{N} h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
To estimate the parameters $\theta$, we take the log of the above expression and obtain the log-likelihood function:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
This might look familiar. Right! Maximizing the log-likelihood is equivalent to minimizing the cross-entropy error.
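To make the connection concrete, here is a minimal sketch of the cross-entropy loss (the negative average log-likelihood); the function names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    # negative average log-likelihood of logistic regression
    # (in practice h may be clipped away from 0 and 1 for numerical stability)
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```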
For now, ignore the summation and consider a single sample $i$; taking the derivative with respect to each parameter $\theta_j$:
$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \left( y^{(i)} \frac{1}{h_\theta(x)} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x)} \right) \frac{\partial h_\theta(x)}{\partial \theta_j}$$
$$= \frac{y^{(i)}\left(1 - h_\theta(x)\right) - (1 - y^{(i)})\, h_\theta(x)}{h_\theta(x)\left(1 - h_\theta(x)\right)} \, h_\theta(x)\left(1 - h_\theta(x)\right) \frac{\partial\, \theta^T x}{\partial \theta_j}$$
$$= \left( y^{(i)} - h_\theta(x) \right) x_j$$
Then, we can update the parameters by SGD:
$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$
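A minimal SGD sketch based on this update rule; the learning rate, epoch count, and function names are illustrative choices, not part of the original derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            # theta_j := theta_j + lr * (y_i - h_theta(x_i)) * x_ij
            error = y[i] - sigmoid(X[i] @ theta)
            theta += lr * error * X[i]
    return theta
```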
Solvers:
Given the logistic loss function (the negative log-likelihood):
$$\min_w J(w) = \min_w \left( -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_w(x_i) + (1 - y_i) \log\left(1 - h_w(x_i)\right) \right] \right)$$
We use $g$ and $H$ to denote the first-order gradient and the Hessian matrix, respectively. For a single sample $(x^{(i)}, y^{(i)})$, we have:
$$g_j = \frac{\partial J(w)}{\partial w_j} = \frac{y^{(i)}}{h_w(x^{(i)})} h_w(x^{(i)})\left(1 - h_w(x^{(i)})\right)\left(-x_j^{(i)}\right) + \left(1 - y^{(i)}\right) \frac{1}{1 - h_w(x^{(i)})} h_w(x^{(i)})\left(1 - h_w(x^{(i)})\right) x_j^{(i)} = \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
$$H_{mn} = \frac{\partial^2 J(w)}{\partial w_m \partial w_n} = h_w(x^{(i)}) \left(1 - h_w(x^{(i)})\right) x_m^{(i)} x_n^{(i)}$$
SGD
We use the gradient to find a direction that reduces the loss: $w_j^{k+1} = w_j^{k} - \alpha g_j$, where $k$ is the iteration number. After each update, we can compare $J(w^{k+1}) - J(w^{k})$ or $\|w^{k+1} - w^{k}\|$ with some threshold $\epsilon$ to decide when to stop.
Newton's method
The basic idea of Newton's method is to take a second-order Taylor expansion of the objective around the current estimate and use it to obtain the next estimate.
Suppose $w^k$ is the current estimate. Then
$$\varphi(w) = J(w^k) + J'(w^k)(w - w^k) + \frac{1}{2} J''(w^k)(w - w^k)^2$$
Setting $\varphi'(w) = 0$ gives $w = w^k - \frac{J'(w^k)}{J''(w^k)}$, which yields the update rule:
$$w^{k+1} = w^k - \frac{J'(w^k)}{J''(w^k)} = w^k - H_k^{-1} g_k$$
We need a threshold $\epsilon$: when $\|g_k\| < \epsilon$, we stop updating. This method also requires that the second-order derivatives of the objective $J(w)$ exist.
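A self-contained sketch of Newton's method applied to the logistic loss above, assuming a design matrix X with an intercept column already included; the names and stopping constants are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic_regression(X, y, tol=1e-6, max_iter=50):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(max_iter):
        h = sigmoid(X @ w)
        grad = X.T @ (h - y) / n_samples            # gradient of J(w)
        if np.linalg.norm(grad) < tol:
            break
        # Hessian: X^T diag(h * (1 - h)) X / n_samples
        # (a small ridge term could be added if H is near-singular)
        H = (X * (h * (1 - h))[:, None]).T @ X / n_samples
        w = w - np.linalg.solve(H, grad)            # w_{k+1} = w_k - H^{-1} g_k
    return w
```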
BFGS
Issues with Newton's method:
- Computing $H_k^{-1}$ is expensive and computation-intensive.
- $H_k$ might not be positive (semi-)definite.
Therefore, we use quasi-Newton methods, which approximate $H_k$ (or its inverse) instead of computing it exactly.
Secant condition
After $k+1$ iterations, the Taylor expansion around $x_{k+1}$ is
$$f(x) \approx f(x_{k+1}) + f'(x_{k+1})(x - x_{k+1}) + \frac{1}{2} f''(x_{k+1})(x - x_{k+1})^2$$
Take a derivative,
$$f'(x) \approx f'(x_{k+1}) + H_{k+1}(x - x_{k+1})$$
Let $x = x_k$; then we have
$$g_{k+1} - g_k \approx H_{k+1}(x_{k+1} - x_k)$$
Let $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$; then we have the secant condition:
$$y_k = H_{k+1} s_k \qquad (1)$$
Method
To avoid computing $H_k$ (and $H_k^{-1}$) exactly, we maintain an approximation $B_k \approx H_k$ and update it with a rule of the form:
$$B_{k+1} = B_k + \Delta B_k \qquad (2)$$
We typically initialize $B_0$ to the identity matrix $I$, and assume $\Delta B_k$ has the following structure:
$$\Delta B_k = \alpha u u^T + \beta v v^T$$
Substituting this into the secant condition (1), we have
$$y_k = B_k s_k + (\alpha u^T s_k) u + (\beta v^T s_k) v$$
Setting $\alpha u^T s_k = 1$ and $\beta v^T s_k = -1$, and choosing $u = y_k$ and $v = B_k s_k$, we obtain:
$$\alpha = \frac{1}{y_k^T s_k}, \qquad \beta = -\frac{1}{s_k^T B_k s_k}$$
Finally, we substitute $\alpha$ and $\beta$ into (2) and get the update equation for $B_{k+1}$:
$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k^T}{s_k^T B_k s_k}$$
Algorithm
1. Initialize $x_0$ and a threshold $\epsilon$; set $B_0 = I$, $k = 0$.
2. Compute the search direction $d_k = -B_k^{-1} g_k$.
3. Perform a one-dimensional optimization (line search) to find an acceptable step size $\lambda_k = \arg\min_{\lambda} f(x_k + \lambda d_k)$.
4. Set $s_k = \lambda_k d_k$ and $x_{k+1} = x_k + s_k$.
5. If $\|g_{k+1}\| < \epsilon$, the algorithm terminates.
6. Compute $y_k = g_{k+1} - g_k$.
7. Update $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k^T}{s_k^T B_k s_k}$, set $k \leftarrow k + 1$, and go back to step 2.
Note: in step 2 we need $B_k^{-1}$, which can be computed efficiently using the Sherman-Morrison formula.
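In practice one usually relies on a library implementation; as a hedged sketch, the logistic loss above can be handed to SciPy's BFGS solver (assuming SciPy is available; the toy data and function names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, X, y):
    # negative average log-likelihood and its gradient
    h = sigmoid(X @ w)
    loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / len(y)
    return loss, grad

# toy data: two Gaussian blobs, with an intercept column added
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
X = np.hstack([np.ones((100, 1)), X])
y = np.array([0] * 50 + [1] * 50)

result = minimize(loss_and_grad, x0=np.zeros(3), args=(X, y),
                  jac=True, method='BFGS')
print(result.x)  # fitted weights (intercept first)
```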
[1] Hessian Matrix of Logistic function
[2] BFGS wiki
Can logistic regression work on data that may not be separable by a linear boundary?
Yes. Using the kernel trick (or an explicit feature map), we can project the data into a higher-dimensional feature space in which a separating hyperplane may exist.
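One hedged way to sketch this with scikit-learn is to approximate an RBF kernel feature map explicitly and fit an ordinary logistic regression on top (assuming scikit-learn is available; the dataset and parameters are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# map into a higher-dimensional space with an approximate RBF kernel,
# then fit a plain (linear) logistic regression in that space
model = make_pipeline(
    Nystroem(kernel='rbf', gamma=1.0, n_components=100, random_state=0),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
print(model.score(X, y))  # should be well above what a linear boundary achieves
```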
What's the relationship between logistic regression and Gaussian Naive Bayes?
Under certain conditions, they learn the same model.
In Gaussian Naive Bayes, the posterior is:
$$p(y \mid x) = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}$$
Generally, we assume that $P(x \mid y)$ is Gaussian and $P(y)$ is multinomial; the parameters can then be estimated by MLE. If we consider only a binary classification problem, the log-odds can be written as:
$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log \frac{P(x \mid y=1)}{P(x \mid y=0)} + \log \frac{P(y=1)}{P(y=0)} = -\frac{(x - \mu_1)^2}{2\sigma_1^2} + \frac{(x - \mu_0)^2}{2\sigma_0^2} + \theta_0$$
If we assume $\sigma_1 = \sigma_0$, the quadratic terms cancel and the log-odds become linear in $x$:
$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \theta^T x$$
Solving for the posterior, we get:
$$P(y=1 \mid x) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}} = \frac{1}{1 + e^{-\theta^T x}}$$
This is the same as logistic regression.
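A small empirical illustration of this relationship, assuming scikit-learn is available: fit both models on data whose class-conditional Gaussians share a variance, and the predicted posteriors should nearly coincide (the setup and names are mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# two classes drawn from Gaussians with equal variance, different means
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, scale=1.0, size=(500, 2))
X1 = rng.normal(loc=+1.0, scale=1.0, size=(500, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

gnb = GaussianNB().fit(X, y)
lr = LogisticRegression().fit(X, y)

# with enough data, the two posteriors (and decision boundaries) are close
grid = rng.normal(size=(5, 2))
print(gnb.predict_proba(grid)[:, 1])
print(lr.predict_proba(grid)[:, 1])
```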
[1] Chinese reference
How to apply logistic regression to a multi-class classification problem?
- Build a binary classifier for each category (one-versus-all).
- Softmax (multinomial logistic regression).
A set of independent binary regressions
One fairly simple way to arrive at the multinomial logit model is to imagine, for K possible outcomes, running K-1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other K-1 outcomes are separately regressed against the pivot outcome.
$$\ln \frac{\Pr(Y_i = 1)}{\Pr(Y_i = K)} = \beta_1 X_i$$
$$\ln \frac{\Pr(Y_i = 2)}{\Pr(Y_i = K)} = \beta_2 X_i$$
$$\cdots$$
$$\ln \frac{\Pr(Y_i = K-1)}{\Pr(Y_i = K)} = \beta_{K-1} X_i$$
Note that we have introduced separate sets of regression coefficients, one for each possible outcome. If we exponentiate both sides, and solve for the probabilities, we get:
$$\Pr(Y_i = 1) = \Pr(Y_i = K)\, e^{\beta_1 X_i}$$
$$\Pr(Y_i = 2) = \Pr(Y_i = K)\, e^{\beta_2 X_i}$$
$$\cdots$$
$$\Pr(Y_i = K-1) = \Pr(Y_i = K)\, e^{\beta_{K-1} X_i}$$
Using the fact that all K of the probabilities must sum to one, we find:
$$\Pr(Y_i = K) = 1 - \sum_{k=1}^{K-1} \Pr(Y_i = K)\, e^{\beta_k X_i} \;\Rightarrow\; \Pr(Y_i = K) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\beta_k X_i}}$$
We can use this to find the other probabilities:
$$\Pr(Y_i = 1) = \frac{e^{\beta_1 X_i}}{1 + \sum_{k=1}^{K-1} e^{\beta_k X_i}}$$
$$\Pr(Y_i = 2) = \frac{e^{\beta_2 X_i}}{1 + \sum_{k=1}^{K-1} e^{\beta_k X_i}}$$
$$\cdots$$
$$\Pr(Y_i = K-1) = \frac{e^{\beta_{K-1} X_i}}{1 + \sum_{k=1}^{K-1} e^{\beta_k X_i}}$$
The fact that we run multiple regressions against a common pivot reveals why the model relies on the assumption of independence of irrelevant alternatives.
[1] https://en.wikipedia.org/wiki/Multinomial_logistic_regression
Softmax
More:
Properties of Softmax:
- The calculated probabilities will be in the range of 0 to 1.
- The sum of all the probabilities equals 1.
When using softmax, we have the following probability model:
$$P(y = i \mid x, \theta) = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{K} e^{\theta_j^T x}}$$
We use the following rule to make predictions:
$$y^* = \arg\max_i P(y = i \mid x, \theta)$$
The loss function is
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} \mathbb{1}\{y_i = j\} \log \frac{e^{\theta_j^T x_i}}{\sum_{k=1}^{K} e^{\theta_k^T x_i}}$$
where $\mathbb{1}\{\cdot\}$ is the indicator of the true class.
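A minimal NumPy sketch of these softmax probabilities and the corresponding cross-entropy loss; the names and the max-subtraction trick for numerical stability are my own choices:

```python
import numpy as np

def softmax(scores):
    # scores: (n_samples, K) matrix of theta_j^T x per class
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(scores, y):
    # y: integer class labels in {0, ..., K-1}
    probs = softmax(scores)
    n = len(y)
    # pick out the log-probability of the true class for each sample
    return -np.mean(np.log(probs[np.arange(n), y]))
```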
Compare logistic regression with SVM
Similarity:
- classification algo
- supervised learning
- discriminative model
- both can be used for non-linear classification via the kernel trick
- the objective of both is to find a separating hyperplane
- both can reduce the influence of outliers (?? I don't understand this point)
Difference:
- Loss functions: SVM uses hinge loss; logistic regression uses cross-entropy loss.
- When estimating the parameters, all samples are involved in LR, while SVM only uses the support vectors. Because LR uses all samples, class imbalance has to be taken into account.
- LR models a probability; SVM models a separating hyperplane.
- LR is a statistical method; SVM is a geometric method.
- LR reduces the influence of samples far from the separating plane, while SVM mitigates the influence of far-away samples by relying only on the support vectors.
- SVM is more sensitive to outliers since it relies only on the support vectors for training, while LR uses all training samples.
[1] Chinese reference1
[2] Chinese reference 2: 浅析Logistics Regression (A Brief Analysis of Logistic Regression)
Compare logistic regression with Naive Bayes
Similarity:
- classification algo
- supervised learning
- When the conditional probability $P(X \mid Y = c_k)$ in Naive Bayes is assumed to be Gaussian with conditionally independent features, the resulting expression for $P(Y=1 \mid X)$ has the same form as logistic regression.
Difference:
- Logistic regression is a discriminative model: it learns $P(y \mid x)$. Naive Bayes is a generative model: it learns $P(x, y)$.
- The former needs iterative optimization; the latter does not (its parameters have closed-form estimates).
- When data is limited, Naive Bayes tends to do better than logistic regression; when a large amount of data is available, logistic regression tends to do better.
- Naive Bayes assumes $P(X \mid y)$ is Gaussian with conditionally independent features; if this independence assumption does not hold, Naive Bayes may not be better than LR.
[1] Chinese reference: 浅析Logistics Regression (A Brief Analysis of Logistic Regression)