User:Elmackev/sandbox

Regularization Perspectives on SVM

Support vector machines (SVM), like regularized least squares, are a special case of Tikhonov regularization. In the case of SVM, the loss function is the hinge loss.^[1]^[2]^[3]^[4]

Background

In the supervised learning framework, an algorithm is a strategy for choosing a function $f:\mathbf {X} \to \mathbf {Y}$ given a training set $S=\{(x_{1},y_{1}),\ldots ,(x_{n},y_{n})\}$ of inputs and their labels (the labels are usually $\pm 1$ ). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{{\frac {1}{n}}\sum _{i=1}^{n}V(y_{i},f(x_{i}))+\lambda ||f||_{\mathcal {H}}^{2}\right\}$ ,

where ${\mathcal {H}}$ is a hypothesis space^[5] of functions, $V:\mathbf {Y} \times \mathbf {Y} \to \mathbb {R}$ is the loss function, $||\cdot ||_{\mathcal {H}}$ is a norm on the hypothesis space of functions, and $\lambda >0$ is the regularization parameter^[6].

When ${\mathcal {H}}$ is a reproducing kernel Hilbert space, there exists a kernel function $K:\mathbf {X} \times \mathbf {X} \to \mathbb {R}$ whose values on the training inputs form an $n\times n$ symmetric positive semidefinite matrix $\mathbf {K}$ with entries $\mathbf {K} _{ij}=K(x_{i},x_{j})$ . By the representer theorem^[7], the minimizer can be written as $f(x)=\sum _{j=1}^{n}c_{j}K(x,x_{j})$ , so that $f(x_{i})=\sum _{j=1}^{n}c_{j}\mathbf {K} _{ij}$ and $||f||_{\mathcal {H}}^{2}=\langle f,f\rangle _{\mathcal {H}}=\sum _{i=1}^{n}\sum _{j=1}^{n}c_{i}c_{j}K(x_{i},x_{j})=c^{T}\mathbf {K} c$ .
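The finite-dimensional form given by the representer theorem can be sketched in a few lines of code. This is a minimal illustration, not a reference implementation: the Gaussian kernel, the squared loss, the toy data, and all function names below are assumptions chosen for the example and do not come from the sources cited here.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel on real numbers; an illustrative choice of K."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def gram_matrix(xs, kernel):
    """Kernel matrix with entries K_ij = K(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]


def predictions(K, c):
    """f(x_i) = sum_j c_j K_ij for every training point."""
    return [sum(cj * kij for cj, kij in zip(c, row)) for row in K]


def squared_norm(K, c):
    """||f||_H^2 = c^T K c."""
    return sum(ci * v for ci, v in zip(c, predictions(K, c)))


def regularized_objective(K, c, ys, loss, lam):
    """(1/n) sum_i V(y_i, f(x_i)) + lam * ||f||_H^2."""
    n = len(ys)
    empirical = sum(loss(y, fx) for y, fx in zip(ys, predictions(K, c))) / n
    return empirical + lam * squared_norm(K, c)


def square_loss(y, fx):
    """Placeholder loss V; any other loss with the same signature works."""
    return (y - fx) ** 2


# Toy data, made up for illustration.
xs = [0.0, 1.0, 2.0]
ys = [1, -1, 1]
K = gram_matrix(xs, gaussian_kernel)
c = [0.5, -0.25, 0.5]
print(regularized_objective(K, c, ys, square_loss, lam=0.1))
```

The point of the sketch is that once $\mathbf {K}$ is computed, both the fit term and the penalty depend on $f$ only through the coefficient vector $c$.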

Hinge loss

The simplest and most intuitive loss function for categorization is the misclassification loss, or 0-1 loss, which is 0 if $f(x_{i})=y_{i}$ and 1 if $f(x_{i})\neq y_{i}$ , i.e., the Heaviside step function on $-y_{i}f(x_{i})$ . However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0-1 loss. The hinge loss, $V(y_{i},f(x_{i}))=(1-y_{i}f(x_{i}))_{+}$ where $(s)_{+}=\max(s,0)$ , provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0-1 misclassification loss function^[8], and with infinite data returns the Bayes optimal solution:^[9]

$f_{b}(x)=\left\{{\begin{matrix}1&p(1|x)>p(-1|x)\\-1&p(1|x)<p(-1|x)\end{matrix}}\right.$
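The relationship between the two losses can be checked numerically. The sketch below writes both as functions of the margin $y_{i}f(x_{i})$ ; treating a margin of exactly 0 as an error is a convention assumed here, and the function names are illustrative.

```python
def zero_one_loss(y, fx):
    """0-1 loss on the margin: 1 if y * f(x) <= 0, else 0 (tie convention assumed)."""
    return 1.0 if y * fx <= 0 else 0.0


def hinge_loss(y, fx):
    """Hinge loss (1 - y * f(x))_+ ."""
    return max(0.0, 1.0 - y * fx)


# Compare the two losses for label y = +1 over a range of values f(x).
for fx in [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(fx, zero_one_loss(1, fx), hinge_loss(1, fx))
```

On every row the hinge value is at least the 0-1 value, and it reaches zero only once the margin is at least 1, which is what penalizes correct but low-confidence predictions.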

Derivation^[10]

With the hinge loss, $V(y_{i},f(x_{i}))=(1-y_{i}f(x_{i}))_{+}$ where $(s)_{+}=\max(s,0)$ , the regularization problem becomes:

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{{\frac {1}{n}}\sum _{i=1}^{n}(1-y_{i}f(x_{i}))_{+}+\lambda ||f||_{\mathcal {H}}^{2}\right\}$ .

In most of the SVM literature, this is written equivalently $\left({\text{take }}C={\frac {1}{2\lambda n}}\right)$ as:

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{C\sum _{i=1}^{n}(1-y_{i}f(x_{i}))_{+}+{\frac {1}{2}}||f||_{\mathcal {H}}^{2}\right\}$ .
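The two forms are equivalent because the second objective is the first one multiplied by the positive constant ${\frac {1}{2\lambda }}$ , which does not change the minimizer. A small numerical check, with made-up loss values and a made-up norm (all names and numbers here are illustrative assumptions):

```python
def lambda_objective(losses, norm_sq, lam):
    """(1/n) * sum of losses + lam * ||f||^2."""
    return sum(losses) / len(losses) + lam * norm_sq


def c_objective(losses, norm_sq, C):
    """C * sum of losses + (1/2) * ||f||^2."""
    return C * sum(losses) + 0.5 * norm_sq


# Hypothetical hinge-loss values and squared norm for some fixed f.
losses = [0.0, 0.3, 1.2, 0.0]
norm_sq = 2.5
lam = 0.1
C = 1.0 / (2.0 * lam * len(losses))

print(c_objective(losses, norm_sq, C))
print(lambda_objective(losses, norm_sq, lam) / (2.0 * lam))
```

Both printed values agree up to floating-point rounding, for any choice of losses and norm.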

This problem is non-differentiable because of the "kink" in the loss function. However, we can rewrite it using slack variables $\xi _{i}$ :

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{C\sum _{i=1}^{n}\xi _{i}+{\frac {1}{2}}||f||_{\mathcal {H}}^{2}\right\}$ subject to: ${\begin{aligned}\xi _{i}\geq 1-y_{i}f(x_{i}):\ \ \ &i=1,\ldots ,n\\\xi _{i}\geq 0:\ \ \ &i=1,\ldots ,n\end{aligned}}$
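For a fixed $f$ , the smallest slack that satisfies both constraints is $\xi _{i}=\max(0,1-y_{i}f(x_{i}))$ , which is why the slack formulation attains the same minimum as the hinge-loss formulation. The sketch below checks this on hypothetical values of $f(x_{i})$ ; the function names and numbers are assumptions made for illustration.

```python
def hinge_objective(ys, fs, norm_sq, C):
    """C * sum_i (1 - y_i f(x_i))_+ + (1/2) * ||f||^2."""
    return C * sum(max(0.0, 1.0 - y * fx) for y, fx in zip(ys, fs)) + 0.5 * norm_sq


def smallest_feasible_slacks(ys, fs):
    """Smallest xi_i with xi_i >= 1 - y_i f(x_i) and xi_i >= 0."""
    return [max(0.0, 1.0 - y * fx) for y, fx in zip(ys, fs)]


def slack_objective(xi, norm_sq, C):
    """C * sum_i xi_i + (1/2) * ||f||^2."""
    return C * sum(xi) + 0.5 * norm_sq


def is_feasible(xi, ys, fs, tol=1e-12):
    """Check both slack constraints for every training point."""
    return all(s >= -tol and s >= 1.0 - y * fx - tol
               for s, y, fx in zip(xi, ys, fs))


ys = [1, -1, 1]
fs = [2.0, 0.5, 0.25]  # hypothetical values of f(x_i)
norm_sq = 1.5          # hypothetical value of ||f||^2
C = 2.0

xi = smallest_feasible_slacks(ys, fs)
print(xi, is_feasible(xi, ys, fs))
print(slack_objective(xi, norm_sq, C), hinge_objective(ys, fs, norm_sq, C))
```

Any other feasible choice of slacks is componentwise larger and therefore gives a larger objective, so minimizing over $\xi$ recovers the hinge loss.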

Next we apply the representer theorem to get:

${\text{arg}}\min _{c\in \mathbb {R} ^{n},\,\xi \in \mathbb {R} ^{n}}\left\{C\sum _{i=1}^{n}\xi _{i}+{\frac {1}{2}}c^{T}\mathbf {K} c\right\}$ subject to: ${\begin{aligned}\xi _{i}\geq 1-y_{i}\sum _{j=1}^{n}c_{j}K(x_{i},x_{j}):\ \ \ &i=1,\ldots ,n\\\xi _{i}\geq 0:\ \ \ &i=1,\ldots ,n\end{aligned}}$

This is a constrained optimization problem, which we will solve using the Lagrangian to derive the dual problem. The Lagrangian is:

$L(c,\xi ,\alpha ,\zeta )=C\sum _{i=1}^{n}\xi _{i}+{\frac {1}{2}}c^{T}\mathbf {K} c-\sum _{i=1}^{n}\alpha _{i}\left(y_{i}\left\{\sum _{j=1}^{n}c_{j}K(x_{i},x_{j})\right\}-1+\xi _{i}\right)-\sum _{i=1}^{n}\zeta _{i}\xi _{i}$

The dual problem is:

${\text{arg}}\max _{\alpha ,\zeta \geq 0}\inf _{c,\xi }L(c,\xi ,\alpha ,\zeta )$

Minimizing $L$ with respect to $c_{i}$ : ${\frac {\partial L}{\partial c_{i}}}=0\Rightarrow c_{i}=\alpha _{i}y_{i}$ .

Minimizing $L$ with respect to $\xi _{i}$ : ${\frac {\partial L}{\partial \xi _{i}}}=0\Rightarrow C-\alpha _{i}-\zeta _{i}=0\Rightarrow 0\leq \alpha _{i}\leq C$ , since $\alpha _{i}\geq 0$ and $\zeta _{i}\geq 0$ .
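The stationarity condition in $c$ can be made explicit by collecting the terms of $L$ that depend on $c$ . The following is a sketch of that step, writing ${\text{diag}}\mathbf {Y}$ for the diagonal matrix of labels and using the symmetry of $\mathbf {K}$ :

```latex
\begin{aligned}
L(c,\xi,\alpha,\zeta) &= \tfrac{1}{2}\,c^{T}\mathbf{K}c
  - \alpha^{T}(\operatorname{diag}\mathbf{Y})\,\mathbf{K}c
  + (\text{terms not involving } c), \\
\nabla_{c}L &= \mathbf{K}c - \mathbf{K}(\operatorname{diag}\mathbf{Y})\alpha
  = \mathbf{K}\bigl(c - (\operatorname{diag}\mathbf{Y})\alpha\bigr), \\
\nabla_{c}L = 0 &\;\Leftarrow\; c = (\operatorname{diag}\mathbf{Y})\alpha,
  \quad\text{i.e. } c_{i} = \alpha_{i}y_{i}.
\end{aligned}
```

When $\mathbf {K}$ is invertible this is the only stationary point; when it is singular, $c_{i}=\alpha _{i}y_{i}$ is one convenient solution of the same condition.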

Then, plugging $\zeta _{i}=C-\alpha _{i}$ into the Lagrangian, the terms in $\xi _{i}$ cancel and we can write the dual problem as ${\text{arg}}\max _{0\leq \alpha _{i}\leq C}\inf _{c}L(c,\alpha )$ , where $L(c,\alpha )={\frac {1}{2}}c^{T}\mathbf {K} c+\sum _{i=1}^{n}\alpha _{i}\left(1-y_{i}\sum _{j=1}^{n}K(x_{i},x_{j})c_{j}\right)$

Then, plugging in $c_{i}=\alpha _{i}y_{i}$ , we get: ${\text{arg}}\max _{\alpha \in \mathbb {R} ^{n}}L(\alpha )={\text{arg}}\max _{\alpha \in \mathbb {R} ^{n}}\sum _{i=1}^{n}\alpha _{i}-{\frac {1}{2}}\sum _{i,j=1}^{n}\alpha _{i}y_{i}K(x_{i},x_{j})\alpha _{j}y_{j}={\text{arg}}\max _{\alpha \in \mathbb {R} ^{n}}\sum _{i=1}^{n}\alpha _{i}-{\frac {1}{2}}\alpha ^{T}({\text{diag}}\mathbf {Y} )\mathbf {K} ({\text{diag}}\mathbf {Y} )\alpha$

Subject to $0\leq \alpha _{i}\leq C\ \ \ i=1,\ldots ,n$

Note that this dual problem is easier to solve than the original problem because it is box constrained (the $\alpha _{i}$ are bounded). Also notice that the slack variables have disappeared in the dual problem.
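Because the feasible set is a box, one simple way to attack the dual is projected gradient ascent: take a gradient step on the concave objective and clip each $\alpha _{i}$ back into $[0,C]$ . The sketch below is only an illustration of that idea, not the method used by production SVM solvers; the step-size rule, the toy data, the linear kernel, and the function names are all assumptions made for this example.

```python
def solve_dual(K, ys, C, iters=2000):
    """Projected gradient ascent on sum(alpha) - 0.5 * alpha^T Q alpha, 0 <= alpha <= C."""
    n = len(ys)
    # Q = diag(Y) K diag(Y)
    Q = [[ys[i] * ys[j] * K[i][j] for j in range(n)] for i in range(n)]
    # For positive semidefinite Q, trace(Q) bounds the largest eigenvalue,
    # so 1 / trace(Q) is a safe (if conservative) step size.
    # Assumes trace(Q) > 0.
    step = 1.0 / sum(Q[i][i] for i in range(n))
    alpha = [0.0] * n
    for _ in range(iters):
        grad = [1.0 - sum(Q[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        # Gradient step followed by projection onto the box [0, C]^n.
        alpha = [min(C, max(0.0, a + step * g)) for a, g in zip(alpha, grad)]
    return alpha


def decision_values(K, ys, alpha):
    """f(x_i) = sum_j y_j alpha_j K(x_i, x_j)."""
    n = len(ys)
    return [sum(ys[j] * alpha[j] * K[i][j] for j in range(n)) for i in range(n)]


# Toy data, made up for illustration: four points on the real line, linear kernel.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]
K = [[xi * xj for xj in xs] for xi in xs]

alpha = solve_dual(K, ys, C=10.0)
margins = [y * f for y, f in zip(ys, decision_values(K, ys, alpha))]
print([round(a, 4) for a in alpha])
print([round(m, 4) for m in margins])
```

On this toy problem the two outer points end with $\alpha _{i}=0$ and margin greater than 1, while the two inner points carry all the weight and sit at margin 1.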

Consequences and interpretations

The Karush-Kuhn-Tucker conditions dictate that all optimal solutions must satisfy the following conditions for $i=1,\ldots ,n$ :

$\sum _{j=1}^{n}c_{j}K(x_{i},x_{j})-\sum _{j=1}^{n}y_{j}\alpha _{j}K(x_{i},x_{j})=0$

$C-\alpha _{i}-\zeta _{i}=0$

$y_{i}\left(\sum _{j=1}^{n}y_{j}\alpha _{j}K(x_{i},x_{j})\right)-1+\xi _{i}\geq 0$

$\alpha _{i}\left[y_{i}\left(\sum _{j=1}^{n}y_{j}\alpha _{j}K(x_{i},x_{j})\right)-1+\xi _{i}\right]=0$

$\zeta _{i}\xi _{i}=0$

$\xi _{i},\alpha _{i},\zeta _{i}\geq 0$

From these constraints, and recalling that $f(x)=\sum _{i=1}^{n}y_{i}\alpha _{i}K(x,x_{i})$ , we can derive conditions relating the $\alpha _{i}$ to $y_{i}f(x_{i})$ :^[11]

${\begin{aligned}y_{i}f(x_{i})>1&\Rightarrow (1-y_{i}f(x_{i}))<0\\&\Rightarrow \xi _{i}\neq (1-y_{i}f(x_{i}))\\&\Rightarrow \alpha _{i}=0\end{aligned}}$

${\begin{aligned}y_{i}f(x_{i})<1&\Rightarrow (1-y_{i}f(x_{i}))>0\\&\Rightarrow \xi _{i}>0\\&\Rightarrow \zeta _{i}=0\\&\Rightarrow \alpha _{i}=C\end{aligned}}$

${\begin{aligned}\alpha _{i}=C&\Rightarrow \xi _{i}=1-y_{i}f(x_{i})\\&\Rightarrow y_{i}f(x_{i})\leq 1\end{aligned}}$

${\begin{aligned}\alpha _{i}=0&\Rightarrow C=\zeta _{i}\\&\Rightarrow \xi _{i}=0\\&\Rightarrow y_{i}f(x_{i})\geq 1\end{aligned}}$

${\begin{aligned}0<\alpha _{i}<C&\Rightarrow \zeta _{i}\neq 0\\&\Rightarrow \xi _{i}=0\\&\Rightarrow y_{i}f(x_{i})=1\end{aligned}}$

Note that the solution is relatively sparse, because $\alpha _{i}=0$ whenever $y_{i}f(x_{i})>1$ . In SVM, the input points with non-zero coefficients are called support vectors. Given the above constraints, every support vector satisfies $y_{i}f(x_{i})\leq 1$ , and every input point with $y_{i}f(x_{i})<1$ is a support vector.
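The case analysis above can be turned into a small check that sorts training points by their dual coefficient and verifies the margin condition implied for each case. This is a sketch under assumptions: the labels "zero", "interior" and "at_bound", the tolerance, and the numbers in the example are all made up for illustration and are not the output of a real training run.

```python
def classify_by_alpha(alpha, C, tol=1e-6):
    """Label each point by where alpha_i sits in the box [0, C]."""
    labels = []
    for a in alpha:
        if a <= tol:
            labels.append("zero")       # not a support vector
        elif a >= C - tol:
            labels.append("at_bound")   # support vector with y_i f(x_i) <= 1
        else:
            labels.append("interior")   # support vector with y_i f(x_i) = 1
    return labels


def consistent_with_kkt(alpha, margins, C, tol=1e-6):
    """Check the margin y_i f(x_i) against the condition implied by alpha_i."""
    for label, m in zip(classify_by_alpha(alpha, C, tol), margins):
        if label == "zero" and m < 1.0 - tol:
            return False
        if label == "at_bound" and m > 1.0 + tol:
            return False
        if label == "interior" and abs(m - 1.0) > tol:
            return False
    return True


# Hand-picked illustrative values.
C = 1.0
alpha = [0.0, 0.3, 1.0, 0.0]
margins = [1.7, 1.0, 0.4, 2.1]

labels = classify_by_alpha(alpha, C)
support_vectors = [i for i, lab in enumerate(labels) if lab != "zero"]
print(labels)
print(support_vectors, consistent_with_kkt(alpha, margins, C))
```

Only the points labelled "interior" or "at_bound" contribute to $f$ , which is the sparsity described above.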

Notes

[1] Rosasco, Lorenzo. "Regularized Least-Squares and Support Vector Machines" (PDF).
[2] Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).
[3] Lee, Yoonkyung; Lin, Yi; Wahba, Grace (2004). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.
[4] Rosasco, Lorenzo; Vito, Ernesto De; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (2004). "Are Loss Functions All the Same". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
[5] This hypothesis space is a Hilbert space containing all the functions that the algorithm is allowed to choose from.
[6] For insight on choosing the parameter, see, e.g., Wahba, Grace; Wang, Yonghua (1990). "When is the optimal regularization parameter insensitive to the choice of the loss function". Communications in Statistics - Theory and Methods. 19 (5): 1685–1700. doi:10.1080/03610929008830285.
[7] See Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex J. (2001). "A Generalized Representer Theorem". Computational Learning Theory. Lecture Notes in Computer Science. 2111: 416–426. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.
[8] Lee, Yoonkyung; Lin, Yi; Wahba, Grace (2004). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.
[9] Rosasco, Lorenzo; Vito, Ernesto De; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (2004). "Are Loss Functions All the Same". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
[10] For a detailed derivation, see Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).
[11] For more detail, see Rosasco, Lorenzo. "Regularized Least-Squares and Support Vector Machines" (PDF).

References

Evgeniou, Theodoros; Pontil, Massimiliano; Poggio, Tomaso (2000). "Regularization Networks and Support Vector Machines" (PDF). Advances in Computational Mathematics. 13 (1): 1–50. doi:10.1023/A:1018946025316.

Joachims, Thorsten. "SVMlight".

Lee, Yoonkyung; Lin, Yi; Wahba, Grace (2004). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.

Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).

Rosasco, Lorenzo; Vito, Ernesto De; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (2004). "Are Loss Functions All the Same". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.

Rosasco, Lorenzo. "Regularized Least-Squares and Support Vector Machines" (PDF).

Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex J. (2001). "A Generalized Representer Theorem". Computational Learning Theory. Lecture Notes in Computer Science. 2111: 416–426. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.

Vapnik, Vladimir (1999). The Nature of Statistical Learning Theory. New York: Springer-Verlag. ISBN 0-387-98780-0.

Wahba, Grace; Wang, Yonghua (1990). "When is the optimal regularization parameter insensitive to the choice of the loss function". Communications in Statistics - Theory and Methods. 19 (5): 1685–1700. doi:10.1080/03610929008830285.
