How do we compose the network that performs the requisite function?

## Preliminary

• The bias can also be viewed as the weight of another input component that is always set to 1

• $z=\sum_{i} w_{i} x_{i}$

• What we learn: The *parameters* of the network

• Learning the network: Determining the values of these parameters such that the network computes the desired function

• How to learn a network?

• $\widehat{\boldsymbol{W}}=\underset{W}{\operatorname{argmin}} \int_{X} \operatorname{div}(f(X ; W), g(X))\, dX$
• div() is a divergence function that goes to zero when $f(X ; W)=g(X)$
• But in practice $g(x)$ will not have such a specification

• Sample $g(x)$: just gather training data
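The bias-as-extra-weight trick and the weighted sum $z=\sum_{i} w_{i} x_{i}$ above can be sketched as follows (all values are illustrative):

```python
import numpy as np

# Hypothetical weights and input for a single neuron
w = np.array([0.5, -1.0, 2.0])   # weights
b = 0.3                          # bias
x = np.array([1.0, 2.0, -0.5])   # input

# Direct computation: z = sum_i w_i x_i + b
z_direct = w @ x + b

# Bias folded in as the weight of a constant input component fixed to 1
w_aug = np.append(w, b)          # bias becomes one more weight
x_aug = np.append(x, 1.0)        # input gets a constant 1 appended
z_aug = w_aug @ x_aug

assert np.isclose(z_direct, z_aug)
```

Folding the bias into the weight vector lets learning rules treat all parameters uniformly.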

## Learning

### Simple perceptron

do: for $i = 1 \ldots N_{train}$

$O(X_i) = \operatorname{sign}(W^T X_i)$

if $O(X_i) \neq y_i$:

$W = W + y_i X_i$

until no more classification errors
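The perceptron rule above can be written as a runnable sketch (function name and the AND-gate example are illustrative; labels are assumed to be in $\{-1, +1\}$ with a constant 1 prepended to each input for the bias):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Perceptron learning rule: on each misclassified sample,
    add y_i * X_i to the weight vector. Repeats passes over the
    training data until an epoch produces no classification errors."""
    W = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for X_i, y_i in zip(X, y):
            if np.sign(W @ X_i) != y_i:   # O(X_i) != y_i
                W = W + y_i * X_i
                errors += 1
        if errors == 0:                   # until no more classification errors
            break
    return W

# Usage: learn an AND gate (bias input 1 prepended, labels in {-1, +1})
X = np.array([[1., -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]])
y = np.array([-1., -1, -1, 1])
W = train_perceptron(X, y)
```

Because AND is linearly separable, the loop is guaranteed to terminate with a separating weight vector.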

#### A more complex problem

• This can be perfectly represented using an MLP
• But the perceptron algorithm requires the labels learned by lower-level neurons to be linearly separable
• Finding these would require an exponential search over inputs
• So we need a differentiable function to compute the change in the output for *small* changes in either the input or the weights
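The need for differentiability can be illustrated by contrasting $\operatorname{sign}$ with a sigmoid (a minimal sketch; the numeric check is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # d sigmoid / dz

# sign() gives no gradient signal: a small change in z almost never
# changes the output (its derivative is 0 almost everywhere).
# A sigmoid responds smoothly, so small weight changes produce
# proportional output changes that can guide learning.
z = 0.5
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.isclose(numeric, sigmoid_grad(z), atol=1e-6)
```

The analytic and numeric derivatives agree, which is exactly the property gradient-based learning exploits.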

### Empirical Risk Minimization

Assuming $X$ is a random variable:

$$
\begin{aligned}
\widehat{\boldsymbol{W}} &= \underset{W}{\operatorname{argmin}} \int_{X} \operatorname{div}(f(X ; W), g(X)) P(X)\, dX \\
&= \underset{W}{\operatorname{argmin}} E[\operatorname{div}(f(X ; W), g(X))]
\end{aligned}
$$

Sample $g(X)$ to obtain training pairs $(X_i, d_i)$, where $d_{i}=g\left(X_{i}\right)+\text{noise}$, and estimate the function from the samples.

The empirical estimate of the expected error is the average error over the samples:

$E[\operatorname{div}(f(X ; W), g(X))] \approx \frac{1}{N} \sum_{i=1}^{N} \operatorname{div}\left(f\left(X_{i} ; W\right), d_{i}\right)$

The empirical average error (empirical risk) over all training data:

$\operatorname{Loss}(W)=\frac{1}{N} \sum_{i} \operatorname{div}\left(f\left(X_{i} ; W\right), d_{i}\right)$

Estimate the parameters to minimize the empirical estimate of the expected error:

$\widehat{\boldsymbol{W}}=\underset{W}{\operatorname{argmin}} \operatorname{Loss}(W)$
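The empirical risk computation can be sketched as follows, assuming for illustration that $f(X;W)$ is a linear model and $\operatorname{div}$ is the squared error (all names and values are hypothetical):

```python
import numpy as np

def f(X, W):
    return X @ W                 # illustrative model: linear in W

def div(pred, d):
    return (pred - d) ** 2       # squared-error divergence

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # N = 100 samples of X
W_true = np.array([1.0, -2.0, 0.5])
d = X @ W_true + 0.1 * rng.normal(size=100)    # d_i = g(X_i) + noise

W = np.array([0.9, -1.8, 0.4])   # candidate parameters
loss = np.mean(div(f(X, W), d))  # (1/N) * sum_i div(f(X_i; W), d_i)
```

A candidate $W$ close to the data-generating parameters yields a smaller empirical risk than a distant one, which is what the argmin exploits.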

## Problem statement

• Given a training set of input-output pairs

$\left(\boldsymbol{X}_{1}, \boldsymbol{d}_{1}\right),\left(\boldsymbol{X}_{2}, \boldsymbol{d}_{2}\right), \ldots,\left(\boldsymbol{X}_{N}, \boldsymbol{d}_{N}\right)$

• Minimize the following function

$\operatorname{Loss}(W)=\frac{1}{N} \sum_{i} \operatorname{div}\left(f\left(X_{i} ; W\right), d_{i}\right)$

• This is a problem of function minimization
• An instance of optimization
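The minimization can be sketched with plain gradient descent, assuming a linear $f(X;W)$ and squared-error divergence so the gradient has a closed form (setup and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
W_true = np.array([0.5, -1.0, 2.0])
d = X @ W_true                    # noiseless targets for illustration

W = np.zeros(3)                   # initial parameters
lr = 0.05                         # learning rate
for _ in range(500):
    pred = X @ W
    grad = (2.0 / len(X)) * X.T @ (pred - d)  # gradient of mean squared error
    W -= lr * grad                # descend the empirical risk surface
```

Each step moves $W$ against the gradient of $\operatorname{Loss}(W)$, so the empirical risk decreases until the iterates approach the minimizer.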