## concept

1. Generalization error
• Measure of how accurately an algorithm is able to predict outcome values for previously unseen data.
• Bias
• High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
• Model is too 'simple'
• Variance
• An error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
• Model is too 'complex'
• How we measure and trade off bias and variance ultimately determines our model selection
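The bias–variance trade-off can be illustrated with a small sketch (a hypothetical example, not from the notes): fitting polynomials of increasing degree to noisy samples of a quadratic function. A constant model underfits (high bias), a degree-15 polynomial overfits (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a quadratic ground truth.
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 200)
y_train = x_train**2 + rng.normal(0, 0.1, x_train.shape)
y_test = x_test**2  # noiseless truth, used to measure generalization

def fit_and_errors(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (0, 2, 15):
    train_mse, test_mse = fit_and_errors(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Training error always decreases with the degree, but test error is smallest near the true model complexity.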
2. Union bound
• $P(A_1 \cup ... \cup A_k) \le P(A_1) + ... + P(A_k)$
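A quick Monte Carlo check of the union bound (an illustrative sketch; the events are chosen arbitrarily):

```python
import random

random.seed(0)
trials = 100_000

# Three overlapping events on a fair six-sided die:
# A1 = {1, 2}, A2 = {even}, A3 = {5, 6}.
events = [lambda r: r <= 2, lambda r: r % 2 == 0, lambda r: r >= 5]

union_hits = 0
single_hits = [0] * len(events)
for _ in range(trials):
    r = random.randint(1, 6)
    union_hits += any(e(r) for e in events)
    for i, e in enumerate(events):
        single_hits[i] += e(r)

p_union = union_hits / trials          # P(A1 ∪ A2 ∪ A3) = 5/6 here
bound = sum(single_hits) / trials      # P(A1) + P(A2) + P(A3) = 7/6 here
print(f"P(union) ≈ {p_union:.3f}  <=  sum of P(A_i) ≈ {bound:.3f}")
```

The bound can exceed 1 (as it does here) and is loose when the events overlap heavily.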
3. Hoeffding inequality (Chernoff bound)
• Let the observed variables $Z_1, \dots, Z_m$ be i.i.d. Bernoulli with parameter $\phi$, and let $\hat{\phi} = \frac{1}{m}\sum_{i=1}^m Z_i$
• $P(|\phi - \hat{\phi}| > \gamma) \le 2\exp(-2\gamma^2 m)$
• In other words, as the sample size $m$ grows, the estimate gets closer and closer to the true value
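The concentration behaviour can be checked empirically (a sketch; the parameter values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

phi, gamma = 0.3, 0.05   # true Bernoulli parameter and tolerance
repeats = 2000           # independent estimation experiments per m

for m in (50, 200, 1000):
    # Each experiment: draw m Bernoulli(phi) samples, estimate phi by the mean.
    phi_hat = rng.binomial(m, phi, size=repeats) / m
    p_deviate = np.mean(np.abs(phi_hat - phi) > gamma)
    bound = 2 * np.exp(-2 * gamma**2 * m)
    print(f"m={m:5d}  P(|phi_hat - phi| > gamma) ≈ {p_deviate:.3f}  bound = {bound:.3f}")
```

For small $m$ the bound is vacuous (greater than 1); as $m$ grows both the deviation probability and the bound shrink rapidly.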
4. Empirical risk/error
• $\hat{\epsilon }(h) = \frac{1}{m} \sum\limits_{i=1}^m 1(h(x_i)\ne y_i)$
5. Empirical risk minimization (ERM)
• Since we only have a finite sample, the most natural approach is to minimize the empirical risk:
• $\hat{\theta} = \arg\min_\theta \hat{\epsilon}(h_\theta)$
6. Hypothesis class
• Set of all classifiers over $X$ (the domain of the inputs)
• In effect, a family of models of one type, e.g. linear models, neural networks, etc.
7. Empirical risk minimization therefore amounts to choosing the best hypothesis within the hypothesis class:
• $\hat{h} = \arg\min_{h \in H} \hat{\epsilon}(h)$
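ERM over a finite hypothesis class can be sketched concretely with threshold classifiers on 1-D data (a hypothetical setup; the data and class are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: true label is 1{x > 0.6}, with 10% label noise.
m = 500
x = rng.uniform(0, 1, m)
y = (x > 0.6).astype(int)
flip = rng.uniform(0, 1, m) < 0.1
y[flip] = 1 - y[flip]

# Finite hypothesis class: k = 21 thresholds, h_t(x) = 1{x > t}.
thresholds = np.linspace(0, 1, 21)

def empirical_error(t):
    """epsilon_hat(h_t): fraction of training points h_t misclassifies."""
    return np.mean(((x > t).astype(int)) != y)

# ERM: pick the hypothesis with the smallest empirical error.
errors = [empirical_error(t) for t in thresholds]
h_hat = thresholds[int(np.argmin(errors))]
print(f"ERM picks threshold {h_hat:.2f} with training error {min(errors):.3f}")
```

Because the labels carry 10% noise, even the best hypothesis has empirical error around 0.1; ERM recovers a threshold close to the true 0.6.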

## finite hypothesis class

1. Is this estimate reliable?
2. What upper bound does it give on the generalization error?

1. $m \ge\frac{1}{2\gamma^2}\log \frac{2k}{\delta}$
• Note that the larger the number of hypotheses $k$, the more samples we need, but the requirement grows only logarithmically in $k$, i.e. very slowly
2. $\gamma \le \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$
• The larger the sample size, the smaller the error. Moreover, one can show that the model obtained by empirical risk minimization is at most $2\gamma$ worse than the best model in the same hypothesis class:
• $\epsilon(\hat{h}) \le \epsilon(h^\star)+ 2\gamma$

$\epsilon(\hat{h}) \le (\min \epsilon(h)) + 2\sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$
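Plugging numbers into the sample-complexity bound above shows how slowly the requirement grows with $k$ (the values of $\gamma$, $\delta$, and $k$ are chosen purely for illustration):

```python
import math

gamma, delta = 0.05, 0.05  # tolerance and failure probability

# m >= (1 / (2 * gamma^2)) * log(2k / delta)
ks = (10, 1000, 10**6)
ms = [math.ceil(math.log(2 * k / delta) / (2 * gamma**2)) for k in ks]

for k, m_needed in zip(ks, ms):
    print(f"k = {k:>8}  ->  m >= {m_needed}")
```

Multiplying $k$ by 100,000 only roughly triples the number of samples required.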

## infinite hypothesis class

$m \ge O\left(\frac{1}{\gamma^2}\log \frac{k}{\delta}\right) = O\left(\frac{d}{\gamma^2}\log\frac{1}{\delta}\right)$, where $d$ is the number of parameters of $H$

We conclude that, for an algorithm that tries to minimize training error, the number of training examples needed is usually roughly linear in the number of parameters of $H$.