# Population

• Let the population be of size $N$. Its numerical values will be denoted by $x_1, x_2, \ldots, x_N$.
• $\mu = \frac{1}{N}\sum x_i$
• $\tau = N\mu$
• $\sigma^2 = \frac{1}{N} \sum(x_i - \mu)^2 = \frac{1}{N}\sum x_i^2 -\mu^2$

# Simple Random Sampling

• We will denote the sample size by $n$ (with $n < N$) and the values of the sample members by $X_1, X_2, \ldots, X_n$.
• $\hat{\mu} = \bar{X} = \frac{1}{n} \sum X_i$
• $\hat{\tau} = N\bar{X}$
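A minimal sketch of these two estimators (the toy population, sample size, and function name are illustrative, not from the source):

```python
import random

def srs_estimates(population, n, seed=0):
    """Estimate the population mean and total from a simple
    random sample drawn without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)   # sampling without replacement
    N = len(population)
    mu_hat = sum(sample) / n             # X-bar, estimate of mu
    tau_hat = N * mu_hat                 # N * X-bar, estimate of tau
    return mu_hat, tau_hat

# Toy population of size N = 100; the estimates are unbiased over repeated samples.
pop = list(range(1, 101))
mu_hat, tau_hat = srs_estimates(pop, n=20)
```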

## Sampling Mean

• $E(\bar{X} ) = \mu$
• $Var(\bar{X}) = \frac{\sigma^2} {n}$
• If sampling is without replacement (the default):
• $Var(\bar{X}) = \frac{\sigma^2}{n} (1- \frac{n-1}{N-1})$
• $\sigma_{\bar{X}} \approx \frac{\sigma}{\sqrt{n}}$
• Clearly, the larger $n$ is, the smaller $\sigma_{\bar{X}}$ is, and hence the greater the accuracy.

## Estimation of the population $\sigma$

### $\sigma^2$

• $\hat{\sigma}^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2$
• $E(\hat{\sigma}^2) = \sigma^2 (\frac{n-1}{n}) \frac{N}{N-1}$, this is a biased estimate of $\sigma^2$
• so the unbiased estimate of $\sigma^2$ is $\frac{1}{n-1}\frac{N-1}{N} \sum (X_i - \bar{X})^2$

### $Var(\bar{X})$

• $Var(\bar{X}) = \frac{\sigma^2}{n} (1- \frac{n-1}{N-1})$
• An unbiased estimate of $Var(\bar{X})$ is :
• $s_{\bar{X}}^2 = \frac{\hat{\sigma}^2}{n} (\frac{n}{n-1}) (\frac{N-1}{N}) (\frac{N-n}{N-1})$
• $E(s_{\bar{X}}^2) = Var(\bar{X})$
• Note that this is the unbiased estimate of the variance of $\bar{X}$, the estimator of the population mean; its purpose is to gauge how much that estimator fluctuates.
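The estimates in this section can be sketched as follows (the function name and toy inputs are my own; the formulas are the ones above):

```python
def var_estimates(sample, N):
    """Variance estimates under SRS without replacement.
    N is the (known) population size; sample holds X_1, ..., X_n."""
    n = len(sample)
    xbar = sum(sample) / n
    # Biased estimate: (1/n) sum X_i^2 - X-bar^2
    sigma2_hat = sum(x * x for x in sample) / n - xbar ** 2
    # Unbiased estimate of sigma^2: (1/(n-1)) ((N-1)/N) sum (X_i - X-bar)^2
    sigma2_unbiased = (N - 1) / N * sum((x - xbar) ** 2 for x in sample) / (n - 1)
    # Unbiased estimate of Var(X-bar), including the finite population correction
    s2_xbar = (sigma2_hat / n) * (n / (n - 1)) * ((N - 1) / N) * ((N - n) / (N - 1))
    return sigma2_hat, sigma2_unbiased, s2_xbar
```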

# Random Sampling

## Direct model

### Pseudo-random number generation

• How can a computer draw from an arbitrary distribution?
• Transform uniformly distributed random numbers, e.g. via the inverse CDF.
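For example, inverse-CDF sampling turns uniform draws into draws from another distribution; a sketch for the exponential case (the rate parameter and sample size are arbitrary choices):

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Inverse-CDF sampling: if U ~ Uniform(0,1), then
    X = -ln(1 - U) / lam has CDF F(x) = 1 - exp(-lam * x)."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

# Draws from Exp(2); the sample mean should be close to 1/lam = 0.5.
draws = sample_exponential(lam=2.0, n=100_000)
```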

### Alias sampling

• A naive table lookup needs $O(\log n)$ time per draw to locate the right interval (binary search); alias sampling needs only constant time.
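A sketch of the alias method (Vose's variant of the standard construction; the example weights are arbitrary). Building the tables costs $O(n)$, and each draw then needs only one uniform index and one coin flip:

```python
import random

def build_alias(probs):
    """Build alias tables: each column i keeps probability prob[i]
    for outcome i and routes the rest to alias[i]."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias_tab = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias_tab[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]        # l donates mass to fill column s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                 # leftovers keep their whole column
        prob[i] = 1.0
    return prob, alias_tab

def alias_draw(prob, alias_tab, rng):
    """O(1) draw: pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias_tab[i]

prob, alias_tab = build_alias([0.5, 0.3, 0.2])
```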

## Indirect model

### Rejection sampling

• Choose the smallest envelope distribution that still dominates the target distribution, so that the number of rejections is minimized.
• Proof
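A minimal rejection-sampling sketch. The uniform proposal and the Beta(2,2) target are illustrative choices; the envelope constant $M$ is tight here ($\max 6x(1-x) = 1.5$), which keeps the rejection rate low, as the bullet above suggests:

```python
import random

def rejection_sample(target_pdf, M, n, seed=0):
    """Rejection sampling with proposal q(x) = 1 on [0, 1].
    Requires target_pdf(x) <= M on [0, 1]; accept x with
    probability target_pdf(x) / (M * q(x))."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.random()                   # propose from q
        if rng.random() < target_pdf(x) / M:
            out.append(x)                  # accept; otherwise reject and retry
    return out

# Target: Beta(2,2) density p(x) = 6x(1-x), bounded by M = 1.5.
samples = rejection_sample(lambda x: 6 * x * (1 - x), M=1.5, n=50_000)
```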

## Stratified sampling

### population $\mu$ and $\sigma^2$

• The population mean and variance of the $l$th stratum are denoted by $\mu_l$ and $\sigma^2_l$.
• $\mu = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_l}x_{il} = \sum_{l=1}^{L} W_l\mu_l$, where $W_l = N_l/N$ is the stratum weight.

### sample $\mu$ and $\sigma^2$

• mean of the $l$th stratum:
• $\bar{X_l} = \frac{1}{n_l}\sum\limits_{i=1}^{n_l}X_{il}$
• variance of the $l$th stratum:
• $s_l^2 = \frac{1}{n_l-1}\sum\limits_{i=1}^{n_l}(X_{il} - \bar{X_l})^2$
• the natural estimate of $\mu$:
• $\bar{X_s} = \sum\limits_{l=1}^{L}\frac{N_l\bar{X_l}}{N} = \sum\limits_{l=1}^{L}W_l \bar X_l$
• it is an unbiased estimator of $\mu$
• The variance of the stratified sample mean is given by:
• $Var(\bar{X_s}) = \sum\limits_{l=1}^{L}W_l^2\frac{\sigma_l^2}{n_l}(1 - \frac{n_l-1}{N_l-1})$
• If the sampling fractions within all strata are small:
• $Var(\bar{X_s}) \approx \sum\limits_{l=1}^{L}W_l^2\frac{\sigma_l^2}{n_l}$
• using $s_l^2$ to replace $\sigma_l^2$, the estimator of $Var(\bar{X_s})$:
• $s_{\bar{X_s}}^2 = \sum\limits_{l=1}^L W_l^2 \frac{s_l^2}{n_l}(1-\frac{n_l-1}{N_l-1})$
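The stratified estimators above can be sketched as follows (the function name and the toy strata are my own):

```python
def stratified_estimates(strata_samples, strata_sizes):
    """Stratified sample mean and its estimated variance (with the
    finite population correction). strata_samples[l] holds the draws
    from stratum l; strata_sizes[l] is N_l."""
    N = sum(strata_sizes)
    xbar_s, var_hat = 0.0, 0.0
    for sample, N_l in zip(strata_samples, strata_sizes):
        n_l = len(sample)
        W_l = N_l / N
        xbar_l = sum(sample) / n_l
        s2_l = sum((x - xbar_l) ** 2 for x in sample) / (n_l - 1)
        xbar_s += W_l * xbar_l
        var_hat += W_l ** 2 * (s2_l / n_l) * (1 - (n_l - 1) / (N_l - 1))
    return xbar_s, var_hat
```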

### estimator of the population total

• Let $T_s = N\bar{X_s}$; then $E(T_s) = \tau$.
• $Var(T_s) = N^2Var(\bar{X_s}) = N^2\sum\limits_{l=1}^{L}W_l^2\frac{\sigma_l^2}{n_l}(1 - \frac{n_l-1}{N_l-1}) = \sum\limits_{l=1}^{L}N_l^2\frac{\sigma_l^2}{n_l}(1 - \frac{n_l-1}{N_l-1})$

### methods of allocation

#### ideal

The sample sizes $n_1, \ldots, n_L$ that minimize $Var(\bar{X_s})$ subject to the constraint $n_1 + \cdots + n_L = n$ are given by:

$n_l = n \frac{W_l\sigma_l}{\sum_{k=1}^{L}W_k\sigma_k}$

The variance of the stratified estimate under the optimal allocation $n_l$ above, neglecting the finite population correction, is:

$Var(\bar{X_{so}}) = \frac{(\sum W_l \sigma _l)^2}{n}$
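A one-line sketch of this optimal (Neyman) allocation, $n_l \propto W_l\sigma_l$ (the function name and toy inputs are my own):

```python
def neyman_allocation(n, W, sigma):
    """Allocate a total sample of size n across strata in
    proportion to W_l * sigma_l (optimal allocation)."""
    weights = [w * s for w, s in zip(W, sigma)]
    total = sum(weights)
    return [n * wl / total for wl in weights]

# Two equal-sized strata; the noisier stratum gets three times the sample.
alloc = neyman_allocation(8, W=[0.5, 0.5], sigma=[1.0, 3.0])
```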

#### $\sigma_l$ unknown

if $n_l = nW_l$ (proportional allocation):

$Var(\bar{X_{sp}}) = \sum\limits_{l=1}^L W_l^2\frac{\sigma^2_l}{n_l} = \frac{1}{n} \sum\limits_{l=1}^LW_l\sigma^2_l$

The difference between $Var(\bar{X_{sp}})$ and $Var(\bar{X_{so}})$ is easy to compute:

$Var(\bar{X_{sp}}) - Var(\bar{X_{so}}) = \frac{1}{n}\sum\limits_{l=1}^L W_l(\sigma_l - \bar{\sigma})^2$

where $\bar{\sigma} = \sum\limits_{l=1}^L W_l\sigma_l$

# MCMC

We cannot sample directly from the target distribution $p(x)$ in the integral $\int f(x)p(x)\,dx$.

• Create a Markov chain whose transition matrix does not depend on the normalization term.
• Make sure the chain has a stationary distribution $\pi(i)$ and it is equal to the target distribution.
• After a sufficient number of iterations, the chain will converge to the stationary distribution.

## Markov Chain and Random Walk

• A random process is called a Markov Process if conditional on the current state, its future is independent of its past.
• $P(X_{n+1} = x_{n+1}| X_0 = x_0,...,X_n = x_n) = P(X_{n+1} = x_{n+1}|X_{n} = x_n)$
• The term Markov property refers to the memoryless property of a stochastic process.
• A Markov chain $X_t$ is said to be time homogeneous if $P(X_{s+t} = j | X_s = i)$ is independent of s. When this holds, putting s = 0 gives:
• $P(X_{s+t} = j| X_s = i) = P(X_t = j| X_0 = i)$
• If moreover $P(X_{n+1} = j | X_n = i) = P_{ij}$ is independent of $n$, then $X$ is said to be a time homogeneous Markov chain.

### Transition matrix

The transition matrix is an $N\times N$ matrix of nonnegative entries such that the sum over each row of $P^{(t)}$ is 1, since for all $t$:

$\sum_{y}P_{x,y}^{(t+1)} = \sum_yP[X_{t+1} = y| X_t = x] =1$

• Note that the transition matrix is in general not symmetric.
• The normalized Laplacian matrix can be used to compute its eigenvalues.

### State distribution

Let $\pi^{(t)}$ be the state distribution of the chain at time $t$, i.e. $\pi^{(t)}_x = P[X_t = x]$.

For a finite chain, $\pi^{(t)}$ is a vector of $N$ nonnegative entries such that $\sum_x\pi_x^{(t)} = 1$. Applying the law of total probability, it holds that $\pi^{(t+1)} = \pi^{(t)}P^{(t+1)}$:

$\pi_y^{(t+1)} = \sum_xP[X_{t+1} = y| X_t = x]P[X_t = x] = \sum_x\pi_x^{(t)}P_{x,y}^{(t+1)}$

A stationary distribution of a finite Markov chain with transition matrix $P$ is a probability distribution $\pi$ such that $\pi P = \pi$.
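A stationary distribution can be approximated numerically by iterating $\pi^{(t+1)} = \pi^{(t)}P$ until convergence; a sketch with an arbitrary two-state chain:

```python
def stationary(P, iters=1000):
    """Approximate the stationary distribution pi (with pi P = pi)
    by repeatedly multiplying an initial distribution by P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi

# Two-state chain; solving pi P = pi by hand gives pi = (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
```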

### Irreducibility

State $y$ is accessible from state $x$ if it is possible for the chain to visit state $y$ when started in state $x$; in other words, $(P^n)_{x,y} > 0$ for some $n$.

State $x$ communicates with state $y$ if $y$ is accessible from $x$ and $x$ is accessible from $y$. We say that the Markov chain is irreducible if all pairs of states communicate.

### Aperiodicity

The period of a state $x$ is the greatest common divisor $d_x = \gcd\{n \mid (P^n)_{x,x} > 0\}$. A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.

Theorem

• If the states $x$ and $y$ communicate, then $d_x = d_y$.
• We have $(P^n)_{x,x} = 0$ if $n \bmod d_x \ne 0$.

### Detailed Balance Condition

Let $X_0,X_1, ...,$ be an aperiodic Markov chain with transition matrix $P$ and distribution $\pi$. If the following condition holds,

$\pi(i)P_{ij} = \pi(j)P_{ji}$

then $\pi(x)$ is the stationary distribution of the Markov chain. The above equation is called the detailed balance condition.
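The condition is easy to check numerically; a sketch with an arbitrary two-state chain and its stationary distribution:

```python
def satisfies_detailed_balance(pi, P, tol=1e-12):
    """Check pi(i) P_ij == pi(j) P_ji for every pair of states."""
    n = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))

# A two-state chain with stationary distribution (5/6, 1/6):
# pi(0) P_01 = 5/6 * 0.1 = 1/12 = 1/6 * 0.5 = pi(1) P_10.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [5 / 6, 1 / 6]
```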

## MCMC algorithm

### Revising the Markov Chain

• How do we choose $\alpha(i,j)$ such that $\pi(i)P_{ij} \alpha(i,j) = \pi(j)P_{ji} \alpha(j,i)$?
• By symmetry, we can simply choose $\alpha(i,j) = \pi(j)P_{ji}$ and $\alpha(j,i) = \pi(i)P_{ij}$; the equation then holds immediately.
• This is the basic idea of MCMC: a Markov chain converges to a stationary distribution, so if that distribution happens to be the one we want to sample from, then once the chain has converged at step $n$, the subsequent states $X^{(n)}, X^{(n+1)}, \ldots$ can be treated as samples from it.
• We construct a kernel $Q_{ij}$ that satisfies the detailed balance condition. To make a transition under $Q$, first sample a proposal from $P$ (using the uniform-distribution techniques above; for normal proposals there are standard methods such as the Box-Muller transform), then treat $\alpha$ as an acceptance probability: draw a uniform random number and accept the move if it falls below $\alpha(i,j) = \pi(j)P_{ji}$.

### MCMC Sampling Algorithm

• Our goal is now to make transitions according to the kernel $Q$.
• Step 3 proposes with the original kernel $P$; call the proposed value $x$.
• Step 4 evaluates the acceptance probability $\alpha(i,j) = \pi(j)P(j,i)$ at the current and proposed states. The probability of actually moving is therefore $P(x|x^{(i)})\alpha(i,j)$, which is smaller than before; this is exactly the transition probability of the new kernel $Q$, so the detailed balance condition holds.

## Metropolis-Hastings algorithm

• The one drawback of the basic MCMC method is the acceptance rate: when $\alpha$ is small, many proposals are needed before the chain advances a single step, which is inefficient. Can we modify the method so that $\alpha$ becomes larger?
• For the current state $x^{(i)}$, we use the detailed balance condition:
• $\pi(x^{(i)})P(x^{(j)}|x^{(i)}) \alpha(i,j) = \pi(x^{(j)})P(x^{(i)}| x^{(j)})\alpha(j,i)$
• We can divide both sides by the larger of the two acceptance probabilities, scaling it up to 1 without breaking the balance; this increases the acceptance rate. The acceptance probability becomes:
• $\alpha(i,j) = \min \{1,\frac{\pi(x)P(x^{(i)}| x)}{\pi(x^{(i)})P(x|x^{(i)})}\}$
• The algorithm therefore changes only slightly:
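A sketch of the resulting Metropolis-Hastings sampler. It uses a symmetric Gaussian random-walk proposal, so the proposal densities cancel in the ratio; the target, step size, and chain length are illustrative choices:

```python
import math
import random

def metropolis_hastings(log_pi, n, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: with a symmetric proposal the ratio
    P(x^(i) | x) / P(x | x^(i)) cancels, leaving
    alpha = min(1, pi(x) / pi(x^(i)))."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n):
        prop = x + rng.gauss(0.0, step)       # symmetric Gaussian proposal
        accept = math.exp(min(0.0, log_pi(prop) - log_pi(x)))
        if rng.random() < accept:
            x = prop                           # accept the move
        out.append(x)                          # otherwise stay at x
    return out

# Target: standard normal, known only up to a constant (log pi = -x^2/2).
chain = metropolis_hastings(lambda x: -0.5 * x * x, n=100_000)
```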

# Gibbs Sampling

• Raising the acceptance rate helps, but is there a method that guarantees an acceptance rate of exactly 1?

## Intuition

• Consider a two-dimensional distribution:
• $P(x_1,y_1)P(y_2|x_1) = P(x_1) P(y_1|x_1)P(y_2|x_1)$
• $P(x_1,y_2)P(y_1|x_1) = P(x_1)P(y_2|x_1)P(y_1|x_1)$
• The two right-hand sides are equal, so:
• $P(x_1,y_1)P(y_2|x_1) = P(x_1,y_2)P(y_1|x_1)$
• Hence moves along a coordinate axis, resampling one coordinate from its conditional distribution, satisfy the detailed balance condition.
• Therefore the resulting chain has the target joint distribution as its stationary distribution.
• This extends easily to the multidimensional case.

## Advantages

1. The joint distribution need not be known in advance; Gibbs sampling applies when only the conditional distributions are available.
2. The acceptance rate is 1 for every draw.
3. Once the chain reaches the stationary distribution, the sampled points approximate the joint distribution.
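A minimal Gibbs sampler for a standard bivariate normal with correlation $\rho$, where both full conditionals are known normals and every draw is accepted (the choice of target and chain length are illustrative):

```python
import random

def gibbs_bivariate_normal(rho, n, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Full conditionals: X | Y = y ~ N(rho * y, 1 - rho^2), and symmetrically
    for Y; each coordinate is resampled from its conditional in turn."""
    rng = random.Random(seed)
    sd = (1.0 - rho ** 2) ** 0.5
    x = y = 0.0
    out = []
    for _ in range(n):
        x = rng.gauss(rho * y, sd)   # draw x ~ p(x | y)
        y = rng.gauss(rho * x, sd)   # draw y ~ p(y | x)
        out.append((x, y))
    return out

draws = gibbs_bivariate_normal(rho=0.8, n=50_000)
```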