Machine Learning Basics (2)
Author: Internet
CONTENTS
Capacity, Overfitting and Underfitting
- The ability to perform well on previously unobserved inputs is called generalization.
- The field of statistical learning theory provides some answers. If the training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress.
- The train and test data are generated by a probability distribution over datasets called the data generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the train set and test set are identically distributed, drawn from the same probability distribution as each other.
- This assumption allows us to describe the data generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data generating distribution, denoted $p_{\text{data}}$. This probabilistic framework and the i.i.d. assumptions allow us to mathematically study the relationship between training error and test error.
- One immediate connection we can observe between the training and test error is that the expected training error of a randomly selected model is equal to the expected test error of that model.
- Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of time, then sample both datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to:
- Make the training error small.
- Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.
- We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's capacity is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
- One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution.
- For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.
A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction
$$\hat{y} = b + wx.$$
By introducing $x^2$ as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of $x$:
$$\hat{y} = b + w_1 x + w_2 x^2$$
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of $x$ as additional features, for example to obtain a polynomial of degree 9:
$$\hat{y} = b + \sum_{i=1}^{9} w_i x^i$$
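To make the capacity discussion concrete, here is a small sketch (not from the original text) that fits polynomials of degree 1, 2 and 9 to noisy samples of a quadratic function using the normal equations mentioned above; the data, seed and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: noisy samples of a quadratic function, few training points
x = rng.uniform(-1.0, 1.0, size=12)
y = 0.5 + 1.5 * x - 2.0 * x**2 + rng.normal(scale=0.1, size=x.shape)

def fit_polynomial(x, y, degree):
    # Design matrix with columns [1, x, x^2, ..., x^degree]
    X = np.vander(x, degree + 1, increasing=True)
    # Normal equations solved via the pseudo-inverse (X^T X becomes
    # ill-conditioned for degree 9, where the parameter count nears the number of points)
    return np.linalg.pinv(X) @ y

for degree in (1, 2, 9):
    w = fit_polynomial(x, y, degree)
    pred = np.vander(x, degree + 1, increasing=True) @ w
    print(degree, np.mean((pred - y) ** 2))  # training MSE falls as capacity grows
```

The degree-1 model underfits, the degree-2 model matches the true structure, and the degree-9 model drives the training error toward zero while remaining free to generalize poorly.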
Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task they may overfit.
- The figure above shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task, so it generalizes well to new data.
- So far we have described only one way of changing a model's capacity: by changing the number of input features it has, and simultaneously adding new parameters associated with those features.
- There are in fact many ways of changing a model's capacity. Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a very difficult optimization problem.
- In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.
- Occam's razor states that among competing hypotheses that explain known observations equally well, one should choose the "simplest" one.
- Statistical learning theory provides various means of quantifying model capacity. Among these, the most well-known is the Vapnik-Chervonenkis dimension, or VC dimension.
- The VC dimension measures the capacity of a binary classifier. It is defined as the largest possible value of $m$ for which there exists a training set of $m$ different $\boldsymbol{x}$ points that the classifier can label arbitrarily.
- Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.
- To reach the most extreme case of arbitrarily high capacity, we introduce the concept of non-parametric models. So far, we have seen only parametric models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Non-parametric models have no such limitation.
- Sometimes, non-parametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical non-parametric models by making their complexity a function of the training set size.
- One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the $\boldsymbol{X}$ and $\boldsymbol{y}$ from the training set. When asked to predict for a test point $\boldsymbol{x}$, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, $\hat{y} = y_i$ where $i = \arg\min_i \|\boldsymbol{X}_{i,:} - \boldsymbol{x}\|_2^2$. The algorithm can also be generalized to distance metrics other than the $L^2$ norm, such as learned distance metrics. If the algorithm is allowed to break ties by averaging the $y_i$ values for all $\boldsymbol{X}_{i,:}$ that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.
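A minimal sketch of the nearest neighbor regression rule described above; the tie-breaking by averaging follows the text, while the function name and example data are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    # Squared L2 distance from the query point to every stored training input
    dists = np.sum((X_train - x) ** 2, axis=1)
    # Break ties by averaging the targets of all entries tied for nearest
    nearest = np.flatnonzero(dists == dists.min())
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])
print(nearest_neighbor_predict(X_train, y_train, np.array([1.2])))  # -> 1.0
```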
- Finally, we can also create a non-parametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number of parameters as needed. For example, we could imagine an outer loop of learning that changes the degree of the polynomial learned by linear regression on top of a polynomial expansion of the input.
- The ideal model is an oracle that simply knows the true probability distribution that generates the data. Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from $x$ to $y$ may be inherently stochastic, or $y$ may be a deterministic function that involves other variables besides those included in $x$. The error incurred by an oracle making predictions from the true distribution $p(x, y)$ is called the Bayes error.
- Training and generalization error vary as the size of the training set varies. Expected generalization error can never increase as the number of training examples increases. For non-parametric models, more data yields better generalization until the best possible error is achieved. Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error.
The No Free Lunch Theorem
- The no free lunch theorem for machine learning states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
- Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.
- This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.
Regularization
- There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization.
- Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.
- The no free lunch theorem has made it clear that there is no best machine learning algorithm, and, in particular, no best form of regularization. Instead we must choose a form of regularization that is well-suited to the particular task we want to solve.
Hyperparameters and Validation Sets
- Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another learning algorithm).
- It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets.
- One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly.
- The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process.
- The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is used to “train” the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error. After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
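As a sketch of the 80%/20% split described above (the function name and the shuffling strategy are assumptions, not part of the text):

```python
import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    # Shuffle the training data, then hold out a fraction for validation
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * validation_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```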
- In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
Cross-Validation
- Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm $A$ works better than algorithm $B$ on the given task.
- The most common procedure for addressing this is $k$-fold cross-validation, in which a partition of the dataset is formed by splitting it into $k$ non-overlapping subsets. The test error may then be estimated by taking the average test error across $k$ trials. On trial $i$, the $i$-th subset of the data is used as the test set and the rest of the data is used as the training set.
The $k$-fold cross-validation algorithm.
- It can be used to estimate the generalization error of a learning algorithm $A$ when the given dataset $\mathbb{D}$ is too small for a simple train/test or train/valid split to yield an accurate estimate of generalization error, because the mean of a loss $L$ on a small test set may have too high variance. The dataset $\mathbb{D}$ contains as elements the abstract examples $\boldsymbol{z}^{(i)}$ (for the $i$-th example), which could stand for an (input, target) pair $\boldsymbol{z}^{(i)} = (\boldsymbol{x}^{(i)}, y^{(i)})$ in the case of supervised learning, or for just an input $\boldsymbol{z}^{(i)} = \boldsymbol{x}^{(i)}$ in the case of unsupervised learning. The algorithm returns the vector of errors $\boldsymbol{e}$ for each example in $\mathbb{D}$, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (equation 5.47). While these confidence intervals are not well-justified after the use of cross-validation, it is still common practice to use them to declare that algorithm $A$ is better than algorithm $B$ only if the confidence interval of the error of algorithm $A$ lies below and does not intersect the confidence interval of algorithm $B$.
- One problem is that there exist no unbiased estimators of the variance of such average error estimators, but approximations are typically used.
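A sketch of the $k$-fold procedure described above. Here `learn` is assumed to return a fitted model and `loss` to return one error per test example; both names are placeholders, and the fold assignment is a simple shuffled split.

```python
import numpy as np

def k_fold_cross_validation(learn, loss, X, y, k=5, seed=0):
    """Return the vector of per-example errors e; its mean estimates the
    generalization error of the learning algorithm `learn`."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # k non-overlapping subsets
    errors = np.empty(len(X))
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = learn(X[train_idx], y[train_idx])
        errors[test_idx] = loss(model, X[test_idx], y[test_idx])
    return errors
```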
Estimators, Bias and Variance
- The field of statistics gives us many tools that can be used to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.
Point Estimation
- In order to distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter $\boldsymbol{\theta}$ by $\hat{\boldsymbol{\theta}}$.
- Let $\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\}$ be a set of $m$ independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:
$$\hat{\boldsymbol{\theta}}_m = g\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\right)$$
The definition does not require that $g$ return a value that is close to the true $\boldsymbol{\theta}$, or even that the range of $g$ be the same as the set of allowable values of $\boldsymbol{\theta}$. This definition of a point estimator is very general and allows the designer of an estimator great flexibility. While almost any function thus qualifies as an estimator, a good estimator is a function whose output is close to the true underlying $\boldsymbol{\theta}$ that generated the training data.
- For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value $\boldsymbol{\theta}$ is fixed but unknown, while the point estimate $\hat{\boldsymbol{\theta}}$ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore $\hat{\boldsymbol{\theta}}$ is a random variable.
- Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.
- Function Estimation: As we mentioned above, sometimes we are interested in performing function estimation (or function approximation). Here we are trying to predict a variable $\boldsymbol{y}$ given an input vector $\boldsymbol{x}$. We assume that there is a function $f(\boldsymbol{x})$ that describes the approximate relationship between $\boldsymbol{y}$ and $\boldsymbol{x}$. For example, we may assume that $\boldsymbol{y} = f(\boldsymbol{x}) + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ stands for the part of $\boldsymbol{y}$ that is not predictable from $\boldsymbol{x}$. In function estimation, we are interested in approximating $f$ with a model or estimate $\hat{f}$. Function estimation is really just the same as estimating a parameter $\boldsymbol{\theta}$; the function estimator $\hat{f}$ is simply a point estimator in function space.
- We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
Bias
- The bias of an estimator is defined as:
$$\operatorname{bias}\left(\hat{\boldsymbol{\theta}}_m\right) = \mathbb{E}\left(\hat{\boldsymbol{\theta}}_m\right) - \boldsymbol{\theta}$$
where the expectation is over the data (seen as samples from a random variable) and $\boldsymbol{\theta}$ is the true underlying value of $\boldsymbol{\theta}$ used to define the data generating distribution.
- An estimator $\hat{\boldsymbol{\theta}}_m$ is said to be unbiased if $\operatorname{bias}(\hat{\boldsymbol{\theta}}_m) = \mathbf{0}$, which implies that $\mathbb{E}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{\theta}$. An estimator $\hat{\boldsymbol{\theta}}_m$ is said to be asymptotically unbiased if $\lim_{m \to \infty} \operatorname{bias}(\hat{\boldsymbol{\theta}}_m) = \mathbf{0}$, which implies that $\lim_{m \to \infty} \mathbb{E}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{\theta}$.
Example: Bernoulli Distribution
- Consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ that are independently and identically distributed according to a Bernoulli distribution with mean $\theta$:
$$P\left(x^{(i)}; \theta\right) = \theta^{x^{(i)}}(1-\theta)^{\left(1-x^{(i)}\right)}.$$
A common estimator for the $\theta$ parameter of this distribution is the mean of the training samples:
$$\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
To determine whether this estimator is biased:
$$\begin{aligned} \operatorname{bias}\left(\hat{\theta}_m\right) &= \mathbb{E}\left[\hat{\theta}_m\right] - \theta \\ &= \mathbb{E}\left[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right] - \theta \\ &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[x^{(i)}\right] - \theta \\ &= \frac{1}{m} \sum_{i=1}^{m} \sum_{x^{(i)}=0}^{1}\left(x^{(i)} \theta^{x^{(i)}}(1-\theta)^{\left(1-x^{(i)}\right)}\right) - \theta \\ &= \frac{1}{m} \sum_{i=1}^{m}(\theta) - \theta \\ &= \theta - \theta = 0 \end{aligned}$$
Since $\operatorname{bias}(\hat{\theta}) = 0$, we say that our estimator $\hat{\theta}$ is unbiased.
Example: Gaussian Distribution Estimator of the Mean
- Now, consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ that are independently and identically distributed according to a Gaussian distribution $p\left(x^{(i)}\right) = \mathcal{N}\left(x^{(i)}; \mu, \sigma^{2}\right)$, where $i \in \{1, \ldots, m\}$.
Recall that the Gaussian probability density function is given by
$$p\left(x^{(i)}; \mu, \sigma^{2}\right) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{1}{2} \frac{\left(x^{(i)}-\mu\right)^{2}}{\sigma^{2}}\right)$$
A common estimator of the Gaussian mean parameter is known as the sample mean:
$$\hat{\mu}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
To determine the bias of the sample mean, we are again interested in calculating its expectation:
$$\begin{aligned} \operatorname{bias}\left(\hat{\mu}_m\right) &= \mathbb{E}\left[\hat{\mu}_m\right] - \mu \\ &= \mathbb{E}\left[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right] - \mu \\ &= \left(\frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[x^{(i)}\right]\right) - \mu \\ &= \left(\frac{1}{m} \sum_{i=1}^{m} \mu\right) - \mu \\ &= \mu - \mu = 0 \end{aligned}$$
Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.
Example: Estimators of the Variance of a Gaussian Distribution
- As an example, we compare two different estimators of the variance parameter $\sigma^{2}$ of a Gaussian distribution. We are interested in knowing if either estimator is biased.
The first estimator of $\sigma^{2}$ we consider is known as the sample variance:
$$\hat{\sigma}_m^{2} = \frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)} - \hat{\mu}_m\right)^{2}$$
where $\hat{\mu}_m$ is the sample mean, defined above. More formally, we are interested in computing
$$\operatorname{bias}\left(\hat{\sigma}_m^{2}\right) = \mathbb{E}\left[\hat{\sigma}_m^{2}\right] - \sigma^{2}$$
We begin by evaluating the term $\mathbb{E}\left[\hat{\sigma}_m^{2}\right]$:
$$\begin{aligned} \mathbb{E}\left[\hat{\sigma}_m^{2}\right] &= \mathbb{E}\left[\frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)} - \hat{\mu}_m\right)^{2}\right] \\ &= \frac{m-1}{m} \sigma^{2} \end{aligned}$$
Returning to the definition, we conclude that the bias of $\hat{\sigma}_m^{2}$ is $-\sigma^{2}/m$. Therefore, the sample variance is a biased estimator.
The unbiased sample variance estimator
$$\tilde{\sigma}_m^{2} = \frac{1}{m-1} \sum_{i=1}^{m}\left(x^{(i)} - \hat{\mu}_m\right)^{2}$$
provides an alternative approach. As the name suggests, this estimator is unbiased. That is, we find that $\mathbb{E}\left[\tilde{\sigma}_m^{2}\right] = \sigma^{2}$:
$$\begin{aligned} \mathbb{E}\left[\tilde{\sigma}_m^{2}\right] &= \mathbb{E}\left[\frac{1}{m-1} \sum_{i=1}^{m}\left(x^{(i)} - \hat{\mu}_m\right)^{2}\right] \\ &= \frac{m}{m-1} \mathbb{E}\left[\hat{\sigma}_m^{2}\right] \\ &= \frac{m}{m-1}\left(\frac{m-1}{m} \sigma^{2}\right) \\ &= \sigma^{2}. \end{aligned}$$
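The $(m-1)/m$ factor can be checked numerically. A small sketch (the parameter values are arbitrary) comparing the two estimators over many resampled datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, sigma2 = 5, 200_000, 4.0
samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, m))

biased = samples.var(axis=1, ddof=0)     # divides by m     (sample variance)
unbiased = samples.var(axis=1, ddof=1)   # divides by m - 1 (unbiased estimator)

print(biased.mean())    # close to (m - 1)/m * sigma2 = 3.2
print(unbiased.mean())  # close to sigma2 = 4.0
```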
We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the “best” estimators. As we will see we often use biased estimators that possess other important properties.
Variance and Standard Error
- Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance $\operatorname{Var}(\hat{\theta})$, where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted $\operatorname{SE}(\hat{\theta})$.
- The variance or the standard error of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data generating process. Just as we might like an estimator to exhibit low bias, we would also like it to have relatively low variance.
- When we compute any statistic using a finite number of samples, our estimate of the true underlying parameter is uncertain, in the sense that we could have obtained other samples from the same distribution and their statistics would have been different. The expected degree of variation in any estimator is a source of error that we want to quantify. The standard error of the mean is given by
$$\operatorname{SE}\left(\hat{\mu}_m\right) = \sqrt{\operatorname{Var}\left[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}}$$
where $\sigma^{2}$ is the true variance of the samples $x^{(i)}$. The standard error is often estimated by using an estimate of $\sigma$. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation, but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large $m$, the approximation is quite reasonable.
Example: Bernoulli Distribution
- We once again consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently and identically from a Bernoulli distribution (recall $P\left(x^{(i)}; \theta\right) = \theta^{x^{(i)}}(1-\theta)^{\left(1-x^{(i)}\right)}$). This time we are interested in computing the variance of the estimator $\hat{\theta}_m = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$:
$$\begin{aligned} \operatorname{Var}\left(\hat{\theta}_m\right) &= \operatorname{Var}\left(\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right) \\ &= \frac{1}{m^{2}} \sum_{i=1}^{m} \operatorname{Var}\left(x^{(i)}\right) \\ &= \frac{1}{m^{2}} \sum_{i=1}^{m} \theta(1-\theta) \\ &= \frac{1}{m^{2}} m \theta(1-\theta) \\ &= \frac{1}{m} \theta(1-\theta) \end{aligned}$$
The variance of the estimator decreases as a function of $m$, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency.
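This $\theta(1-\theta)/m$ behavior is easy to confirm empirically; the following sketch (with arbitrary $\theta$ and dataset sizes) compares the empirical variance of the estimator with the theoretical value:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, trials = 0.3, 100_000
for m in (10, 100, 1000):
    # Each row is one dataset of m Bernoulli samples; the row mean is theta_hat
    estimates = rng.binomial(1, theta, size=(trials, m)).mean(axis=1)
    print(m, estimates.var(), theta * (1 - theta) / m)  # empirical vs theoretical
```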
Trading off Bias and Variance to Minimize Mean Squared Error
- Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.
- What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in the figure below and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?
- The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:
$$\mathrm{MSE} = \mathbb{E}\left[\left(\hat{\theta}_m - \theta\right)^{2}\right] = \operatorname{Bias}\left(\hat{\theta}_m\right)^{2} + \operatorname{Var}\left(\hat{\theta}_m\right)$$
The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter $\theta$. As is clear from the equation above, evaluating the MSE incorporates both the bias and the variance.
- Desirable estimators are those with small MSE; these are estimators that manage to keep both their bias and variance somewhat in check.
- The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in the figure above, where we see again the U-shaped curve of generalization error as a function of capacity.
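The decomposition MSE = Bias² + Var can also be verified numerically. A sketch using the biased sample variance of a Gaussian (from the earlier example) as the estimator; the constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, trials = 4.0, 100_000
for m in (5, 20, 100):
    samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, m))
    est = samples.var(axis=1, ddof=0)          # biased sample variance as the estimator
    bias = est.mean() - sigma2                  # roughly -sigma2 / m
    variance = est.var()
    mse = np.mean((est - sigma2) ** 2)
    print(m, bias**2 + variance, mse)           # the last two columns agree
```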
Consistency
- So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points m in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that
$$\operatorname{plim}_{m \to \infty} \hat{\theta}_m = \theta$$
- The symbol plim indicates convergence in probability, meaning that for any $\epsilon > 0$, $P\left(\left|\hat{\theta}_m - \theta\right| > \epsilon\right) \to 0$ as $m \to \infty$.
The condition described by the equation above is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of $\hat{\theta}$ to $\theta$. Almost sure convergence of a sequence of random variables $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$ to a value $\boldsymbol{x}$ occurs when $p\left(\lim_{m \to \infty} \mathbf{x}^{(m)} = \boldsymbol{x}\right) = 1$.
- Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows. However, the reverse is not true: asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter $\mu$ of a normal distribution $\mathcal{N}\left(x; \mu, \sigma^{2}\right)$, with a dataset consisting of $m$ samples: $\{x^{(1)}, \ldots, x^{(m)}\}$. We could use the first sample $x^{(1)}$ of the dataset as an unbiased estimator: $\hat{\theta} = x^{(1)}$. In that case, $\mathbb{E}\left(\hat{\theta}_m\right) = \theta$, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that $\hat{\theta}_m \to \theta$ as $m \to \infty$.
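A short simulation (parameters are arbitrary) contrasting the consistent sample mean with the unbiased but inconsistent first-sample estimator described above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, trials = 2.0, 20_000
for m in (10, 1000):
    data = rng.normal(loc=mu, scale=1.0, size=(trials, m))
    sample_mean = data.mean(axis=1)   # consistent: spread shrinks as m grows
    first_sample = data[:, 0]         # unbiased but not consistent: spread never shrinks
    print(m, sample_mean.std(), first_sample.std())
```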
Maximum Likelihood Estimation
- Previously, we have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.
- The most common such principle is the maximum likelihood principle. Consider a set of $m$ examples $\mathbb{X} = \left\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\right\}$ drawn independently from the true but unknown data generating distribution $p_{\text{data}}(\mathbf{x})$.
- Let $p_{\text{model}}(\mathbf{x}; \boldsymbol{\theta})$ be a parametric family of probability distributions over the same space indexed by $\boldsymbol{\theta}$. In other words, $p_{\text{model}}(\boldsymbol{x}; \boldsymbol{\theta})$ maps any configuration $\boldsymbol{x}$ to a real number estimating the true probability $p_{\text{data}}(\boldsymbol{x})$. The maximum likelihood estimator for $\boldsymbol{\theta}$ is then defined as
$$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max}\, p_{\text{model}}(\mathbb{X}; \boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\arg\max} \prod_{i=1}^{m} p_{\text{model}}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right)$$
- This product over many probabilities can be inconvenient for a variety of reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum:
$$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max} \sum_{i=1}^{m} \log p_{\text{model}}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right)$$
- Because the arg max does not change when we rescale the cost function, we can divide by $m$ to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat{p}_{\text{data}}$ defined by the training data:
$$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max}\, \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\boldsymbol{x}; \boldsymbol{\theta})$$
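For a concrete instance of this criterion, the sketch below writes the average log-likelihood for a Gaussian model and its closed-form maximizer; the function names are assumptions for illustration.

```python
import numpy as np

def average_log_likelihood(x, mu, sigma2):
    # E_{x ~ p_hat_data} log p_model(x; theta) for a Gaussian model
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

def gaussian_mle(x):
    # Maximizing the criterion above in closed form gives the sample mean
    # and the (biased, 1/m) sample variance
    mu_ml = x.mean()
    sigma2_ml = np.mean((x - mu_ml) ** 2)
    return mu_ml, sigma2_ml
```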
- One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution $\hat{p}_{\text{data}}$ defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by
$$D_{\mathrm{KL}}\left(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\right) = \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(\boldsymbol{x}) - \log p_{\text{model}}(\boldsymbol{x})\right]$$
- The term on the left is a function only of the data generating process, not the model. This means that when we train the model to minimize the KL divergence, we need only minimize
$$-\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(\boldsymbol{x})\right]$$
- Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions.
- Many authors use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer.
- Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
- We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution $\hat{p}_{\text{data}}$. Ideally, we would like to match the true data generating distribution $p_{\text{data}}$, but we have no direct access to this distribution.
Conditional Log-Likelihood and Mean Squared Error
- The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability $P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$ in order to predict $\mathbf{y}$ given $\mathbf{x}$. This is actually the most common situation because it forms the basis for most supervised learning.
- If $\boldsymbol{X}$ represents all our inputs and $\boldsymbol{Y}$ all our observed targets, then the conditional maximum likelihood estimator is
$$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max}\, P(\boldsymbol{Y} \mid \boldsymbol{X}; \boldsymbol{\theta})$$
If the examples are assumed to be i.i.d., then this can be decomposed into
$$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max} \sum_{i=1}^{m} \log P\left(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right)$$
Example: Linear Regression as Maximum Likelihood
- Linear regression, introduced earlier, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input $\boldsymbol{x}$ and produce an output value $\hat{y}$. The mapping from $\boldsymbol{x}$ to $\hat{y}$ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction $\hat{y}$, we now think of the model as producing a conditional distribution $p(y \mid \boldsymbol{x})$. We can imagine that with an infinitely large training set, we might see several training examples with the same input value $\boldsymbol{x}$ but different values of $y$. The goal of the learning algorithm is now to fit the distribution $p(y \mid \boldsymbol{x})$ to all of those different $y$ values that are all compatible with $\boldsymbol{x}$. To derive the same linear regression algorithm we obtained before, we define $p(y \mid \boldsymbol{x}) = \mathcal{N}\left(y; \hat{y}(\boldsymbol{x}; \boldsymbol{w}), \sigma^{2}\right)$. The function $\hat{y}(\boldsymbol{x}; \boldsymbol{w})$ gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant $\sigma^{2}$ chosen by the user. We will see that this choice of the functional form of $p(y \mid \boldsymbol{x})$ causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (equation 5.63) is given by
$$\begin{aligned} & \sum_{i=1}^{m} \log p\left(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right) \\ =\ & -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\left\|\hat{y}^{(i)} - y^{(i)}\right\|^{2}}{2\sigma^{2}} \end{aligned}$$
where $\hat{y}^{(i)}$ is the output of the linear regression on the $i$-th input $\boldsymbol{x}^{(i)}$ and $m$ is the number of training examples. Comparing the log-likelihood with the mean squared error,
$$\mathrm{MSE}_{\text{train}} = \frac{1}{m} \sum_{i=1}^{m}\left\|\hat{y}^{(i)} - y^{(i)}\right\|^{2}$$
we immediately see that maximizing the log-likelihood with respect to $\boldsymbol{w}$ yields the same estimate of the parameters $\boldsymbol{w}$ as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
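A small numerical check of this equivalence (synthetic data, fixed $\sigma = 1$; the setup is an assumption for illustration): the least squares solution also maximizes the conditional log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]   # design matrix with a bias column
w_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ w_true + rng.normal(size=50)               # unit Gaussian noise, sigma = 1

# Minimizing MSE (least squares) ...
w_mse = np.linalg.lstsq(X, y, rcond=None)[0]

def negative_log_likelihood(w, sigma=1.0):
    resid = y - X @ w
    return (len(y) * np.log(sigma) + 0.5 * len(y) * np.log(2 * np.pi)
            + np.sum(resid ** 2) / (2 * sigma ** 2))

# ... also minimizes the negative conditional log-likelihood
print(negative_log_likelihood(w_mse) <= negative_log_likelihood(w_true))  # True
```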
Properties of Maximum Likelihood
- The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples $m \to \infty$, in terms of its rate of convergence as $m$ increases.
- Under appropriate conditions, the maximum likelihood estimator has the property of consistency, meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:
  - The true distribution $p_{\text{data}}$ must lie within the model family $p_{\text{model}}(\cdot; \boldsymbol{\theta})$. Otherwise, no estimator can recover $p_{\text{data}}$.
  - The true distribution $p_{\text{data}}$ must correspond to exactly one value of $\boldsymbol{\theta}$. Otherwise, maximum likelihood can recover the correct $p_{\text{data}}$, but will not be able to determine which value of $\boldsymbol{\theta}$ was used by the data generating process.
- Consistent estimators can differ in their statistical efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples $m$, or equivalently, may require fewer examples to obtain a fixed level of generalization error.
- Statistical efficiency is typically studied in the parametric case (like in linear regression), where our goal is to estimate the value of a parameter (and assuming it is possible to identify the true parameter), not the value of a function.
- A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter values, where the expectation is over $m$ training samples from the data generating distribution. That parametric mean squared error decreases as $m$ increases, and for $m$ large, the Cramér-Rao lower bound shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.
- For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.
Bayesian Statistics
- So far we have discussed frequentist statistics and approaches based on estimating a single value of $\boldsymbol{\theta}$, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of $\boldsymbol{\theta}$ when making a prediction. The latter is the domain of Bayesian statistics.
- The frequentist perspective is that the true parameter value $\boldsymbol{\theta}$ is fixed but unknown, while the point estimate $\hat{\boldsymbol{\theta}}$ is a random variable on account of it being a function of the dataset (which is seen as random).
- The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter $\boldsymbol{\theta}$ is unknown or uncertain and thus is represented as a random variable.
- Before observing the data, we represent our knowledge of $\boldsymbol{\theta}$ using the prior probability distribution, $p(\boldsymbol{\theta})$ (sometimes referred to simply as "the prior"). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e., with high entropy) to reflect a high degree of uncertainty in the value of $\boldsymbol{\theta}$ before observing any data. For example, one might assume a priori that $\boldsymbol{\theta}$ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for "simpler" solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).
- Now consider that we have a set of data samples $\{x^{(1)}, \ldots, x^{(m)}\}$. We can recover the effect of the data on our belief about $\boldsymbol{\theta}$ by combining the data likelihood $p\left(x^{(1)}, \ldots, x^{(m)} \mid \boldsymbol{\theta}\right)$ with the prior via Bayes' rule:
$$p\left(\boldsymbol{\theta} \mid x^{(1)}, \ldots, x^{(m)}\right) = \frac{p\left(x^{(1)}, \ldots, x^{(m)} \mid \boldsymbol{\theta}\right) p(\boldsymbol{\theta})}{p\left(x^{(1)}, \ldots, x^{(m)}\right)}$$
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.
- Relative to maximum likelihood estimation, Bayesian estimation offers two important differences.
- First, unlike the maximum likelihood approach that makes predictions using a point estimate of $\boldsymbol{\theta}$, the Bayesian approach is to make predictions using a full distribution over $\boldsymbol{\theta}$. For example, after observing $m$ examples, the predicted distribution over the next data sample, $x^{(m+1)}$, is given by
$$p\left(x^{(m+1)} \mid x^{(1)}, \ldots, x^{(m)}\right) = \int p\left(x^{(m+1)} \mid \boldsymbol{\theta}\right) p\left(\boldsymbol{\theta} \mid x^{(1)}, \ldots, x^{(m)}\right) d\boldsymbol{\theta}$$
Here each value of $\boldsymbol{\theta}$ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed $\{x^{(1)}, \ldots, x^{(m)}\}$, if we are still quite uncertain about the value of $\boldsymbol{\theta}$, then this uncertainty is incorporated directly into any predictions we might make.
- We have discussed how the frequentist approach addresses the uncertainty in a given point estimate of $\boldsymbol{\theta}$ by evaluating its variance. The variance of the estimator is an assessment of how the estimate might change with alternative samplings of the observed data.
- The Bayesian answer to the question of how to deal with the uncertainty in the estimator is to simply integrate over it, which tends to protect well against overfitting. This integral is of course just an application of the laws of probability, making the Bayesian approach simple to justify, while the frequentist machinery for constructing an estimator is based on the rather ad hoc decision to summarize all knowledge contained in the dataset with a single point estimate.
- The second important difference between the Bayesian approach to estimation and the maximum likelihood approach is due to the contribution of the Bayesian prior distribution. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or more smooth. Critics of the Bayesian approach identify the prior as a source of subjective human judgment impacting the predictions.
- Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high computational cost when the number of training examples is large.
Example: Bayesian Linear Regression
- Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector $\boldsymbol{x} \in \mathbb{R}^{n}$ to predict the value of a scalar $y \in \mathbb{R}$. The prediction is parametrized by the vector $\boldsymbol{w} \in \mathbb{R}^{n}$:
$$\hat{y} = \boldsymbol{w}^{\top}\boldsymbol{x}$$
Given a set of $m$ training samples $\left(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})}\right)$, we can express the prediction of $y$ over the entire training set as:
$$\hat{\boldsymbol{y}}^{(\text{train})} = \boldsymbol{X}^{(\text{train})}\boldsymbol{w}$$
Expressed as a Gaussian conditional distribution on $\boldsymbol{y}^{(\text{train})}$, we have
$$\begin{aligned} p\left(\boldsymbol{y}^{(\text{train})} \mid \boldsymbol{X}^{(\text{train})}, \boldsymbol{w}\right) &= \mathcal{N}\left(\boldsymbol{y}^{(\text{train})}; \boldsymbol{X}^{(\text{train})}\boldsymbol{w}, \boldsymbol{I}\right) \\ &\propto \exp\left(-\frac{1}{2}\left(\boldsymbol{y}^{(\text{train})} - \boldsymbol{X}^{(\text{train})}\boldsymbol{w}\right)^{\top}\left(\boldsymbol{y}^{(\text{train})} - \boldsymbol{X}^{(\text{train})}\boldsymbol{w}\right)\right) \end{aligned}$$
where we follow the standard MSE formulation in assuming that the Gaussian variance on $y$ is one. In what follows, to reduce the notational burden, we refer to $\left(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})}\right)$ as simply $(\boldsymbol{X}, \boldsymbol{y})$.
To determine the posterior distribution over the model parameter vector $\boldsymbol{w}$, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution expressing a high degree of uncertainty about $\boldsymbol{\theta}$. For real-valued parameters it is common to use a Gaussian as a prior distribution:
$$p(\boldsymbol{w}) = \mathcal{N}\left(\boldsymbol{w}; \boldsymbol{\mu}_{0}, \boldsymbol{\Lambda}_{0}\right) \propto \exp\left(-\frac{1}{2}\left(\boldsymbol{w} - \boldsymbol{\mu}_{0}\right)^{\top}\boldsymbol{\Lambda}_{0}^{-1}\left(\boldsymbol{w} - \boldsymbol{\mu}_{0}\right)\right),$$
where $\boldsymbol{\mu}_{0}$ and $\boldsymbol{\Lambda}_{0}$ are the prior distribution mean vector and covariance matrix respectively. With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters.
$$\begin{aligned} p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) &\propto p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, p(\boldsymbol{w}) \\ &\propto \exp\left(-\frac{1}{2}(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})^{\top}(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})\right) \exp\left(-\frac{1}{2}\left(\boldsymbol{w} - \boldsymbol{\mu}_{0}\right)^{\top}\boldsymbol{\Lambda}_{0}^{-1}\left(\boldsymbol{w} - \boldsymbol{\mu}_{0}\right)\right) \\ &\propto \exp\left(-\frac{1}{2}\left(-2\boldsymbol{y}^{\top}\boldsymbol{X}\boldsymbol{w} + \boldsymbol{w}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{w} + \boldsymbol{w}^{\top}\boldsymbol{\Lambda}_{0}^{-1}\boldsymbol{w} - 2\boldsymbol{\mu}_{0}^{\top}\boldsymbol{\Lambda}_{0}^{-1}\boldsymbol{w}\right)\right) \end{aligned}$$
We now define $\boldsymbol{\Lambda}_{m} = \left(\boldsymbol{X}^{\top}\boldsymbol{X} + \boldsymbol{\Lambda}_{0}^{-1}\right)^{-1}$ and $\boldsymbol{\mu}_{m} = \boldsymbol{\Lambda}_{m}\left(\boldsymbol{X}^{\top}\boldsymbol{y} + \boldsymbol{\Lambda}_{0}^{-1}\boldsymbol{\mu}_{0}\right)$.
Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:
$$\begin{aligned} p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) &\propto \exp\left(-\frac{1}{2}\left(\boldsymbol{w} - \boldsymbol{\mu}_{m}\right)^{\top}\boldsymbol{\Lambda}_{m}^{-1}\left(\boldsymbol{w} - \boldsymbol{\mu}_{m}\right) + \frac{1}{2}\boldsymbol{\mu}_{m}^{\top}\boldsymbol{\Lambda}_{m}^{-1}\boldsymbol{\mu}_{m}\right) \\ &\propto \exp\left(-\frac{1}{2}\left(\boldsymbol{w} - \boldsymbol{\mu}_{m}\right)^{\top}\boldsymbol{\Lambda}_{m}^{-1}\left(\boldsymbol{w} - \boldsymbol{\mu}_{m}\right)\right) \end{aligned}$$
All terms that do not include the parameter vector $\boldsymbol{w}$ have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1.
Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set $\boldsymbol{\mu}_{0}$ to $\mathbf{0}$. If we set $\boldsymbol{\Lambda}_{0} = \frac{1}{\alpha}\boldsymbol{I}$, then $\boldsymbol{\mu}_{m}$ gives the same estimate of $\boldsymbol{w}$ as does frequentist linear regression with a weight decay penalty of $\alpha\boldsymbol{w}^{\top}\boldsymbol{w}$.
One difference is that the Bayesian estimate is undefined if $\alpha$ is set to zero; we are not allowed to begin the Bayesian learning process with an infinitely wide prior on $\boldsymbol{w}$.
The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of $\boldsymbol{w}$ are, rather than providing only the estimate $\boldsymbol{\mu}_{m}$.
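A compact sketch of the posterior update derived above (the function name is an assumption); with $\boldsymbol{\mu}_0 = \mathbf{0}$ and $\boldsymbol{\Lambda}_0 = \frac{1}{\alpha}\boldsymbol{I}$, the returned mean matches ridge regression, as noted in the text.

```python
import numpy as np

def bayesian_linear_regression_posterior(X, y, mu0, Lambda0):
    # Lambda_m = (X^T X + Lambda_0^{-1})^{-1}
    # mu_m     = Lambda_m (X^T y + Lambda_0^{-1} mu_0)
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
    mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)
    return mu_m, Lambda_m
```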
Maximum A Posteriori (MAP) Estimation
- While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter $\boldsymbol{\theta}$, it is still often desirable to have a single point estimate.
- One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate.
- The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous $\boldsymbol{\theta}$):
$$\boldsymbol{\theta}_{\mathrm{MAP}} = \underset{\boldsymbol{\theta}}{\arg\max}\, p(\boldsymbol{\theta} \mid \boldsymbol{x}) = \underset{\boldsymbol{\theta}}{\arg\max}\, \log p(\boldsymbol{x} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})$$
We recognize, above on the right-hand side, $\log p(\boldsymbol{x} \mid \boldsymbol{\theta})$, i.e. the standard log-likelihood term, and $\log p(\boldsymbol{\theta})$, corresponding to the prior distribution.
- As an example, consider a linear regression model with a Gaussian prior on the weights $\boldsymbol{w}$. If this prior is given by $\mathcal{N}\left(\boldsymbol{w}; \mathbf{0}, \frac{1}{\lambda}\boldsymbol{I}^{2}\right)$, then the log-prior term in the equation above is proportional to the familiar $\lambda\boldsymbol{w}^{\top}\boldsymbol{w}$ weight decay penalty, plus a term that does not depend on $\boldsymbol{w}$ and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.
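A minimal sketch of this correspondence, assuming a unit-variance Gaussian likelihood (the function name and setup are illustrative): the MAP objective reduces to least squares plus an L2 penalty, i.e. ridge regression.

```python
import numpy as np

def map_weights(X, y, lam):
    # argmax_w log p(y | X, w) + log p(w) with prior N(w; 0, (1/lam) I)
    # <=> argmin_w ||y - X w||^2 + lam * w^T w   (weight decay / ridge regression)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```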
- As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data. This additional information helps to reduce the variance in the MAP point estimate (in comparison to the ML estimate). However, it does so at the price of increased bias.
- Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference. This view applies when the regularization consists of adding an extra term to the objective function that corresponds to $\log p(\boldsymbol{\theta})$. Not all regularization penalties correspond to MAP Bayesian inference. For example, some regularizer terms may not be the logarithm of a probability distribution. Other regularization terms depend on the data, which of course a prior probability distribution is not allowed to do.
- MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms. For example, a more complicated penalty term can be derived by using a mixture of Gaussians, rather than a single Gaussian distribution, as the prior.