逻辑回归

代价函数(Cost Function)

对于线性回归模型,我们定义的代价函数J为:

图片 1

现在对于逻辑回归模型我们沿用此定义,但问题在于逻辑回归的假设函数为 hθ(x) = g(θ^T x),其中函数 g 为 S 形函数(sigmoid),故代价函数 J 将会变为像下图中左边的图那样,我们将其称为非凸函数(non-convex function)。

图片 2

注:国内部分教材中凸函数的定义与国外惯例相反,本文沿用国外(国际通行)的定义。

这意味着代价函数 J 存在许多局部最小值,从而影响梯度下降算法收敛到全局最小值。为了使代价函数 J 变为像上图中右边的图那样的凸函数,我们将代价函数 J 重新定义为:

图片 3

其中

图片 4

hθ与Cost之间的函数关系如下图所示:

图片 5

y=1:

  • hθ -> 1,则Cost = 0;
  • hθ -> 0,则Cost -> ∞。

y=0:

  • hθ -> 0,则Cost = 0;
  • hθ -> 1,则Cost -> ∞。
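
结合上面两种情况,图片 3、图片 4 所示的分段代价函数按课程中的标准写法应当如下(此处为笔者按上下文补写的 LaTeX 形式,仅供对照):

    \mathrm{Cost}\big(h_\theta(x),\, y\big) =
    \begin{cases}
      -\log\big(h_\theta(x)\big), & y = 1 \\[4pt]
      -\log\big(1 - h_\theta(x)\big), & y = 0
    \end{cases}
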
过拟合问题(The Problem of Overfitting)

图片 6

如上图所示,第一个采用单变量线性回归模型来拟合数据集,但其效果并不好,我们将这种情况称为欠拟合(Underfitting)或高偏差(High Bias);第二个采用二次多项式的线性回归模型来拟合数据集,其效果恰到好处,即所谓的“Just Right”;第三个采用四次多项式的线性回归模型来拟合数据集,虽然它对训练数据拟合得非常好,但曲线忽上忽下,难以对新数据做出准确预测,我们将这种情况称为过拟合(Overfitting)或高方差(High Variance)。

除此之外,逻辑回归模型也存在上述情况,如下图所示:

图片 7

根据在线性回归模型中的分析,我们不难得知第一个为欠拟合,第二个最合适,第三个过拟合。

现在我们来看看过拟合的定义:

图片 8

即若数据集中存在许多特征变量,我们使用高次多项式来拟合数据集,看似将数据集中的每个数据都拟合得很好,但对新数据的预测效果却很差,即泛化能力较差(泛化指一个假设模型应用到新样本的能力),这时我们将其称为过拟合。
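
为直观起见,下面给出一个最小的 Python 草稿(基于 numpy;数据为笔者随意构造的假设数据,并非课程内容),演示高次多项式虽然把训练集拟合得很好,但在新样本上的误差反而更大:

    import numpy as np

    rng = np.random.default_rng(0)

    # 构造带噪声的训练数据与新样本(真实关系近似为 y = 2x)
    x_train = np.linspace(0, 1, 8)
    y_train = 2 * x_train + rng.normal(scale=0.2, size=x_train.size)
    x_new = np.linspace(0, 1, 50)
    y_new = 2 * x_new + rng.normal(scale=0.2, size=x_new.size)

    for degree in (1, 7):
        coef = np.polyfit(x_train, y_train, degree)  # 拟合 degree 次多项式
        train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
        new_err = np.mean((np.polyval(coef, x_new) - y_new) ** 2)
        print(f"degree={degree}: 训练误差 {train_err:.4f}, 新样本误差 {new_err:.4f}")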

Question:
Consider the medical diagnosis problem of classifying tumors as
malignant or benign. If a hypothesis hθ(x) has overfit the
training set, it means that:
A. It makes accurate predictions for examples in the training set and
generalizes well to make accurate predictions on new, previously unseen
examples.
B. It does not make accurate predictions for examples in the training
set, but it does generalize well to make accurate predictions on new,
previously unseen examples.
C. It makes accurate predictions for examples in the training set, but
it does not generalize well to make accurate predictions on new,
previously unseen examples.
D. It does not make accurate predictions for examples in the training
set and does not generalize well to make accurate predictions on new,
previously unseen examples.

根据过拟合的定义我们不难得知C为正确答案。

针对过拟合问题,我们有如下方法来解决:

  1. 减少特征变量的个数:
    • 人工选择特征变量
    • 使用模型选择算法,自动选择特征变量
  2. 正则化:保留所有特征变量,但减小参数θj的值
补充笔记
Cost Function

We cannot use the same cost function that we use for linear regression
because the Logistic Function will cause the output to be wavy, causing
many local optima. In other words, it will not be a convex function.

Instead, our cost function for logistic regression looks like:

图片 9

When y = 1, we get the following plot for J vs hθ:

图片 10

Similarly, when y = 0, we get the following plot for J vs hθ:

图片 11

图片 12

If our correct answer ‘y’ is 0, then the cost function will be 0 if our
hypothesis function also outputs 0. If our hypothesis approaches 1, then
the cost function will approach infinity.

If our correct answer ‘y’ is 1, then the cost function will be 0 if our
hypothesis function outputs 1. If our hypothesis approaches 0, then the
cost function will approach infinity.

Note that writing the cost function in this way guarantees that J is
convex for logistic regression.
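
As a rough illustration, here is a minimal NumPy sketch of this cost (the helper names below are my own, not from the course):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_cost(theta, X, y):
        """Average cross-entropy cost over m examples.
        X: (m, n) design matrix, y: (m,) labels in {0, 1}."""
        m = y.size
        h = sigmoid(X @ theta)
        return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m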

The Problem of Overfitting

Consider the problem of predicting y from x ∈ R. The leftmost figure
below shows the result of fitting y = θ0 + θ1x to a dataset. We see that
the data doesn't really lie on a straight line, and so the fit is not
very good.

图片 13

图片 14

Underfitting, or high bias, is when the form of our hypothesis function
h maps poorly to the trend of the data. It is usually caused by a
function that is too simple or uses too few features. At the other
extreme, overfitting, or high variance, is caused by a hypothesis
function that fits the available data but does not generalize well to
predict new data. It is usually caused by a complicated function that
creates a lot of unnecessary curves and angles unrelated to the data.

This terminology is applied to both linear and logistic regression.
There are two main options to address the issue of overfitting:

  1. Reduce the number of features:
    • Manually select which features to keep.
    • Use a model selection algorithm (studied later in the course).
  2. Regularization
    • Keep all the features, but reduce the magnitude of parameters
      θj.
    • Regularization works well when we have a lot of slightly useful
      features.
梯度下降算法

之前为了得到一个凸函数,我们重新定义代价函数J为:

图片 15

对于Cost函数而言,它在y=1和y=0时有不同的函数表达式。为此我们可以将其合并为如下表达式:

图片 16

因此,代价函数J可以简化为:

图片 17
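
图片 16、图片 17 对应的公式,按课程中的标准写法应为(笔者补写,供对照):

    \mathrm{Cost}\big(h_\theta(x),\, y\big) = -\,y \log\big(h_\theta(x)\big) - (1 - y)\log\big(1 - h_\theta(x)\big)

    J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big)\log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big]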

既然有了上述的代价函数 J,我们就要使用梯度下降算法来求得使代价函数 J 最小化的参数 θ。

图片 18(梯度下降算法)

为了方便推导出迭代表达式,我们假设在单个样本下的代价函数J为:

图片 19

则:

图片 20

其中:

图片 21

注:上述推导过程摘自 bitcarmanlee 博主的《logistic回归详解:损失函数(cost function)详解》和《logistic回归详解:梯度下降训练方法》这两篇文章,谢谢!

因此对于全体样本其梯度下降算法为:

图片 22
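
图片 22 所示的迭代式,按课程中的标准写法应为(笔者补写;所有 θj 需同步更新):

    \theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)\, x_j^{(i)}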

虽然该表达式和线性回归模型下的梯度下降算法表达式相同,但其hθ是不同的。逻辑回归模型下的假设函数hθ为:

图片 23

在使用梯度下降算法时,我们仍然推荐将参数θ和特征变量x向量化,其表达式为:

图片 24
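
下面是一个向量化实现的最小 Python 草稿(变量与函数名为笔者自拟,并非课程代码),展示对全体样本做一次批量梯度下降更新:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_step(theta, X, y, alpha):
        """对全体样本做一次向量化的梯度下降更新。
        X: (m, n) 设计矩阵(首列为 1),y: (m,) 标签,alpha: 学习率。"""
        m = y.size
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # 一次性得到所有偏导数
        return theta - alpha * grad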

代价函数(Cost Function)

若假设函数 hθ(x) = θ0 + θ1x + θ2x^2 + θ3x^3 + θ4x^4,则会出现对下图数据集过拟合的情况。

图片 25

现假设所有的特征变量x都是非常重要的,因此我们不能舍弃任何一个特征变量x。为了解决这个问题,我们使用正则化的方法将参数θj的值变小。

为此我们需要将代价函数J(θ)修改为如下图所示那样:

图片 26

当我们使用梯度下降算法或其他高级算法求得使代价函数J(θ)最小化的参数θ时,其中θ3和θ4的值相比之前对新数据预测的影响要小。为什么呢?

这是因为通过正则化方法,在使代价函数J(θ)最小化时,θ3和θ4的值会非常接近于0。因此,假设函数hθ(x)甚至可以改写为hθ(x) = θ0 + θ1x + θ2x^2。

图片 27

如若某个数据集中有非常多的特征变量x且每个特征变量都非常重要,为了避免过拟合问题,我们可将代价函数J(θ)修改为:

图片 28
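
图片 28 对应的正则化代价函数,按课程中针对线性回归的标准写法应为(笔者补写;注意正则化项从 j = 1 开始,不包含 θ0):

    J(\theta) = \frac{1}{2m}\left[\, \sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)^{2} + \lambda \sum_{j=1}^{n}\theta_j^{2} \right]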

其中 λ 称为正则化参数(Regularization Parameter)。因此,我们将这种方法称为正则化。

注:按照惯例,正则化项从 j = 1 开始求和,即此处我们无需对 θ0 进行惩罚。

对于正则化参数λ的选择我们也要慎重,一旦其值过大,则θ1、θ2、θ3和θ4都会非常接近于0。此时,假设函数hθ(x)甚至可以改写为hθ(x) = θ0。

图片 29

其结果如图中红线所示,这样就出现了欠拟合问题。

补充笔记
Simplified Cost Function and Gradient Descent

We can compress our cost function’s two conditional cases into one case:

图片 30

Notice that when y is equal to 1, then the second term (1 − y)·log(1 − hθ(x))
will be zero and will not affect the result. If y is equal to 0, then the
first term −y·log(hθ(x)) will be zero and will not affect the result.

We can fully write out our entire cost function as follows:

图片 31

A vectorized implementation is:

图片 32

Gradient Descent

Remember that the general form of gradient descent is:

图片 33

We can work out the derivative part using calculus to get:

图片 34

Notice that this algorithm is identical to the one we used in linear
regression. We still have to simultaneously update all values in theta.

A vectorized implementation is:

图片 35

Cost Function

If we have overfitting from our hypothesis function, we can reduce the
weight that some of the terms in our function carry by increasing their
cost.

Say we wanted to make the following function more quadratic:

图片 36

We’ll want to eliminate the influence of θ3x^3 and θ4x^4. Without
actually getting rid of these features or changing the form of our
hypothesis, we can instead modify our cost function:

图片 37

We’ve added two extra terms at the end to inflate the cost of θ3 and θ4.
Now, in order for the cost function to get close to zero, we will have to
reduce the values of θ3 and θ4 to near zero. This will in turn greatly
reduce the values of θ3x^3 and θ4x^4 in our hypothesis function. As a
result, we see that the new hypothesis (depicted by the pink curve) looks
like a quadratic function but fits the data better due to the extra small
terms θ3x^3 and θ4x^4.

图片 38

We could also regularize all of our theta parameters in a single
summation as:

图片 39

The λ, or lambda, is the regularization parameter. It determines how
much the costs of our theta parameters are inflated.

Using the above cost function with the extra summation, we can smooth
the output of our hypothesis function to reduce overfitting. If lambda
is chosen to be too large, it may smooth out the function too much and
cause underfitting.
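
As a small illustration of this idea, here is a NumPy sketch of a regularized
(linear-regression style) cost and its gradient, with θ0 left out of the
penalty; the function names are my own, not from the course:

    import numpy as np

    def regularized_cost(theta, X, y, lam):
        """Squared-error cost plus an L2 penalty on theta[1:]."""
        m = y.size
        err = X @ theta - y
        return (err @ err + lam * (theta[1:] @ theta[1:])) / (2 * m)

    def regularized_gradient(theta, X, y, lam):
        """Gradient of the regularized cost; theta_0 is not penalized."""
        m = y.size
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]
        return grad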
