Statistics Fundamentals, Large-Scale Machine Learning

Online Learning

In computer science, online machine learning is a method of machine
learning in which data becomes available in a sequential order and is
used to update our best predictor for future data at each step, as
opposed to batch learning techniques which generate the best predictor
by learning on the entire training data set at once.

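Suppose we run a logistics company. Whenever a user asks what it costs to ship from location A to location B, we quote a price, and the user either accepts or declines it.

Now we want to build a model that predicts the probability that a user accepts the quote. As features we therefore use the origin, the destination, the shipping distance, the price, and user data.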


An online learning algorithm can keep updating the model from the current user's behavior, so that the model adapts to that user.
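As a rough sketch of what such an update could look like, the Python below fits a logistic regression one observation at a time. The class name `OnlineLogisticRegression`, the learning rate, and the choice of scaled numeric features are illustrative assumptions, not something the original notes specify.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Hypothetical online model: one gradient step per observed quote."""

    def __init__(self, n_features, learning_rate=0.1):
        # weights[0] is the bias term; the rest match the feature order.
        self.weights = [0.0] * (n_features + 1)
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Estimated probability that the user accepts the quote."""
        z = self.weights[0] + sum(w * xi for w, xi in zip(self.weights[1:], x))
        return sigmoid(z)

    def update(self, x, accepted):
        """Take one stochastic gradient step on a single (x, accepted) pair.

        `accepted` is 1 if the user took the quote and 0 otherwise.
        """
        error = self.predict_proba(x) - accepted
        self.weights[0] -= self.learning_rate * error
        for j, xi in enumerate(x, start=1):
            self.weights[j] -= self.learning_rate * error * xi
```

Each observation is used for a single update and can then be discarded, which is what lets the model follow changes in behavior without retraining on the full history.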

Another application is predicting the click-through rate (CTR).


Do you want to learn statistics for data science without taking a slow
and expensive course? Good news… You can master the core concepts,
probability, Bayesian thinking, and even statistical machine learning
using only free online resources. Here are the best resources for
self-starters!

Map Reduce and Data Parallelism

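Earlier we used batch gradient descent to find the optimum on a large data set, which is computationally very expensive. We therefore distribute the data set across several computers, let each computer process one subset of the data, and then sum the partial results from all of the computers.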
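The split-and-sum idea can be sketched in a few lines of Python. This is a minimal single-process illustration, assuming a linear-regression squared-error gradient; the function names are hypothetical, and in a real deployment each call to `partial_gradient` would run on a different machine.

```python
def partial_gradient(theta, shard):
    """Map step: sum of per-example gradients over one shard of the data.

    Each example is a pair (x, y), where x is a list of feature values.
    """
    grad = [0.0] * len(theta)
    for x, y in shard:
        error = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    return grad


def split_into_shards(data, n_shards):
    """Assign examples to shards round-robin (one shard per machine)."""
    return [data[i::n_shards] for i in range(n_shards)]


def map_reduce_gradient(theta, data, n_shards=4):
    """Reduce step: add the partial sums and divide by the data set size."""
    shards = split_into_shards(data, n_shards)
    partials = [partial_gradient(theta, shard) for shard in shards]
    total = [sum(column) for column in zip(*partials)]
    return [g / len(data) for g in total]
```

Because the batch gradient is a sum over examples, adding the per-shard sums gives the same result as one pass over the whole data set, up to floating-point rounding.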




By the way… you don’t need a math degree to succeed with this
approach. Yet, if you do have a math background, you’ll definitely enjoy
this fun, hands-on method too.


This guide will equip you with the tools of statistical thinking needed
for data science. It will arm you with a huge advantage over other
aspiring data scientists who try to get by without it.

You see, it can be tempting to jump directly into using machine learning
packages once you’ve learned how to program… And you know what? It’s
ok if you want to initially get the ball rolling with real projects.


But, you should never, ever completely skip learning statistics and
probability theory. It’s essential to progressing your career as a data
scientist.


Pre-requisite: Basic Python Skills


To complete this guide, you’ll need at least basic Python* programming
skills. We’ll be learning statistics in an applied, hands-on way.


Check out our guide, How to Learn Python for Data Science, The
Self-Starter Way, for the fastest way to get up to speed with Python.
We recommend at least completing up to Step 2 in that guide.

(Translator's note: the next post will cover how to learn Python for data science.)

*Note: other languages are fine too, but the examples will be in
Python.

Statistics Needed for Data Science


Statistics is a broad field with applications in many industries.


Wikipedia defines it as the study of the collection, analysis,
interpretation, presentation, and organization of data. Therefore, it
shouldn’t be a surprise that data scientists need to know statistics.


For example, data analysis requires descriptive statistics and
probability theory, at a minimum. These concepts will help you make
better business decisions from data.


Key concepts include probability distributions, statistical
significance, hypothesis testing, and regression.

Furthermore, machine learning requires understanding Bayesian thinking.
Bayesian thinking is the process of updating beliefs as additional data
is collected, and it’s the engine behind many machine learning models.


Key concepts include conditional probability, priors and posteriors,
and maximum likelihood.
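To make "updating beliefs as additional data is collected" concrete, here is a minimal Python sketch using the standard Beta-Binomial conjugate pair. The prior values and the counts in the usage note are made-up illustrations, and the function names are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    With a Beta(alpha, beta) prior on a success probability, observing
    `successes` and `failures` gives a Beta(alpha + successes,
    beta + failures) posterior.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def max_likelihood_estimate(successes, failures):
    """Maximum likelihood estimate of the success probability."""
    return successes / (successes + failures)
```

Starting from a uniform Beta(1, 1) prior and observing 7 successes and 3 failures, `update_beta(1, 1, 7, 3)` returns the posterior parameters (8, 4). The posterior mean is 8/12, which sits between the prior mean of 0.5 and the maximum likelihood estimate of 0.7.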

If those terms sound like mumbo jumbo to you, don’t worry. This will all
make sense once you roll up your sleeves and start learning.


The Best Way to Learn Statistics for Data Science

By now, you’ve probably noticed that one common theme in “the
self-starter way to learning X” is to skip classroom instruction
and learn by “doing shit.”

Mastering statistics for data science is no exception.


In fact, we’re going to tackle key statistical concepts by programming
them with code! Trust us… this will be super fun.


If you do not have formal math training, you’ll find this approach much
more intuitive than trying to decipher complicated formulas. It allows
you to think through the logical steps of each calculation.


If you do have a formal math background, this approach will help you
translate theory into practice and give you some fun programming
challenges.


Here are the 3 steps to learning the statistics and probability required
for data science:


1 Core Statistics Concepts

Descriptive statistics, distributions, hypothesis testing, and
regression.

2 Bayesian Thinking

Conditional probability, priors, posteriors, and maximum likelihood.

3 Intro to Statistical Machine Learning

Learn basic machine learning concepts and how statistics fits in.

After completing these 3 steps, you’ll be ready to attack more
difficult machine learning problems and common real-world applications
of data science.


Step 1: Core Statistics Concepts


To know how to learn statistics for data science, it’s helpful to start
by looking at how it will be used.


Let’s take a look at some examples of real analyses or applications you
might need to implement as a data scientist:

Experimental design: Your company is rolling out a new product line,
but it sells through offline retail stores. You need to design an A/B
test that controls for differences across geographies. You also need to
estimate how many stores to pilot in for statistically significant
results.
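For the "how many stores" question, one common back-of-the-envelope approach is the normal-approximation sample size formula for comparing two group means. The Python sketch below assumes equal-sized groups, a known per-store standard deviation, and a two-sided test. It does not handle the geographic controls mentioned above, and the function name and default values are illustrative.

```python
import math
from statistics import NormalDist


def stores_per_group(effect_size, std_dev, alpha=0.05, power=0.8):
    """Approximate number of stores needed in each arm of an A/B test.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) * std_dev / effect_size) ** 2,
    where effect_size is the smallest difference in means worth detecting.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_dev / effect_size) ** 2
    return math.ceil(n)
```

Halving the effect you want to detect roughly quadruples the number of stores required, which is why the minimum meaningful effect should be agreed on before the pilot starts.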

Regression modeling: Your company needs to better predict the demand
of individual product lines in its stores. Under-stocking and
over-stocking are both expensive. You consider building a series of
regularized regression models.
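As a small illustration of what "regularized" means here, the Python sketch below fits a one-feature ridge regression in closed form. It assumes an L2 penalty on the slope only, with an unpenalized intercept. The function name and the toy numbers in the usage note are hypothetical.

```python
def ridge_fit(xs, ys, penalty):
    """Fit y = intercept + slope * x with an L2 penalty on the slope.

    Minimizes sum((y - intercept - slope * x) ** 2) + penalty * slope ** 2.
    A penalty of 0 reduces to ordinary least squares.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    intercept = y_mean - slope * x_mean
    return slope, intercept
```

With `xs = [1, 2, 3, 4]` and `ys = [2, 4, 6, 8]`, a penalty of 0 recovers the exact slope of 2, while a penalty of 5 shrinks the slope to 1. That shrinkage trades some bias for lower variance, which can help when each product line has little history.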

Data transformation: You have multiple machine learning model
candidates you’re testing. Several of them assume specific
probability distributions of input data, and you need to be able to
identify them and either transform the input data appropriately or know
when underlying assumptions can be relaxed.
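One simple example of such a transformation is taking logarithms of a right-skewed, strictly positive input. The Python sketch below measures skewness before and after; the sample values are made up for illustration, and a log transform is only one of several options.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Natural log of each value; requires strictly positive inputs."""
    return [math.log(v) for v in values]


# Made-up, right-skewed sample (for example, order sizes with one outlier).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = log_transform(raw)
```

Comparing `skewness(raw)` with `skewness(logged)` shows how much the long right tail is pulled in, which is a quick check on whether the transformed input is closer to the symmetric shape a model assumes.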
