Statistics Concept Study: Chapter 2

2021-01-01 05:01:06 Author: Internet

Chapter 2: Data and Sampling Distributions

Sampling and distributions

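sample:
- a subset drawn from a larger data set
population:
- the larger data set from which a sample is drawn
random sampling:
- draw elements into a sample at random
- random sampling can be done with replacement or without replacement
- data quality > data quantity
stratified sampling (strata = layers):
- divide the population into strata and randomly sample from each stratum
simple random sample: not a stratified sample
- random sampling without stratifying the population
sample bias: a bad sample, one that cannot describe the whole population
- a sample that misrepresents the population
self-selection sampling bias:
- example: product reviews usually do not reflect a product's true quality, because people are more often moved to write a review by dissatisfaction (for instance after receiving a defective item) than by how good the product is
statistical bias: can be understood as "systematic error"
- systematic measurement or sampling errors
example of stratified sampling:
- when polling a politician's approval rating, respondents are split by race/ethnicity: white, black, Hispanic
sample mean vs. population mean
- the sample mean and the population mean are two different concepts: the first is computed from the sample, the second describes the whole population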
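
A minimal sketch of the sampling ideas above, using only the Python standard library. The function names, the two-group population, and the sample sizes are all made up for illustration; they are not from the chapter.

```python
import random

def simple_random_sample(population, k, replace, rng):
    """Draw k elements at random, with or without replacement."""
    if replace:
        # With replacement: the same element may be drawn more than once.
        return [rng.choice(population) for _ in range(k)]
    # Without replacement: every drawn element is distinct.
    return rng.sample(population, k)

def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Split the population into strata, then sample randomly inside each one."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    return {
        name: rng.sample(members, min(k_per_stratum, len(members)))
        for name, members in strata.items()
    }

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: 90 members of group "A" and 10 of group "B".
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]

without_repl = simple_random_sample(population, 20, replace=False, rng=rng)
with_repl = simple_random_sample(population, 20, replace=True, rng=rng)
by_group = stratified_sample(population, lambda item: item[0], 5, rng)
```

A simple random sample of 20 may contain very few "B" members or none at all, while the stratified version guarantees 5 from each group. That is the reason pollsters stratify small subgroups.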

regression to the mean:
- in successive measurements, extreme observations tend to be followed by more central ones
- example: rookie of the year followed by a sophomore slump (the standout of one season tends to fall back toward average the next)
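
Regression to the mean can be seen in a small simulation. This is only a sketch under an assumed model (observed performance = stable skill + independent luck each season); the model, the player count, and the variable names are assumptions, not something the chapter specifies.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
n_players = 2000

# Assumed model: each season's result is true skill plus fresh random luck.
skill = [rng.gauss(0, 1) for _ in range(n_players)]
season1 = [s + rng.gauss(0, 1) for s in skill]
season2 = [s + rng.gauss(0, 1) for s in skill]

# Pick the 100 best performers of season 1 (the "rookies of the year").
top = sorted(range(n_players), key=lambda i: season1[i], reverse=True)[:100]

top_s1 = statistics.mean(season1[i] for i in top)
top_s2 = statistics.mean(season2[i] for i in top)
overall_s2 = statistics.mean(season2)

print(f"top group, season 1: {top_s1:.2f}")
print(f"top group, season 2: {top_s2:.2f}")
print(f"everyone,  season 2: {overall_s2:.2f}")
```

The top group was selected partly for good luck, and luck does not carry over, so its second-season average drops toward the overall mean while still staying above it.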

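bootstrap: sampling with replacement
- from the sample set we already have, sample again with replacement to form a new sample set (a resample)
- sampling with replacement can equivalently be thought of as copying the sample set millions of times and then sampling without replacement
bootstrap algorithm:
1. draw one value, record it, put it back
2. repeat n times
3. record the statistic of interest for these n values (mean, std, median, etc.)
4. repeat steps 1-3 R times
5. use the R results to:
  - compute their std ----> this estimates the standard error of the statistic, i.e. its sampling variability
  - draw a boxplot/histogram
  - find a confidence interval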
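
The bootstrap steps above can be sketched as follows. The helper name `bootstrap_statistics`, the simulated data, and R = 1000 are illustrative assumptions.

```python
import random
import statistics

def bootstrap_statistics(sample, statistic, n_resamples, rng):
    """Return the statistic computed on n_resamples bootstrap resamples."""
    n = len(sample)
    results = []
    for _ in range(n_resamples):                            # step 4: repeat R times
        resample = [rng.choice(sample) for _ in range(n)]   # steps 1-2: n draws with replacement
        results.append(statistic(resample))                 # step 3: record the statistic
    return results

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical observed sample of 50 values.
sample = [rng.gauss(100, 15) for _ in range(50)]

boot_means = bootstrap_statistics(sample, statistics.mean, 1000, rng)
boot_se = statistics.stdev(boot_means)  # step 5: std of the R results

print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"bootstrap standard error of the mean: {boot_se:.2f}")
```

For the mean, the result should land near the textbook value s/sqrt(n). The point of the bootstrap is that the same code works for statistics such as the median, where no simple formula is at hand.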
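resampling vs. bootstrapping
- the two can be treated as synonyms, except that resampling also covers permutation procedures
- benefit 1: a standard error can be obtained without fitting a complicated model
- benefit 2: combining models trained on multiple bootstrap samples (bagging) can perform better than a single model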

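using the bootstrap to compute a confidence interval:
1. draw n data points with replacement
2. compute the statistic of interest for these data points (e.g. mean/std)
3. repeat steps 1-2 R times
4. for an x% confidence interval, trim [(100-x)/2]% of the R results from each end of their distribution
5. the trim points are the two endpoints of the confidence interval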
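
A minimal sketch of this percentile procedure. The function name, the simulated data, and the choices x = 90 and R = 1000 are assumptions for illustration.

```python
import random
import statistics

def bootstrap_percentile_ci(sample, statistic, confidence_pct, n_resamples, rng):
    """Percentile bootstrap interval: sort the R results and trim both tails."""
    n = len(sample)
    results = sorted(
        statistic([rng.choice(sample) for _ in range(n)])   # steps 1-2
        for _ in range(n_resamples)                         # step 3
    )
    # Step 4: number of results to trim from each end, (100 - x)/2 percent of R.
    n_trim = round(n_resamples * (100 - confidence_pct) / 200)
    # Step 5: the trim points are the interval endpoints.
    return results[n_trim], results[n_resamples - 1 - n_trim]

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical observed sample of 50 values.
sample = [rng.gauss(100, 15) for _ in range(50)]

low, high = bootstrap_percentile_ci(sample, statistics.mean, 90, 1000, rng)
print(f"90% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

With x = 90 and R = 1000, 50 results are trimmed from each end, so the interval runs from the 51st to the 950th sorted value.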
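purpose of a confidence interval:
- express an estimate as a range instead of a single number
- see how much uncertainty we can tolerate
- judge how much error our estimate may carry
- check whether a larger sample is needed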

the 90% confidence interval is:
- x̄ ± t_(n-1)(.05) * (s/√n) ----> x̄: sample mean, s: sample std, n: sample size, t_(n-1)(.05): t value with n-1 degrees of freedom that leaves 5% in each tail
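
A small worked example of the interval, with the standard error taken as s divided by the square root of n. The ten data values are invented, and the critical value 1.833 (9 degrees of freedom, 5% in each tail) is hard-coded from a standard t-table because the Python standard library has no t-distribution function.

```python
import math
import statistics

def t_interval(data, t_crit):
    """Mean +/- t_crit * s / sqrt(n), where s is the sample standard deviation."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)            # sample std (divides by n - 1)
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

# Hypothetical sample, n = 10, so the t value uses n - 1 = 9 degrees of freedom.
data = [48, 52, 50, 47, 53, 51, 49, 50, 46, 54]
T_CRIT_9DF_90 = 1.833  # from a t-table: 9 degrees of freedom, 5% in each tail

low, high = t_interval(data, T_CRIT_9DF_90)
print(f"90% confidence interval: ({low:.2f}, {high:.2f})")
```

For a different sample size the critical value has to be looked up again, since it depends on n - 1.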

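If an event has only two possible outcomes (yes or no), it can be modeled with a binomial distribution, which describes the number of "yes" outcomes in a fixed number of trials.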
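
As an illustration, the binomial probability of exactly k "yes" outcomes in n independent trials, each with success probability p, is C(n, k) * p^k * (1-p)^(n-k). The function name and the coin-flip numbers below are just an example.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent yes/no trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: chance of exactly 5 heads in 10 fair coin flips.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(p_five_heads)

# The probabilities over all possible k add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```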

exponential distribution:
- the distribution of the time or distance from one event to the next event
Weibull distribution:
- a generalized version of the exponential distribution in which the event rate is allowed to shift over time
lambda:
- the rate at which events occur
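
A quick simulation sketch of these two distributions using the standard library's `random.expovariate` and `random.weibullvariate`. The rate of 2 events per unit time and the shape value 1.5 are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
lam = 2.0               # assumed rate: 2 events per unit of time

# Exponential: gaps between events at a constant rate; mean gap is 1 / lambda.
exp_gaps = [rng.expovariate(lam) for _ in range(10000)]

# Weibull: weibullvariate(scale, shape). With shape 1 it reduces to the exponential.
weib_shape1 = [rng.weibullvariate(1 / lam, 1.0) for _ in range(10000)]

# Shape above 1 means the event rate increases over time; below 1 it decreases.
weib_rising = [rng.weibullvariate(1 / lam, 1.5) for _ in range(10000)]

print(f"exponential mean gap:       {statistics.mean(exp_gaps):.3f}")
print(f"Weibull (shape 1) mean gap: {statistics.mean(weib_shape1):.3f}")
print(f"Weibull (shape 1.5) gap std: {statistics.stdev(weib_rising):.3f}")
```

The first two means should both sit near 1 / lambda = 0.5, which shows the exponential as the special case of the Weibull with a constant event rate.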

Source: https://www.cnblogs.com/niemand-01/p/14219311.html