Lecture 15 Anomaly detection

Problem motivation

Gaussian distribution


Density estimation

Training set: \(\{x^{(1)},x^{(2)},\dots,x^{(m)}\}\)

Each example is \(x \in \mathbb{R}^n\)

\[\begin{aligned} x_1 &\sim N(u_1,\sigma_1^2)\\ x_2 &\sim N(u_2,\sigma_2^2)\\ &\ \,\vdots\\ x_n &\sim N(u_n,\sigma_n^2)\\ p(x) &= p(x_1,u_1,\sigma_1^2)\,p(x_2,u_2,\sigma_2^2)\cdots p(x_n,u_n,\sigma_n^2) = \prod_{j=1}^n p(x_j,u_j,\sigma_j^2) \end{aligned} \]

Anomaly detection algorithm

  1. Choose features \(x_i\) that you think might be indicative of anomalous examples

  2. Fit parameters \(u_1,u_2,\dots,u_n,\sigma_1^2,\sigma_2^2,\dots,\sigma_n^2\)



  3. Given a new example \(x\), compute \(p(x)\)

    \[p(x)=\prod_{j=1}^n p(x_j,u_j,\sigma_j^2)=\prod_{j=1}^n\frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j-u_j)^2}{2\sigma^2_j}\right) \]

    Anomaly if \(p(x)<\varepsilon\)
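The three steps above can be sketched in a few lines of NumPy. This is only an illustration: the notes do not spell out the fitting formulas, so the sketch assumes the usual maximum-likelihood estimates (per-feature mean, and variance dividing by \(m\)). The function names, the synthetic training data, and the threshold `epsilon = 0.02` are all hypothetical.

```python
import numpy as np

def fit_gaussians(X):
    """Estimate per-feature mean u_j and variance sigma_j^2 (assumed MLE, dividing by m)."""
    u = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # np.var uses ddof=0 by default, i.e. divides by m
    return u, sigma2

def density(X, u, sigma2):
    """p(x) = prod_j N(x_j; u_j, sigma_j^2), evaluated for each row of X."""
    X = np.atleast_2d(X)
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - u) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

# Synthetic "normal" training data, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 3.0], scale=[2.0, 1.0], size=(1000, 2))

u, sigma2 = fit_gaussians(X_train)
epsilon = 0.02  # hypothetical threshold

x_new = np.array([[5.0, 3.0],     # close to the mean
                  [12.0, -1.0]])  # far from the mean
p = density(x_new, u, sigma2)
is_anomaly = p < epsilon          # step 3: flag examples with p(x) < epsilon
```

For many features the product of small densities can underflow, so in practice one might sum log-densities instead; that variation is not part of the notes.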

Anomaly detection example


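\(x_1\sim N(5,4)\), \(x_2\sim N(3,1)\)

For the test example \(x_{test}^{(1)}\) with coordinates \((a,b)\), compute the density \(p(a,u_1,\sigma_1^2)\) at the horizontal coordinate \(a\) and the density \(p(b,u_2,\sigma_2^2)\) at the vertical coordinate \(b\). Their product \(p(x) = p(a,u_1,\sigma_1^2)p(b,u_2,\sigma_2^2)\) is the model's density at the point \((a,b)\) in the plane. Since \(p(x_{test}^{(1)})=0.0426 \geq \varepsilon\), it is classified as a normal example.

Similarly, \(p(x_{test}^{(2)})=0.0021 < \varepsilon\), so it is classified as an anomalous example.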
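A small sketch of this calculation for \(x_1\sim N(5,4)\), \(x_2\sim N(3,1)\), using only the standard library. The two query points and the threshold `epsilon = 0.02` are assumptions chosen for illustration; they are not the test examples from the notes and are not meant to reproduce the values 0.0426 and 0.0021.

```python
import math

def gaussian_pdf(x, u, sigma2):
    """Univariate Gaussian density with mean u and variance sigma2."""
    return math.exp(-(x - u) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def p(a, b):
    # x1 ~ N(5, 4) and x2 ~ N(3, 1), as in the example above
    return gaussian_pdf(a, 5, 4) * gaussian_pdf(b, 3, 1)

epsilon = 0.02      # hypothetical threshold
p_center = p(5, 3)  # at the mean the density is 1 / (4 * pi), about 0.0796
p_far = p(10, 6)    # 2.5 and 3 standard deviations from the mean

print(p_center >= epsilon)  # normal
print(p_far < epsilon)      # anomalous
```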

Developing and evaluating an anomaly detection system

The importance of real-number evaluation

When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.

Assume we have some labeled data, of anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous).

Training set: \(x^{(1)},x^{(2)},\dots,x^{(m)}\) (assume normal examples/not anomalous). Treat the training set as a collection of normal examples, even if a few anomalous examples have slipped in.

Cross validation set: \((x_{cv}^{(1)},y_{cv}^{(1)}),\dots,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})\)

Test set: \((x_{test}^{(1)},y_{test}^{(1)}),\dots,(x_{test}^{(m_{test})},y_{test}^{(m_{test})})\)


Aircraft engines motivating example

10000 good (normal) engines

20 flawed engines (anomalous)

Training set : 6000 good engines (y = 0)

CV : 2000 good engines (y = 0), 10 anomalous (y = 1)

Test : 2000 good engines (y = 0), 10 anomalous (y = 1)

Alternative (not recommended: with only 10000 good engines, the CV and test sets end up sharing the same 4000 good engines):

Training set : 6000 good engines (y = 0)

CV : 4000 good engines (y = 0), 10 anomalous (y = 1)

Test : 4000 good engines (y = 0), 10 anomalous (y = 1)


Algorithm evaluation

Fit model \(p(x)\) on training set \(\{x^{(1)},\dots,x^{(m)}\}\)

On a cross validation/test example \(x\), predict

\[y = \begin{cases} 1 & \text{if } p(x)<\varepsilon\ \text{(anomaly)}\\ 0 & \text{if } p(x)\geq\varepsilon\ \text{(normal)} \end{cases} \]

Possible evaluation metrics: precision, recall, \(F_1\)-score.

Can also use the cross validation set to choose the parameter \(\varepsilon\).

Try many different values of \(\varepsilon\) and pick the one that maximizes the \(F_1\)-score.
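One way this search might look, as a minimal sketch: scan a grid of candidate thresholds between the smallest and largest \(p(x)\) on the cross validation set and keep the one with the best \(F_1\)-score. The helper names, the grid of 1000 steps, and the toy values in `p_cv` and `y_cv` are assumptions for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR / (P + R), with the anomalous class (y = 1) as positive."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, num_steps=1000):
    """Try evenly spaced thresholds and return (best epsilon, best F1)."""
    lo, hi = min(p_cv), max(p_cv)
    best_eps, best_f1 = lo, 0.0
    for k in range(1, num_steps + 1):
        eps = lo + (hi - lo) * k / num_steps
        y_pred = [1 if p < eps else 0 for p in p_cv]  # anomaly if p(x) < epsilon
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Toy cross validation densities and labels (hypothetical values).
p_cv = [0.080, 0.075, 0.060, 0.050, 0.041, 0.0021, 0.0009, 0.030]
y_cv = [0, 0, 0, 0, 0, 1, 1, 0]
best_eps, best_f1 = select_epsilon(p_cv, y_cv)
print(best_eps, best_f1)
```

The final performance would then be reported on the separate test set, since \(\varepsilon\) has been tuned on the cross validation set.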

Anomaly detection vs. supervised learning

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples (y = 1) (0-20 is common). Large number of negative (y = 0) examples. | Large number of positive and negative examples. |
| Many different "types" of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set. |





| Anomaly detection | Supervised learning |
| --- | --- |
| Fraud detection | Email spam classification |
| Manufacturing (e.g. aircraft engines) | Weather prediction |
| Monitoring machines in a data center | Cancer classification |

Choosing what features to use

Non-Gaussian features




\[\begin{align} & x_1 \leftarrow \log(x_1)\\ & x_2 \leftarrow \log(x_2 + c)\\ & x_3 \leftarrow \sqrt{x_3}\\ & x_4 \leftarrow x_4^{\frac{1}{3}} \end{align} \]
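To see what such transformations do, the sketch below applies them to a deliberately skewed synthetic feature and reports the sample skewness, which is close to 0 for symmetric, Gaussian-looking data. The log-normal test data, the constant `c = 1.0`, and the use of skewness as a check are assumptions for illustration; in practice one would usually inspect a histogram as well.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    third = sum((v - mean) ** 3 for v in values) / m
    return third / var ** 1.5

random.seed(0)
# Strongly right-skewed, strictly positive synthetic feature.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

c = 1.0  # hypothetical constant for log(x + c)
transforms = {
    "raw": raw,
    "log(x)": [math.log(v) for v in raw],
    "log(x + c)": [math.log(v + c) for v in raw],
    "sqrt(x)": [math.sqrt(v) for v in raw],
    "x^(1/3)": [v ** (1 / 3) for v in raw],
}
skews = {name: skewness(vals) for name, vals in transforms.items()}
for name, s in skews.items():
    print(f"{name:>10}: skewness = {s:.2f}")
```

Which transformation works best depends on the feature; for this particular synthetic data the plain logarithm is the natural choice because the data was generated as the exponential of a Gaussian.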

Error analysis for anomaly detection

Want \(p(x)\) large for normal examples \(x\) and \(p(x)\) small for anomalous examples \(x\).

Most common problem: \(p(x)\) is comparable (say, both large) for normal and anomalous examples.



Monitoring computers in a data center

Choose features that might take on unusually large or small values in the event of an anomaly

\(x_1\)=memory use of computer

\(x_2\)=number of disk accesses/sec

\(x_3\)=CPU load

\(x_4\)=network traffic

\(x_5\)=CPU load / network traffic

\(x_6\)=(CPU load)^2 / network traffic
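The ratio features \(x_5\) and \(x_6\) are simple to compute from the raw measurements. The sketch below uses made-up readings in arbitrary units; the scenario of a machine with a busy CPU and almost no network traffic is a hypothetical illustration of a case where neither raw feature looks extreme on its own but the ratio does.

```python
def derived_features(cpu_load, network_traffic):
    """Return (x5, x6) from CPU load and network traffic (traffic must be non-zero)."""
    x5 = cpu_load / network_traffic
    x6 = cpu_load ** 2 / network_traffic
    return x5, x6

# Hypothetical readings in arbitrary units.
healthy = derived_features(cpu_load=0.5, network_traffic=50.0)
stuck = derived_features(cpu_load=0.95, network_traffic=1.0)  # busy CPU, almost no traffic

print(healthy)
print(stuck)
```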

Multivariate Gaussian distribution

Motivating example: Monitoring machines in a data center



Multivariate Gaussian (Normal) distribution

\(x\in\mathbb{R}^n\). Don't model \(p(x_1),p(x_2),\dots\) separately.

Model \(p(x)\) all in one go.

Parameters: \(u \in \mathbb{R}^n\), \(\Sigma \in \mathbb{R}^{n\times n}\) (covariance matrix)

\[p(x;u,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-u)^T\Sigma^{-1}(x-u)\right) \]

Parameter fitting:

Given training set \(\{x^{(1)},x^{(2)},\dots,x^{(m)}\}\)

\[u=\frac{1}{m}\sum_{i=1}^m x^{(i)},\qquad \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-u)(x^{(i)}-u)^T \]
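The fitting formulas and the density above translate fairly directly into NumPy. This is a sketch under assumptions: the function names, the synthetic correlated training data, and the two query points are made up for illustration.

```python
import numpy as np

def fit_multivariate(X):
    """u = mean of the rows; Sigma = (1/m) * sum of (x - u)(x - u)^T."""
    m = X.shape[0]
    u = X.mean(axis=0)
    D = X - u
    Sigma = D.T @ D / m
    return u, Sigma

def multivariate_density(X, u, Sigma):
    """p(x; u, Sigma) evaluated for each row of X."""
    X = np.atleast_2d(X)
    n = u.shape[0]
    D = X - u
    # Solve Sigma z = (x - u) instead of forming the inverse explicitly.
    Z = np.linalg.solve(Sigma, D.T).T
    quad = np.sum(D * Z, axis=1)  # (x - u)^T Sigma^{-1} (x - u)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Synthetic training data with positively correlated features.
rng = np.random.default_rng(0)
true_Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
X_train = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_Sigma, size=2000)
u, Sigma = fit_multivariate(X_train)

# Two points at the same Euclidean distance from the mean.
x_new = np.array([[1.5, 1.5],    # follows the correlation
                  [1.5, -1.5]])  # goes against it
p = multivariate_density(x_new, u, Sigma)
print(p)
```

The second point gets a much lower density than the first even though each coordinate on its own is equally unremarkable, which is the kind of unusual combination the multivariate model is meant to capture.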

Multivariate Gaussian (Normal) examples







Anomaly detection using the multivariate Gaussian distribution

Relationship to original model

Original model : \(p(x) = p(x_1,u_1,\sigma_1^2)p(x_2,u_2,\sigma_2^2)\dots p(x_n,u_n,\sigma_n^2)\)


Corresponds to multivariate Gaussian

\[p(x;u,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-u)^T\Sigma^{-1}(x-u)\right) \]

Where \(\Sigma = \left[\begin{matrix}\sigma^2_1&0&\cdots&0\\0&\sigma^2_2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\sigma^2_n\end{matrix}\right]\)
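The correspondence above can be checked numerically: with a diagonal \(\Sigma\), the multivariate density should equal the product of the univariate densities. The parameter values and the query point in this sketch are arbitrary choices for illustration.

```python
import numpy as np

u = np.array([5.0, 3.0])
sigma2 = np.array([4.0, 1.0])  # sigma_1^2, sigma_2^2
x = np.array([6.0, 2.5])       # arbitrary query point

# Original model: product of univariate Gaussian densities.
p_product = np.prod(
    np.exp(-(x - u) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
)

# Multivariate Gaussian with Sigma = diag(sigma_1^2, ..., sigma_n^2).
Sigma = np.diag(sigma2)
n = len(u)
d = x - u
p_multi = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
    (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
)

print(np.isclose(p_product, p_multi))
```

The two agree because, for a diagonal \(\Sigma\), the determinant is the product of the \(\sigma_j^2\) and the quadratic form reduces to the sum of \((x_j-u_j)^2/\sigma_j^2\).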

The contours of the original model are always axis-aligned, i.e. it assumes no linear correlation between the features.

| Original model | Multivariate Gaussian |
| --- | --- |
| Manually create features to capture anomalies where \(x_1,x_2\) take unusual combinations of values, e.g. \(x_3=\frac{x_1}{x_2}=\frac{\text{CPU load}}{\text{memory}}\) | Automatically captures correlations between features |
| Computationally cheaper (alternatively, scales better to large \(n\)) | Computationally more expensive |
| OK even if \(m\) (training set size) is small | Must have \(m>n\), or else \(\Sigma\) is non-invertible (rule of thumb: \(m \geq 10n\)) |
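The \(m>n\) requirement above can be seen by checking the rank of the estimated covariance matrix. In this sketch the dimension `n = 5`, the sample sizes, and the random data are assumptions for illustration: with fewer examples than features the estimate cannot have full rank, so it has no inverse.

```python
import numpy as np

def sample_covariance(X):
    """Sigma = (1/m) * sum of (x - u)(x - u)^T over the rows of X."""
    D = X - X.mean(axis=0)
    return D.T @ D / X.shape[0]

rng = np.random.default_rng(0)
n = 5
X_small = rng.normal(size=(3, n))   # m = 3, fewer examples than features
X_large = rng.normal(size=(50, n))  # m = 50 = 10n

rank_small = np.linalg.matrix_rank(sample_covariance(X_small))
rank_large = np.linalg.matrix_rank(sample_covariance(X_large))

print(rank_small, rank_large)  # a rank below n means Sigma is singular
```

Redundant features (for example one feature that is an exact linear combination of others) would also make \(\Sigma\) singular regardless of \(m\); that case is not covered by this sketch.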

