Clustering -- THU Machine Learning 2020
What can we do with unlabeled data?
- Data clustering
  - Partition examples into groups when no pre-defined categories/classes are available
- Dimensionality reduction
  - Reduce the number of variables under consideration
- Outlier detection
  - Identify new or unknown data or signals that the machine learning system was not exposed to during training
- Modeling the data density
What is clustering?
- “Birds of a feather flock together.”
- Small intra-cluster distance
- Large inter-cluster distance
- Soft clustering vs. hard clustering
  - Soft: the same object can belong to multiple clusters
  - Hard: each object belongs to exactly one cluster
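As a rough illustration of the hard/soft distinction, here is a minimal sketch (not from the original notes) assuming scikit-learn and a made-up two-blob dataset: K-means gives each point a single label, while a Gaussian mixture gives each point a membership probability per cluster.

```python
# Hard vs. soft clustering on a toy 2-D dataset (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # blob around (0, 0)
               rng.normal(5, 1, (50, 2))])   # blob around (5, 5)

# Hard clustering: each point gets exactly one label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability of belonging to each cluster.
soft_probs = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard_labels[:5])           # e.g. [0 0 0 0 0]
print(soft_probs[:2].round(3))   # e.g. [[1. 0.], [0.999 0.001]]
```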
Hierarchical clustering
Agglomerative hierarchical clustering (bottom-up)
Agglomerative algorithm: start with every example in its own cluster, then repeatedly merge the two most similar clusters until a single cluster (or the desired number of clusters) remains.
Cluster similarity (linkage): single link (minimum distance between members of the two clusters), complete link (maximum distance), or average link (average pairwise distance).
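A small sketch of how the linkage choice changes the result (assuming scikit-learn and a made-up two-blob dataset; illustrative only, not the course's code):

```python
# Agglomerative clustering under different cluster-similarity (linkage) criteria.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),   # blob around (0, 0)
               rng.normal(4, 0.5, (30, 2))])  # blob around (4, 4)

for linkage in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(linkage, np.bincount(labels))  # cluster sizes under each criterion
```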
Divisive hierarchical clustering (top-down): start with all examples in one cluster and recursively split clusters.
Discussion on hierarchical clustering: there is no need to fix the number of clusters in advance (the dendrogram can be cut at any level), but each merge or split is greedy and cannot be undone, and computing pairwise distances makes the method expensive for large datasets.
K-means
Steps: choose K initial cluster centers, then alternate between the two steps below until the assignments no longer change.
(Step 1) Assign each example to the cluster whose center is nearest.
(Step 2) Recompute each cluster center as the mean of the examples assigned to it.
K-means is guaranteed to converge, but not necessarily to the globally optimal solution.
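A minimal NumPy sketch of the loop described above (illustrative only; the function name and the random initialization scheme are my own choices, not the course's reference implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # pick K examples as initial centers
    for _ in range(n_iters):
        # Step 1: assign each example to its nearest center.
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1).argmin(axis=1)
        # Step 2: move each center to the mean of its assigned examples.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # centers no longer change -> converged
            break
        centers = new_centers
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1).argmin(axis=1)
    return labels, centers

# Example usage on a made-up dataset:
# labels, centers = kmeans(np.random.default_rng(1).normal(size=(100, 2)), k=3)
```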
How can we decide K?
Discussion on K-means: it is simple and fast, but K must be chosen in advance, the result depends on the initialization, the cluster means are sensitive to outliers, and the method favors roughly spherical clusters.
K-medoid clustering
Unlike k-means, the "center" in k-medoid clustering must be an actual data point rather than a virtual centroid. This representative point is the one in the cluster whose total distance to the other points in the cluster is smallest.
The basic strategy:
- first arbitrarily find a representative object (medoid) for each cluster
- Iteration:
- Assign each remaining object to the medoid to which it is most similar
- Replace one of the medoids with a non-medoid object whenever the swap improves the quality of the resulting clustering (quality is measured by a cost function: the average dissimilarity between each object and its medoid)
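A rough sketch of this swap-based strategy on a made-up dataset (the function name and details are assumptions, not the course's reference code):

```python
import numpy as np

def k_medoid(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)         # arbitrary initial medoids
    cost = D[:, medoids].min(axis=1).mean()  # average dissimilarity(object, its medoid)
    for _ in range(max_iters):
        improved = False
        # Try swapping each medoid with each non-medoid; keep any swap that lowers the cost.
        for j in range(k):
            for cand in range(len(X)):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[j] = cand
                trial_cost = D[:, trial].min(axis=1).mean()
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)  # cluster each object with its most similar medoid
    return labels, medoids

# Example usage: labels, medoids = k_medoid(np.random.default_rng(1).normal(size=(60, 2)), k=2)
```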