Grid Search



As a running example, we classify the digits dataset with KNN:

Finding the best hyperparameters with plain Python loops

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
digits = datasets.load_digits() 
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2) 
from sklearn.neighbors import KNeighborsClassifier

Using k as the hyperparameter

best_score = 0.0
best_k = -1
for k in range(1,11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train) 
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
print(best_k, best_score)
# 1 0.9861111111111112

# If the best value lies on the boundary of the search range (for example 10), extend the search to values above 10.
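The boundary rule above can be automated. The following is only a minimal sketch of one way to do it, not part of the original notes: `search_with_extension` and `toy_score` are hypothetical names, and `toy_score` stands in for fitting a KNN model and calling `score`. The helper searches a range of k and widens the range whenever the best k is the last value tried.

```python
def search_with_extension(score_fn, start=1, stop=11, step=10, max_stop=101):
    """Search k in range(start, stop); widen the range while the best k is the last value tried.

    score_fn is any callable that maps k to a score (higher is better).
    """
    best_k, best_score = -1, float("-inf")
    lo = start
    while True:
        for k in range(lo, stop):
            score = score_fn(k)
            if score > best_score:
                best_k, best_score = k, score
        # Stop once the best k is strictly inside the range, or the cap is reached.
        if best_k < stop - 1 or stop >= max_stop:
            return best_k, best_score
        lo, stop = stop, min(stop + step, max_stop)


# Hypothetical score that peaks at k = 14 (assumption, for illustration only).
def toy_score(k):
    return 1.0 - abs(k - 14) / 100.0


best_k, best_score = search_with_extension(toy_score)
print(best_k, best_score)
```

With the real data, `score_fn` could be a function that fits `KNeighborsClassifier(n_neighbors=k)` on the training set and returns its score on the test set.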

Adding a hyperparameter: weights (distance weighting)

best_score = 0.0
best_k = -1
best_method = ""

for method in ['uniform', 'distance']:

    for k in range(1,11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train) 
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_method = method
print(best_k, best_method, best_score)
# 1 uniform 0.9861111111111112

Adding a hyperparameter: the distance norm p

p defaults to 2, which corresponds to the Euclidean distance.
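To make the role of p concrete, here is a small sketch of the Minkowski distance written in plain Python. The function name `minkowski_distance` is an assumption made for this illustration; it is not how sklearn computes distances internally.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
d1 = minkowski_distance(a, b, 1)  # p = 1 is the Manhattan distance: |3| + |4|
d2 = minkowski_distance(a, b, 2)  # p = 2 is the Euclidean distance: sqrt(3^2 + 4^2)
print(d1, d2)
```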

%%time
# Computing Minkowski distances involves powers and roots, which is slow, so this cell is timed.
best_score = 0.0
best_k = -1
best_p = -1

for p in range(1, 6):

    for k in range(1,11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train) 
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_p = p
print(best_k, best_p, best_score)
# 1 2 0.9861111111111112
# CPU times: user 14.8 s, sys: 46.7 ms, total: 14.9 s
# Wall time: 14.9 s

The search procedure above is known as grid search.
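The nested loops generalize to any number of hyperparameters by enumerating the Cartesian product of their candidate values. The sketch below assumes a hypothetical `grid_search` helper and a hypothetical `toy_score` function in place of a real model; it only illustrates the enumeration, not sklearn's implementation.

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Evaluate score_fn(**params) for every combination in param_grid and return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical score with its maximum at n_neighbors = 3, p = 2 (assumption).
def toy_score(n_neighbors, p):
    return -((n_neighbors - 3) ** 2) - (p - 2) ** 2


best_params, best_score = grid_search(
    toy_score, {"n_neighbors": range(1, 11), "p": range(1, 6)}
)
print(best_params, best_score)
```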

Grid search with sklearn

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
digits = datasets.load_digits() 
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2) 
from sklearn.neighbors import KNeighborsClassifier
# Define the parameters to search

param_grid = [{'weights': ['uniform'],
               'n_neighbors': [i for i in range(1,11)]},
              {'weights': ['distance'],
               'n_neighbors': [i for i in range(1,11)],
               'p': [i for i in range(1,6)]}]
knn_clf = KNeighborsClassifier()

from sklearn.model_selection import GridSearchCV
# CV stands for cross-validation.

grid_search = GridSearchCV(knn_clf, param_grid)
# This step takes a while.
grid_search.fit(X_train, y_train)

# CPU times: user 43.3 s, sys: 93.2 ms, total: 43.4 s
# Wall time: 43.5 s

GridSearchCV(cv='warn', error_score='raise-deprecating',
                 estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                                metric='minkowski',
                                                metric_params=None, n_jobs=None,
                                                n_neighbors=5, p=2,
                                                weights='uniform'),
                 iid='warn', n_jobs=None,
                 param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'weights': ['uniform']},
                             {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=0)
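GridSearchCV scores every candidate by cross-validation on the training data rather than on the test set. The sketch below shows the basic idea of splitting sample indices into folds. `k_fold_indices` is a hypothetical helper written for this illustration; it uses plain contiguous folds, whereas sklearn stratifies the folds by class for classifiers.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds contiguous folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each candidate is fitted once per fold on the training indices and scored on the validation indices; the mean of those scores is what the search compares.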

grid_search.best_estimator_  # the best classifier and its parameters
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')

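grid_search.best_score_  # best cross-validated accuracy
# 0.9846903270702854
grid_search.best_params_  # best parameters
# {'n_neighbors': 1, 'weights': 'uniform'}
# All of the attributes above end with an underscore. This follows a naming convention: values that the estimator computes itself, rather than values passed in by the user, get a trailing underscore.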

# Assign the best model to knn_clf
knn_clf = grid_search.best_estimator_
knn_clf.predict(X_test)
# array([4, 0, 9, 1, 8, 7, 1, 5, 1, 6, 6, 7, 6, 1, 5, 5, 7, 6, 2, 7, 4, 6, 1, 5, 2, 9, 5, 4, 6, 5, 6, 3, 4, 0, 9, 9, 8, 4, 6, 8, 8, 5, 7, ... 5, 7, 8, 0, 4, 1, 4, 5])
knn_clf.score(X_test, y_test)
# 0.9861111111111112


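# The search above can run in parallel. n_jobs sets how many CPU cores are used: by default a single core, and -1 means all available cores.
# verbose makes the search print progress, which helps when tracking a long-running search. Pass an integer; the larger it is, the more detailed the output.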
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

    Fitting 3 folds for each of 60 candidates, totalling 180 fits
    ~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
      warnings.warn(CV_WARNING, FutureWarning)
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
    [Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.3s
    [Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:    8.9s
    [Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   11.1s finished
    GridSearchCV(cv='warn', error_score='raise-deprecating',
                 estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                                metric='minkowski',
                                                metric_params=None, n_jobs=None,
                                                n_neighbors=1, p=2,
                                                weights='uniform'),
                 iid='warn', n_jobs=-1,
                 param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'weights': ['uniform']},
                             {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=2)
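The first line of the log ("3 folds for each of 60 candidates, totalling 180 fits") follows directly from the parameter grid. The sketch below recomputes those numbers; `count_candidates` is a hypothetical helper, and the fold count of 3 is taken from the FutureWarning in the log above.

```python
param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]


def count_candidates(grids):
    """Count the parameter combinations across a list of grids."""
    total = 0
    for grid in grids:
        combos = 1
        for values in grid.values():
            combos *= len(values)
        total += combos
    return total


n_candidates = count_candidates(param_grid)  # 10 uniform + 10 * 5 distance
n_folds = 3  # default cv in the sklearn version that produced the log
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```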



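KNeighborsClassifier uses the Minkowski distance by default, with p = 2 (Euclidean distance); a different distance can be selected through the metric parameter.

The sklearn documentation lists the available distance metrics: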

Metrics intended for real-valued vector spaces:

identifier      class name            args      distance function
"euclidean"     EuclideanDistance     -         sqrt(sum((x - y)^2))
"manhattan"     ManhattanDistance     -         sum(|x - y|)
"chebyshev"     ChebyshevDistance     -         max(|x - y|)
"minkowski"     MinkowskiDistance     p         sum(|x - y|^p)^(1/p)
"wminkowski"    WMinkowskiDistance    p, w      sum(|w * (x - y)|^p)^(1/p)
"seuclidean"    SEuclideanDistance    V         sqrt(sum((x - y)^2 / V))
"mahalanobis"   MahalanobisDistance   V or VI   sqrt((x - y)' V^-1 (x - y))
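As a rough illustration of two formulas from this table, the sketch below implements the Chebyshev and standardized Euclidean distances in plain Python. The function names are assumptions for this example and are not sklearn's API; `V` is the per-dimension variance vector from the args column.

```python
import math


def chebyshev(x, y):
    """max(|x - y|): the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def seuclidean(x, y, V):
    """sqrt(sum((x - y)^2 / V)): Euclidean distance with each dimension scaled by V."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, V)))


x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
d_cheb = chebyshev(x, y)                    # max(3, 4, 0)
d_seu = seuclidean(x, y, [9.0, 16.0, 1.0])  # sqrt(9/9 + 16/16 + 0/1)
print(d_cheb, d_seu)
```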

Metrics intended for two-dimensional vector spaces: Note that the haversine distance metric requires data in the form of [latitude, longitude] and both inputs and outputs are in units of radians.

identifier    class name          distance function
"haversine"   HaversineDistance   2 arcsin(sqrt(sin^2(0.5*dx) + cos(x1)cos(x2)sin^2(0.5*dy)))
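The haversine formula is easier to follow when written out. The sketch below is a plain-Python version of the formula in the table, with inputs and output in radians as the note requires. The name `haversine` and the Earth radius used to convert the angle to kilometres are assumptions for this example.

```python
import math


def haversine(point1, point2):
    """Great-circle angle in radians between two [latitude, longitude] points given in radians."""
    lat1, lon1 = point1
    lat2, lon2 = point2
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(0.5 * dlat) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(0.5 * dlon) ** 2)
    return 2 * math.asin(math.sqrt(h))


# Two points on the equator, a quarter of a great circle apart.
angle = haversine([0.0, 0.0], [0.0, math.pi / 2])
earth_radius_km = 6371.0  # assumed mean Earth radius, used only to convert to km
print(angle, angle * earth_radius_km)
```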

Metrics intended for integer-valued vector spaces: Though intended for integer-valued vectors, these are also valid metrics in the case of real-valued vectors.

identifier     class name           distance function
"hamming"      HammingDistance      N_unequal(x, y) / N_tot
"canberra"     CanberraDistance     sum(|x - y| / (|x| + |y|))
"braycurtis"   BrayCurtisDistance   sum(|x - y|) / (sum(|x|) + sum(|y|))
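The three formulas above translate almost directly into code. This is a sketch with assumed function names, not sklearn's implementation; skipping 0/0 terms in the Canberra sum is also an assumption made here to avoid division by zero.

```python
def hamming(x, y):
    """N_unequal(x, y) / N_tot: fraction of positions that differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def canberra(x, y):
    """sum(|x - y| / (|x| + |y|)); terms where both values are zero are skipped (assumption)."""
    return sum(
        abs(a - b) / (abs(a) + abs(b))
        for a, b in zip(x, y)
        if abs(a) + abs(b) > 0
    )


def braycurtis(x, y):
    """sum(|x - y|) / (sum(|x|) + sum(|y|))."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    return numerator / (sum(abs(a) for a in x) + sum(abs(b) for b in y))


x, y = [1, 0, 2, 3], [1, 1, 0, 3]
d_ham = hamming(x, y)      # 2 of 4 positions differ
d_can = canberra(x, y)     # 0/2 + 1/1 + 2/2 + 0/6
d_bc = braycurtis(x, y)    # 3 / (6 + 5)
print(d_ham, d_can, d_bc)
```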

Metrics intended for boolean-valued vector spaces: Any nonzero entry is evaluated to “True”. In the listings below, the following abbreviations are used:

  • N : number of dimensions
  • NTT : number of dims in which both values are True
  • NTF : number of dims in which the first value is True, second is False
  • NFT : number of dims in which the first value is False, second is True
  • NFF : number of dims in which both values are False
  • NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT
  • NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT

identifier         class name               distance function
"jaccard"          JaccardDistance          NNEQ / NNZ
"matching"         MatchingDistance         NNEQ / N
"dice"             DiceDistance             NNEQ / (NTT + NNZ)
"kulsinski"        KulsinskiDistance        (NNEQ + N - NTT) / (NNEQ + N)
"rogerstanimoto"   RogersTanimotoDistance   2 * NNEQ / (N + NNEQ)
"russellrao"       RussellRaoDistance       NNZ / N
"sokalmichener"    SokalMichenerDistance    2 * NNEQ / (N + NNEQ)
"sokalsneath"      SokalSneathDistance      NNEQ / (NNEQ + 0.5 * NTT)
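The abbreviations above can be computed with a few counts, after which each boolean metric is a one-line formula. The sketch below uses hypothetical helper names and covers only three of the metrics; it follows the table's formulas and is not sklearn's implementation.

```python
def boolean_counts(x, y):
    """Return (NTT, NTF, NFT, NFF); any nonzero entry counts as True."""
    xb = [bool(v) for v in x]
    yb = [bool(v) for v in y]
    ntt = sum(a and b for a, b in zip(xb, yb))
    ntf = sum(a and not b for a, b in zip(xb, yb))
    nft = sum(not a and b for a, b in zip(xb, yb))
    nff = sum(not a and not b for a, b in zip(xb, yb))
    return ntt, ntf, nft, nff


def jaccard(x, y):
    ntt, ntf, nft, _ = boolean_counts(x, y)
    nneq = ntf + nft
    return nneq / (nneq + ntt)  # NNEQ / NNZ; undefined when NNZ == 0


def matching(x, y):
    _, ntf, nft, _ = boolean_counts(x, y)
    return (ntf + nft) / len(x)  # NNEQ / N


def dice(x, y):
    ntt, ntf, nft, _ = boolean_counts(x, y)
    nneq = ntf + nft
    return nneq / (ntt + nneq + ntt)  # NNEQ / (NTT + NNZ)


x, y = [1, 0, 1, 1, 0], [1, 1, 0, 1, 0]
counts = boolean_counts(x, y)
d_jac, d_mat, d_dice = jaccard(x, y), matching(x, y), dice(x, y)
print(counts, d_jac, d_mat, d_dice)
```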

User-defined distance:

identifier   class name       args
"pyfunc"     PyFuncDistance   func

