
Grid Search

2021-02-02 07:33:32 Author: Internet

Finding the best hyperparameters with plain Python loops
Grid search in sklearn
- Improving efficiency
About distance metrics

The examples below use KNN to classify the digits dataset:

Finding the best hyperparameters with plain Python loops

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
 
digits = datasets.load_digits() 
 
X = digits.data
y = digits.target
 
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2) 
 
from sklearn.neighbors import KNeighborsClassifier

Using k as the hyperparameter


best_score = 0.0
best_k = -1
for k in range(1,11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train) 
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
    
print(best_k, best_score)
# 1 0.9861111111111112

# If the best value lies on the boundary of the search range (e.g. 10), it is worth extending the search to values above 10.
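The boundary rule above can be automated. The following is a minimal sketch, not part of the original code: the helper `search_k`, the window size `step`, and the cap `max_k` are hypothetical names and values chosen for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=2)


def search_k(k_values):
    """Return (best_k, best_score) over the given candidate k values."""
    best_k, best_score = -1, 0.0
    for k in k_values:
        knn_clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


low, high = 1, 10
step, max_k = 10, 50  # hypothetical window size and safety cap
best_k, best_score = search_k(range(low, high + 1))

# While the winner sits on the upper boundary, slide the window upwards.
# The new window starts at the old winner, so it stays in the comparison.
while best_k == high and high < max_k:
    low, high = high, high + step
    best_k, best_score = search_k(range(low, high + 1))

print(best_k, best_score)
```

If the best k is found inside the first window, the loop body never runs and the result matches the simple search.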

Adding the weights hyperparameter (distance weighting)


best_score = 0.0
best_k = -1
best_method = ""

for method in ['uniform', 'distance']:

    for k in range(1,11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train) 
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_method = method
    
print(best_k, best_method, best_score)
# 1 uniform 0.9861111111111112

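Adding the distance-norm hyperparameter p

p is the exponent of the Minkowski distance. It defaults to 2, which gives the Euclidean distance.

%%time
# Computing Minkowski distances involves powers and roots and is slow, so time this cell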
best_score = 0.0
best_k = -1
best_p = -1

for p in range(1, 6):

    for k in range(1,11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train) 
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_p = p
    
print(best_k, best_p, best_score)
'''
1 2 0.9861111111111112
    CPU times: user 14.8 s, sys: 46.7 ms, total: 14.9 s
    Wall time: 14.9 s
'''

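The search procedure above, which tries every combination of candidate hyperparameter values, is known as grid search.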
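The nested loops can also be written generically by enumerating the Cartesian product of the candidate values. Below is a minimal sketch under stated assumptions: `manual_grid_search` is a hypothetical helper (not an sklearn function), and the grid is deliberately smaller than the one used above so that it runs quickly.

```python
from itertools import product

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def manual_grid_search(grid, X_train, y_train, X_test, y_test):
    """Try every combination in `grid` and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, 0.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        clf = KNeighborsClassifier(**params).fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        if score > best_score:  # strict '>' keeps the first of tied candidates
            best_params, best_score = params, score
    return best_params, best_score


digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=2)

# Illustrative grid: 2 * 5 * 2 = 20 combinations
grid = {'weights': ['uniform', 'distance'],
        'n_neighbors': list(range(1, 6)),
        'p': [1, 2]}
best_params, best_score = manual_grid_search(grid, X_train, y_train, X_test, y_test)
print(best_params, best_score)
```

Like the loops above, this sketch picks the winner by its score on the test set, so the reported score is likely optimistic. GridSearchCV selects on cross-validation folds of the training data instead.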

Grid search in sklearn

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
 
digits = datasets.load_digits() 
 
X = digits.data
y = digits.target
 
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2) 
 
from sklearn.neighbors import KNeighborsClassifier

# Define the parameter grid to search

param_grid = [{'weights': ['uniform'],
               'n_neighbors': [i for i in range(1,11)]
              },
              
              {'weights': ['distance'],
               'n_neighbors': [i for i in range(1,11)],
               'p': [i for i in range(1,6)]
              }]
 
knn_clf = KNeighborsClassifier()

from sklearn.model_selection import GridSearchCV
# CV stands for Cross Validation.

grid_search = GridSearchCV(knn_clf, param_grid)

%%time 
# This step is slow
grid_search.fit(X_train, y_train)

# CPU times: user 43.3 s, sys: 93.2 ms, total: 43.4 s
# Wall time: 43.5 s

'''
GridSearchCV(cv='warn', error_score='raise-deprecating',
                 estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                                metric='minkowski',
                                                metric_params=None, n_jobs=None,
                                                n_neighbors=5, p=2,
                                                weights='uniform'),
                 iid='warn', n_jobs=None,
                 param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'weights': ['uniform']},
                             {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=0) 
'''

grid_search.best_estimator_  # The best classifier, refit with the best parameters found
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')

 
# Best accuracy (mean cross-validated score of the best parameters)
grid_search.best_score_
# 0.9846903270702854
 
# Best parameters
grid_search.best_params_  
# {'n_neighbors': 1, 'weights': 'uniform'}
 
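# The attributes above all end with an underscore. This follows a naming convention: values computed by the class itself, rather than passed in by the user, are named with a trailing underscore.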

# Assign the best model to knn_clf
knn_clf = grid_search.best_estimator_
 
knn_clf.predict(X_test) 
'''
array([4, 0, 9, 1, 8, 7, 1, 5, 1, 6, 6, 7, 6, 1, 5, 5, 7, 6, 2, 7, 4, 6, 1, 5, 2, 9, 5, 4, 6, 5, 6, 3, 4, 0, 9, 9, 8, 4, 6, 8, 8, 5, 7, ... 5, 7, 8, 0, 4, 1, 4, 5])
'''
 
knn_clf.score(X_test, y_test)
# 0.9861111111111112
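Note that best_score_ (0.9847) and the test-set score (0.9861) measure different things: best_score_ is the cross-validated accuracy of the best parameters on the training data, while score(X_test, y_test) evaluates the refit model on held-out data. The sketch below checks the first point by recomputing the mean with cross_val_score. It assumes cv=3 (the default in the sklearn version used above; newer versions default to 5), a reduced grid to keep it fast, and a recent sklearn where best_score_ is the plain mean over folds.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=2)

# Reduced, illustrative grid
small_grid = {'n_neighbors': [1, 3, 5], 'weights': ['uniform', 'distance']}
grid_search = GridSearchCV(KNeighborsClassifier(), small_grid, cv=3)
grid_search.fit(X_train, y_train)

# Recompute the mean CV accuracy of the winning parameters by hand
best_clf = KNeighborsClassifier(**grid_search.best_params_)
cv_scores = cross_val_score(best_clf, X_train, y_train, cv=3)

print(grid_search.best_score_, cv_scores.mean())  # cross-validated accuracy
print(grid_search.score(X_test, y_test))          # accuracy on the held-out test set
```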

Improving efficiency


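# The search above can run in parallel. n_jobs sets how many CPU cores are used: the default (None) means a single core, and -1 means all available cores.
# verbose makes the search print progress messages, which helps to track the state of a long search. It takes an integer; the larger the value, the more detailed the output.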
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
 
grid_search.fit(X_train, y_train)

''' 
    Fitting 3 folds for each of 60 candidates, totalling 180 fits
 
   ~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
      warnings.warn(CV_WARNING, FutureWarning)
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
    [Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.3s
    [Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:    8.9s
    [Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   11.1s finished 
    
    GridSearchCV(cv='warn', error_score='raise-deprecating',
                 estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                                metric='minkowski',
                                                metric_params=None, n_jobs=None,
                                                n_neighbors=1, p=2,
                                                weights='uniform'),
                 iid='warn', n_jobs=-1,
                 param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'weights': ['uniform']},
                             {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                              'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=2)
'''
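The "60 candidates" in the log follows from the grid: 10 combinations from the first dict plus 10 × 5 = 50 from the second, each fitted on 3 folds, for 180 fits in total. A quick check using sklearn's ParameterGrid, with the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [{'weights': ['uniform'],
               'n_neighbors': list(range(1, 11))},
              {'weights': ['distance'],
               'n_neighbors': list(range(1, 11)),
               'p': list(range(1, 6))}]

n_candidates = len(ParameterGrid(param_grid))
n_folds = 3  # the number of folds reported in the log above
print(n_candidates, n_candidates * n_folds)  # 60 180
```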

About distance metrics

Distances in machine learning

KNeighborsClassifier uses the Minkowski distance by default, with p = 2 (which is the Euclidean distance). The metric parameter selects a different distance.
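As a small illustration of how p and metric relate, the sketch below computes the Minkowski distance by hand for p = 1 and p = 2, then compares a classifier built with metric='manhattan' against one built with p=1. The helper `minkowski` and the choice n_neighbors=3 are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def minkowski(x, y, p):
    """Minkowski distance: sum(|x - y|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)


x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))  # Manhattan distance: 7.0
print(minkowski(x, y, 2))  # Euclidean distance: 5.0

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=2)

# metric='manhattan' should behave like the default Minkowski metric with p=1
by_metric = KNeighborsClassifier(n_neighbors=3, metric='manhattan').fit(X_train, y_train)
by_p = KNeighborsClassifier(n_neighbors=3, p=1).fit(X_train, y_train)
same = np.array_equal(by_metric.predict(X_test), by_p.predict(X_test))
print(same)
```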

The sklearn documentation lists the available distance metrics:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

Metrics intended for real-valued vector spaces:

identifier	class name	args	distance function
“euclidean”	EuclideanDistance		`sqrt(sum((x - y)^2))`
“manhattan”	ManhattanDistance		`sum(|x - y|)`
“chebyshev”	ChebyshevDistance		`max(|x - y|)`
“minkowski”	MinkowskiDistance	p	`sum(|x - y|^p)^(1/p)`
“wminkowski”	WMinkowskiDistance	p, w	`sum(|w * (x - y)|^p)^(1/p)`
“seuclidean”	SEuclideanDistance	V	`sqrt(sum((x - y)^2 / V))`
“mahalanobis”	MahalanobisDistance	V or VI	`sqrt((x - y)' V^-1 (x - y))`

Metrics intended for two-dimensional vector spaces: Note that the haversine distance metric requires data in the form of [latitude, longitude] and both inputs and outputs are in units of radians.

identifier	class name	distance function
“haversine”	HaversineDistance	`2 arcsin(sqrt(sin^2(0.5dx) + cos(x1)cos(x2)sin^2(0.5dy)))`
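The haversine formula above can be evaluated directly with NumPy. This is a minimal sketch that assumes points are given as [latitude, longitude] in radians, as the note above requires; the function name `haversine` and the sample points are illustrative.

```python
import numpy as np


def haversine(a, b):
    """Great-circle distance in radians between two [latitude, longitude] points in radians."""
    lat1, lon1 = a
    lat2, lon2 = b
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = np.sin(0.5 * dlat) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(0.5 * dlon) ** 2
    return 2 * np.arcsin(np.sqrt(h))


# Two points on the equator, a quarter turn apart in longitude
a = np.array([0.0, 0.0])
b = np.array([0.0, np.pi / 2])
d = haversine(a, b)
print(d, np.pi / 2)  # both approximately 1.5708

earth_radius_km = 6371.0  # mean Earth radius, used only to convert radians to km
print(d * earth_radius_km)
```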

Metrics intended for integer-valued vector spaces: Though intended for integer-valued vectors, these are also valid metrics in the case of real-valued vectors.

identifier	class name	distance function
“hamming”	HammingDistance	`N_unequal(x, y) / N_tot`
“canberra”	CanberraDistance	`sum(|x - y| / (|x| + |y|))`
“braycurtis”	BrayCurtisDistance	`sum(|x - y|) / (sum(|x|) + sum(|y|))`

Metrics intended for boolean-valued vector spaces: Any nonzero entry is evaluated to “True”. In the listings below, the following abbreviations are used:

N : number of dimensions

NTT : number of dims in which both values are True

NTF : number of dims in which the first value is True, second is False

NFT : number of dims in which the first value is False, second is True

NFF : number of dims in which both values are False

NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT

NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT

identifier	class name	distance function
“jaccard”	JaccardDistance	NNEQ / NNZ
“matching”	MatchingDistance	NNEQ / N
“dice”	DiceDistance	NNEQ / (NTT + NNZ)
“kulsinski”	KulsinskiDistance	(NNEQ + N - NTT) / (NNEQ + N)
“rogerstanimoto”	RogersTanimotoDistance	2 * NNEQ / (N + NNEQ)
“russellrao”	RussellRaoDistance	NNZ / N
“sokalmichener”	SokalMichenerDistance	2 * NNEQ / (N + NNEQ)
“sokalsneath”	SokalSneathDistance	NNEQ / (NNEQ + 0.5 * NTT)
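To make the abbreviations concrete, the sketch below counts NTT, NTF, NFT and NFF for two illustrative boolean vectors and evaluates three of the formulas from the table. The vectors `u` and `v` are made up for this example.

```python
import numpy as np

# Two illustrative boolean vectors (any nonzero entry counts as True)
u = np.array([1, 1, 0, 0, 1], dtype=bool)
v = np.array([1, 0, 1, 0, 1], dtype=bool)

N = u.size
NTT = int(np.sum(u & v))    # both True
NTF = int(np.sum(u & ~v))   # first True, second False
NFT = int(np.sum(~u & v))   # first False, second True
NFF = int(np.sum(~u & ~v))  # both False
NNEQ = NTF + NFT
NNZ = NTF + NFT + NTT

jaccard = NNEQ / NNZ
matching = NNEQ / N
dice = NNEQ / (NTT + NNZ)
print(NTT, NTF, NFT, NFF)       # 2 1 1 1
print(jaccard, matching, dice)  # 0.5 0.4 0.3333333333333333
```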

User-defined distance:

identifier	class name	args
“pyfunc”	PyFuncDistance	func

Source: https://www.cnblogs.com/devwalks/p/14360138.html