
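Notes on "Web安全之机器学习入门" (Introduction to Machine Learning for Web Security): Chapter 7, Section 7.3, Detecting WebShells with Naive Bayes (Part 1)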

2022-01-30 23:58:31 Author: Internet

1. Source code changes

(1) Error 1

UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 8: illegal multibyte sequence

Load ../data/PHP-WEBSHELL/xiaoma/1148d726e3bdec6db65db30c08a75f80.php
Traceback (most recent call last):
......
  t=load_file(file_path)
  for line in f:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 8: illegal multibyte sequence

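When no encoding is given, open() uses the platform default, which is GBK on this Windows machine. Specify UTF-8 explicitly by changing the code to: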

def load_file(file_path):
    t=""
    with open(file_path,encoding='utf-8') as f:
        for line in f:
            line=line.strip('\n')
            t+=line
    return t

(2) Error 2:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 15: invalid start byte

Load ../data/PHP-WEBSHELL/xiaoma/6b2548e859dd00dbf9e11487597b2c06.php
Traceback (most recent call last): 
    t=load_file(file_path)
    for line in f:
  File "C:\ProgramData\Anaconda3\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 15: invalid start byte

If this error appears, the file is not valid UTF-8. Open it in an editor and re-save it with UTF-8 encoding.
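Re-saving files by hand does not scale well to a larger sample set. As an alternative, here is a minimal sketch of a loader that tries several encodings in turn. The helper name load_file_with_fallback and the encoding order (UTF-8, then GBK) are assumptions made for illustration and are not part of the original code.

```python
def load_file_with_fallback(file_path, encodings=("utf-8", "gbk")):
    """Read a file as text, trying each encoding in turn (hypothetical helper)."""
    with open(file_path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # Last resort: drop the bytes that cannot be decoded as UTF-8.
        text = raw.decode("utf-8", errors="ignore")
    # Join the lines without newlines, similar to the original load_file.
    return "".join(text.splitlines())
```

Dropping undecodable bytes loses a little information, but the vectorizer in this post is already configured with decode_error="ignore", so the trade-off is in the same spirit.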

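2. Dataset preparation: obtaining black and white samples

The dataset used in this section consists of black samples collected from the Internet, that is, a collection of various large webshells (大马) and small webshells (小马).

The small-webshell directory contains 54 webshell files with the .php extension.

Opening one of the files shows that its content is a one-line webshell (一句话木马).

The samples should include both black samples and white samples. Detection here is based on the text features of WebShells. As mentioned above, the black samples are WebShells collected from the Internet; the white samples are taken from the latest WordPress source code at the time of writing.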

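3. Sample vectorization

In this post the black and white samples are files with the .php extension, and each one has to be converted into a vector. Each PHP file is treated as a single string and split into word-level 2-grams. Traversing the webshell files yields a 2-gram vocabulary, and every PHP file is then vectorized against that vocabulary.

The approach to WebShell detection is as follows: tokenize the PHP webshell files into words (regex r'\b\w+\b'), apply the 2-gram algorithm to obtain the vocabulary, and compute how each file is distributed over that vocabulary to get its feature vector. The normal PHP files are then vectorized over the same vocabulary in the same way.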

(1) What are N-gram and 2-gram

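N-gram is an important language model in NLP. Its basic idea is to slide a window of size N over the text, producing a sequence of fragments of length N. The window can run over bytes or over words; in this post, an n-gram is a sequence of n consecutive words. N=1 is called a unigram, N=2 a bigram, N=3 a trigram, and so on.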
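The sliding window described above can be sketched in a few lines of plain Python. The function name word_ngrams and the sample string are made up for illustration; the tokenizer uses the same regex as the post and lowercases the text, which is also what CountVectorizer does by default.

```python
import re


def word_ngrams(text, n):
    """Return the word n-grams of text, using a sliding window of size n."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "<?php eval($_POST['cmd']); ?>"  # illustrative one-line webshell
print(word_ngrams(sample, 1))  # ['php', 'eval', '_post', 'cmd']
print(word_ngrams(sample, 2))  # ['php eval', 'eval _post', '_post cmd']
```

Note that \w matches the underscore, so $_POST becomes the token _post.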

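The model rests on the assumption that the N-th word depends only on the preceding N-1 words and on no other words, so the probability of a whole sentence is the product of the conditional probabilities of its words. These probabilities can be estimated directly by counting how often N words occur together in the corpus. The bigram (N=2) and trigram (N=3) models are the most commonly used.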
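Written out, this assumption gives the standard approximation below. The symbols are chosen here for illustration and do not come from the original post: w_i is the i-th word, m is the sentence length, and C(.) is a count over the corpus.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})

% Bigram case (N = 2), estimated from corpus counts:
P(w_i \mid w_{i-1}) \approx \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```

The code in this post does not compute these probabilities. It only uses the bigram counts of each file as features for the classifier.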

(2) Black samples

The code is as follows:

    webshell_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                        token_pattern = r'\b\w+\b',min_df=1)
    webshell_files_list=load_files("../data/PHP-WEBSHELL/xiaoma/")
    x1=webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
    print(len(x1), x1[0])
    y1=[1]*len(x1)

Print the features

print(webshell_bigram_vectorizer.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()

The result is as follows:

Print the vocabulary

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    print(vocabulary)

The content is as follows:

(3) White samples

The code is as follows:

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    wp_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                        token_pattern = r'\b\w+\b',min_df=1,vocabulary=vocabulary)
    wp_files_list=load_files("../data/wordpress/")
    x2=wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()
    print(len(x2), x2[0])
    y2=[0]*len(x2)
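To see what reusing the webshell vocabulary means for the white samples, here is a simplified pure-Python stand-in for the two vectorizers. It is a sketch for illustration, not how CountVectorizer is implemented; the helper names (bigrams, build_vocabulary, vectorize) and the three sample strings are made up.

```python
import re


def bigrams(text):
    """Word-level 2-grams, tokenized with the same regex as the post."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def build_vocabulary(docs):
    """Map each distinct bigram to a column index (sorted for a stable order)."""
    terms = sorted({bg for doc in docs for bg in bigrams(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def vectorize(doc, vocabulary):
    """Count the bigrams of doc that are in vocabulary; ignore all others."""
    vec = [0] * len(vocabulary)
    for bg in bigrams(doc):
        if bg in vocabulary:
            vec[vocabulary[bg]] += 1
    return vec


# Illustrative samples, not taken from the real dataset.
black_docs = ["<?php eval($_POST['cmd']); ?>", "<?php assert($_POST['cmd']); ?>"]
white_doc = "<?php echo get_option('cmd'); ?>"

vocabulary = build_vocabulary(black_docs)  # built from black samples only
print(vocabulary)
# {'_post cmd': 0, 'assert _post': 1, 'eval _post': 2, 'php assert': 3, 'php eval': 4}
print([vectorize(d, vocabulary) for d in black_docs])
# [[1, 0, 1, 0, 1], [1, 1, 0, 1, 0]]
print(vectorize(white_doc, vocabulary))
# [0, 0, 0, 0, 0]
```

Because the vocabulary comes only from the black samples, any bigram that appears solely in white files is dropped, and a white file that shares no bigram with the webshells becomes an all-zero vector.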

(4) Building the training set

The code is as follows:

    x=np.concatenate((x1,x2))
    y=np.concatenate((y1, y2))

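5. The complete code is as follows:

The runtime environment is Python 3. Below is the source code after the changes above, which runs correctly.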

import os
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB


def load_file(file_path):
    t=""
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line=line.strip('\n')
            t+=line
    return t


def load_files(path):
    files_list=[]
    for r, d, files in os.walk(path):
        for file in files:
            if file.endswith('.php'):
                file_path=os.path.join(r, file)  # r is the directory currently being walked
                #print("Load %s" % file_path)
                t=load_file(file_path)
                files_list.append(t)
    return  files_list



if __name__ == '__main__':

    webshell_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",token_pattern = r'\b\w+\b',min_df=1)
    webshell_files_list=load_files("../data/PHP-WEBSHELL/xiaoma/")
    x1=webshell_bigram_vectorizer.fit_transform(webshell_files_list).toarray()
    print(len(x1), x1[0])
    y1=[1]*len(x1)

    vocabulary=webshell_bigram_vectorizer.vocabulary_
    wp_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                        token_pattern = r'\b\w+\b',min_df=1,vocabulary=vocabulary)
    wp_files_list=load_files("../data/wordpress/")
    x2=wp_bigram_vectorizer.fit_transform(wp_files_list).toarray()
    print(len(x2), x2[0])
    y2=[0]*len(x2)
    x=np.concatenate((x1,x2))
    y=np.concatenate((y1, y2))

    clf = GaussianNB()
    # use 3-fold cross-validation
    scores = model_selection.cross_val_score(clf, x, y, n_jobs=1, cv=3)
    print(scores)
    print(scores.mean())

6. Results (3-fold cross-validation)

[0.71153846 0.88235294 0.74509804]
0.7796631473102061

7. 10-fold cross-validation results

The code is as follows:

    # use 10-fold cross-validation
    scores = model_selection.cross_val_score(clf, x, y, n_jobs=1, cv=10)
    print(scores)
    print(scores.mean())

The results are as follows:

[0.75       0.4375     0.625      0.6875     0.73333333 0.66666667
 0.73333333 0.53333333 0.46666667 0.53333333]
0.6166666666666666
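One reason the 10-fold scores swing so widely is that each test fold is small. The sketch below computes fold sizes for a plain k-fold split. The total of 154 samples is an assumption inferred from the scores above (for example 0.71153846 = 37/52 and 0.4375 = 7/16) and is not stated in the post. For a classifier, cross_val_score stratifies the folds by label, so the exact sizes may differ slightly.

```python
def fold_sizes(n_samples, n_splits):
    """Test-fold sizes when n_samples are split into n_splits folds."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 154  # assumption inferred from the fold scores, not from the post
for k in (3, 10):
    sizes = fold_sizes(n_samples, k)
    step = 1 / min(sizes)  # score change caused by one misclassified sample
    print(k, sizes, round(step, 3))
# 3 [52, 51, 51] 0.02
# 10 [16, 16, 16, 16, 15, 15, 15, 15, 15, 15] 0.067
```

With only 15 or 16 files per test fold, a single misclassified file moves that fold's score by more than six percentage points, so the per-fold numbers are much noisier than in the 3-fold run.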

Source: https://blog.csdn.net/mooyuan/article/details/122756613