

Machine Learning: The KNN Algorithm

The k-Nearest Neighbor (kNN) classification algorithm classifies samples by measuring the distance between feature vectors. Its guiding idea is "birds of a feather flock together": if the majority of the k samples most similar to a given sample in feature space belong to a certain class, then that sample belongs to the same class. K is usually an integer no larger than 20.
In kNN, the selected neighbors are all objects that have already been correctly classified; the method decides the class of the query sample based only on the classes of its one or few nearest samples. As shown in the figure below, which class should the green circle be assigned to, the red triangles or the blue squares? If K = 3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle class; if K = 5, blue squares make up 3/5, so the green circle is assigned to the blue-square class.

[Figure: a green circle (the query point) surrounded by red triangles and blue squares; the K = 3 and K = 5 neighborhoods yield different majority votes]

The idea behind kNN: with a training set whose samples and labels are known, compare the features of an input test sample against those of every training sample, find the K training samples most similar to it, and assign the test sample the class that appears most often among those K samples. The algorithm is described by the following steps (a minimal sketch in plain Python follows the list):
1) Compute the distance between the test sample and each training sample;
2) Sort the training samples by increasing distance;
3) Select the K points with the smallest distances;
4) Determine the frequency of each class among these K points;
5) Return the most frequent class among the K points as the predicted class of the test sample.
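As a minimal sketch of these five steps in plain Python (assuming list inputs; knn_predict is an illustrative name, not part of the scripts below, which use NumPy instead):

from collections import Counter

def knn_predict(test_point, train_points, train_labels, k):
    # step 1: Euclidean distance from the test point to every training point
    dists = [sum((a - b) ** 2 for a, b in zip(test_point, p)) ** 0.5
             for p in train_points]
    # steps 2-3: indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # steps 4-5: majority vote among the labels of the k nearest points
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

For example, knn_predict([1.2, 1.0], [[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]], ['A', 'A', 'B', 'B'], 3) returns 'A'.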

The kNN computation proceeds as follows:
1) Compute distances: given a test object, compute its distance to every object in the training set (the usual choice is the Euclidean distance, given below);
2) Find neighbors: take the k training objects with the smallest distances as the test object's nearest neighbors;
3) Classify: count how often each class appears among the K nearest points, and return the most frequent class as the predicted class of the test object.
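For n-dimensional feature vectors x and y, the Euclidean distance used in step 1 is:

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

Other metrics (e.g., Manhattan distance) can be substituted without changing the rest of the procedure.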

Pros and cons of kNN:
1. Advantages
1.1 Simple: easy to understand and implement, with no parameters to estimate and no training phase;
1.2 Well suited to classifying rare events (e.g., building a churn-prediction model when the churn rate is very low, say under 0.5%);
1.3 Especially suitable for multi-class problems (multi-modal, where objects carry multiple class labels); for example, when classifying gene function from gene features, kNN has been reported to outperform SVM.
2. Disadvantages
2.1 A lazy algorithm: classifying a test sample requires heavy computation and memory, so scoring is slow;
2.2 Poor interpretability: it cannot produce explicit rules the way a decision tree can.

Example 1:

Given the following labeled (prior) data, use the kNN algorithm to classify the samples whose class is unknown.

Attribute 1    Attribute 2    Class
1.0            0.9            A
1.0            1.0            A
0.1            0.2            B
0.0            0.1            B

Unknown-class data:

Attribute 1    Attribute 2    Class
1.2            1.0            ?
0.1            0.3            ?
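Working the first query by hand (Euclidean distances to the four training points):

d((1.2, 1.0), (1.0, 0.9)) = sqrt(0.04 + 0.01) ≈ 0.224
d((1.2, 1.0), (1.0, 1.0)) = sqrt(0.04 + 0.00) = 0.200
d((1.2, 1.0), (0.1, 0.2)) = sqrt(1.21 + 0.64) ≈ 1.360
d((1.2, 1.0), (0.0, 0.1)) = sqrt(1.44 + 0.81) = 1.500

With K = 3 the nearest neighbors are two A samples and one B sample, so (1.2, 1.0) is classified as A; the same procedure assigns (0.1, 0.3) to class B.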

Create a new file knn.py:


#coding:utf-8

import numpy as np


# Create a data set: 4 samples in 2 classes
def createDataSet():
    # each row of the matrix is one sample
    group = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    # class labels of the 4 samples
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# kNN classification function
def KNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    ## step 1: compute distances
    # tile(A, reps) builds an array by repeating A reps times;
    # here it replicates newInput into numSamples rows to match dataSet
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # element-wise difference
    squareDiff = diff ** 2  # square the differences
    squareDist = np.sum(squareDiff, axis=1)  # sum along each row

    distance = squareDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that sort the array in ascending order
    sortedDistIndices = np.argsort(distance)
    classCount = {}  # dictionary mapping class label -> vote count
    for i in range(k):
        ## step 3: pick the k nearest neighbors
        voteLabel = labels[sortedDistIndices[i]]

        ## step 4: count how often each class appears among the k neighbors
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the most frequent class label
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxLabel = key

    return maxLabel


# Build the data set and class labels
dataSet, labels = createDataSet()
# A sample with unknown class
testX = np.array([1.2, 1.0])
k = 3
# Classify the unknown sample
outputLabel = KNNClassify(testX, dataSet, labels, k)
print("Your input is:", testX, "and classified to class:", outputLabel)

testX = np.array([0.1, 0.3])
outputLabel = KNNClassify(testX, dataSet, labels, 3)
print("Your input is:", testX, "and classified to class:", outputLabel)

Example 2:

Use kNN to classify handwritten digits. The data set covers the digits 0-9, with roughly 200 samples per digit. Each handwritten image is originally a 32x32 binary image; after conversion to a txt file, the content is likewise a 32x32 grid of characters. The samples live in two directories: trainingDigits holds the training data, and testDigits holds the test data.

Data set link: http://download.csdn.net/detail/piaoxuezhong/9745648
Create a new script knn.py containing four functions: one implements the kNN classifier, one converts each sample's txt file into a vector, one loads the whole data set, and one runs the test.

import numpy as np
import os  

  
# classify a new input using kNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0] # shape[0] gives the number of rows

    #step 1: calculate the Euclidean distance
    diff = np.tile(newInput, (numSamples, 1)) - dataSet # subtract element-wise
    squaredDiff = diff ** 2 # square the differences
    squaredDist = np.sum(squaredDiff, axis = 1) # sum along each row
    distance = squaredDist ** 0.5

    #step 2: sort the distances
    sortedDistIndices = np.argsort(distance)

    classCount = {} # dictionary mapping class label -> vote count
    for i in range(k):
        #step 3: choose the k samples with the smallest distances
        voteLabel = labels[sortedDistIndices[i]]

        #step 4: count the times each label occurs
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    #step 5: return the class with the most votes
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxLabel = key

    return maxLabel
  
# convert a 32x32 digit text file to a 1x1024 vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = np.zeros((1, rows * cols))
    fileIn = open(filename)
    for row in range(rows):
        lineStr = fileIn.readline()
        for col in range(cols):
            imgVector[0, row * 32 + col] = int(lineStr[col])
    fileIn.close()
    return imgVector
  
# load the training and test data sets
def loadDataSet():
    #step 1: getting the training set
    print("Getting training set...")
    dataSetDir = '/data/temp/digits/'
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')
    numSamples = len(trainingFileList)

    train_x = np.zeros((numSamples, 1024))
    train_y = []
    for i in range(numSamples):
        filename = trainingFileList[i]

        # get train_x
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)

        # get the label from the file name, e.g. "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        train_y.append(label)

    #step 2: getting the testing set
    print("Getting testing set...")
    testingFileList = os.listdir(dataSetDir + 'testDigits')
    numSamples = len(testingFileList)
    test_x = np.zeros((numSamples, 1024))
    test_y = []
    for i in range(numSamples):
        filename = testingFileList[i]

        # get test_x
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)

        # get the label from the file name, e.g. "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        test_y.append(label)

    return train_x, train_y, test_x, test_y
  
# test the handwriting classifier
def testHandWritingClass():
    #step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()

    #step 2: training
    print("step 2: training...")
    pass  # kNN is a lazy learner: there is no explicit training step

    #step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    #step 4: show the result
    print("step 4: show the result...")
    print('The classify accuracy is: %.2f%%' % (accuracy * 100))
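A design note: the test loop above calls kNNClassify once per sample, re-tiling the training matrix every time. A vectorized alternative (a sketch, not part of the original script; knnPredictAll is an illustrative name) computes the whole test-by-train distance matrix at once using the expansion ||a - b||^2 = ||a||^2 - 2*a.b + ||b||^2:

import numpy as np

def knnPredictAll(test_x, train_x, train_y, k=3):
    # (numTest, numTrain) matrix of squared Euclidean distances
    sq = ((test_x ** 2).sum(axis=1)[:, None]
          - 2.0 * test_x @ train_x.T
          + (train_x ** 2).sum(axis=1)[None, :])
    dists = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding
    # for each test sample, indices of the k nearest training samples
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighborLabels = np.asarray(train_y)[nearest]  # (numTest, k)
    preds = []
    for row in neighborLabels:  # majority vote per test sample
        values, counts = np.unique(row, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.asarray(preds)

Apart from tie-breaking among equally frequent neighbor labels, the predictions match the per-sample loop; only the repeated distance computation is eliminated.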


  
Finally, create a knn_test.py file to exercise the kNN implementation:
import knn
knn.testHandWritingClass()  
Running it (F5 in the IDE) produces:
>>> 
step 1: load data...
Getting training set...
Getting testing set...
step 2: training...
step 3: testing...
step 4: show the result...
The classify accuracy is: 98.84%

