In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. The method decides the class of a sample to be classified based only on the class(es) of its nearest neighbor or neighbors. Consider the classic illustration: a green circle must be assigned to one of two classes, red triangles or blue squares. If K=3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle class; if K=5, blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue-square class.
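A minimal sketch of this majority vote in Python (the neighbor labels below are made up to match the illustration, ordered nearest first):

from collections import Counter

# hypothetical neighbor labels, nearest first, matching the illustration
neighbors = ['triangle', 'triangle', 'square', 'square', 'square']
for k in (3, 5):
    vote = Counter(neighbors[:k]).most_common(1)[0][0]
    print('K=%d -> %s' % (k, vote))  # K=3 -> triangle, K=5 -> square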
The idea behind KNN: given a training set whose samples and labels are known, compare the features of an input test sample against the features of every training sample, find the K training samples most similar to it, and assign the test sample the class that appears most often among those K samples. The algorithm can be described as follows:
1) Compute the distance between the test sample and every training sample;
2) Sort the distances in ascending order;
3) Select the K points with the smallest distances;
4) Determine the frequency of each class among those K points;
5) Return the most frequent class among the K points as the predicted class of the test sample.
More concisely, KNN boils down to three steps (a short library-based sketch follows the list):
1) Compute distances: given a test object, compute its distance to every object in the training set;
2) Find neighbors: take the k nearest training objects as the test object's neighbors;
3) Classify: count how often each class occurs among those K points, and return the most frequent class as the prediction for the test object.
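For comparison, these three steps are exactly what an off-the-shelf implementation performs; here is a minimal sketch with scikit-learn's KNeighborsClassifier (assuming scikit-learn is installed), using the four-sample data set from Example 1 below:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
y_train = ['A', 'A', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=3)  # K = 3
clf.fit(X_train, y_train)                  # lazy learner: fit just stores the data
print(clf.predict([[1.2, 1.0]]))           # expected: ['A']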
Advantages and disadvantages of KNN:
1. Advantages
1.1 Simple, easy to understand and implement; no parameters to estimate and no training phase;
1.2 Well suited to classifying rare events (for example, building a churn-prediction model when the churn rate is very low, say under 0.5%);
1.3 Particularly well suited to multi-class problems (multi-modal, where objects carry multiple class labels); for example, when classifying gene function from gene features, kNN performs better than SVM.
2. Disadvantages
2.1 It is a lazy algorithm: classifying a test sample is computationally expensive, memory-hungry, and slow to score (see the note after this list);
2.2 Poor interpretability: it cannot produce explicit rules the way a decision tree can.
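The scoring cost in 2.1 comes from deferring all work to query time: a brute-force kNN query against n training samples in d dimensions scans everything, costing O(n·d) per query. Spatial indexes mitigate this; for instance, scikit-learn's KNeighborsClassifier (assuming scikit-learn is available) can build a kd-tree once at fit time:

from sklearn.neighbors import KNeighborsClassifier

# 'kd_tree' builds a spatial index at fit time to speed up neighbor
# search on low-dimensional data; 'brute' is the naive per-query scan
clf = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')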
Example 1:
Given the following labeled data, use the KNN algorithm to classify the samples of unknown class.

Attribute 1 | Attribute 2 | Class
1.0 | 0.9 | A
1.0 | 1.0 | A
0.1 | 0.2 | B
0.0 | 0.1 | B

Samples of unknown class:

Attribute 1 | Attribute 2 | Class
1.2 | 1.0 | ?
0.1 | 0.3 | ?
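Before coding this up, it is worth checking the first query by hand. Using Euclidean distance, the distances from (1.2, 1.0) to the four training samples are:

- to (1.0, 0.9): sqrt(0.2^2 + 0.1^2) ≈ 0.224 (class A)
- to (1.0, 1.0): sqrt(0.2^2 + 0.0^2) = 0.200 (class A)
- to (0.1, 0.2): sqrt(1.1^2 + 0.8^2) ≈ 1.360 (class B)
- to (0.0, 0.1): sqrt(1.2^2 + 0.9^2) = 1.500 (class B)

With K=3 the nearest neighbors are A, A, B, so the majority vote yields class A. The same calculation for (0.1, 0.3) puts the two class-B samples nearest, so it is classified as B.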
Create a new file knn.py:
# coding: utf-8
import numpy as np

# create a data set with 4 samples in 2 classes
def createDataSet():
    # each row of the matrix is one sample
    group = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    # the class label of each of the 4 samples
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# KNN classification function
def KNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    ## step 1: compute the distances
    # tile(A, reps): build a matrix by repeating A reps times;
    # here it copies newInput into numSamples rows to match dataSet
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # element-wise difference
    squareDiff = diff ** 2                   # square the differences
    squareDist = np.sum(squareDiff, axis=1)  # sum along each row
    distance = squareDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that would sort the array in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # a dictionary used as a vote counter
    for i in range(k):
        ## step 3: pick the k nearest neighbors
        voteLabel = labels[sortedDistIndices[i]]
        ## step 4: count how often each class appears among the k neighbors
        # when the key voteLabel is not yet in classCount, get() returns 0
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
        #print(classCount)

    ## step 5: return the most frequent class label
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex

# build the data set and class labels
dataSet, labels = createDataSet()

# define a sample of unknown class
testX = np.array([1.2, 1.0])
k = 3
# call the classifier on the unknown sample
outputLabel = KNNClassify(testX, dataSet, labels, k)
print("Your input is:", testX, "and classified to class:", outputLabel)

testX = np.array([0.1, 0.3])
outputLabel = KNNClassify(testX, dataSet, labels, 3)
print("Your input is:", testX, "and classified to class:", outputLabel)
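Running knn.py should print something like the following, matching the hand calculation above:

Your input is: [1.2 1. ] and classified to class: A
Your input is: [0.1 0.3] and classified to class: B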
Example 2:
Use kNN to classify a handwritten-digit data set covering the digits 0-9, with roughly 200 samples per digit. Each handwritten image is a 32x32 binary image; after conversion to a txt file, each sample is stored as 32 lines of 32 characters. The samples sit in two directories: trainingDigits holds the training data and testDigits holds the test data.
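To make the format concrete, the first few lines of a training file such as 1_18.txt look something like this (a made-up excerpt; each real file has 32 lines of 32 characters):

00000000000001100000000000000000
00000000000011110000000000000000
00000000000011110000000000000000
00000000000001100000000000000000
... (28 more lines)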
Data set download link: http://download.csdn.net/detail/piaoxuezhong/9745648
Create a new knn.py script containing four functions: one implementing the kNN classification algorithm, one converting each sample's txt file into a vector, one loading the whole data set, and one running the test.
import numpy as np
import os

# classify using kNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] gives the number of rows

    # step 1: calculate Euclidean distance
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # subtract element-wise
    squaredDiff = diff ** 2                    # square the differences
    squaredDist = np.sum(squaredDiff, axis=1)  # sum along each row
    distance = squaredDist ** 0.5

    # step 2: sort the distances in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # a dictionary used as a vote counter
    for i in range(k):
        # step 3: take the k samples with the smallest distances
        voteLabel = labels[sortedDistIndices[i]]
        # step 4: count how many times each label occurs
        # when the key voteLabel is not yet in classCount, get() returns 0
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    # step 5: return the label with the most votes
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex

# convert a 32x32 text image to a 1x1024 vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = np.zeros((1, rows * cols))
    fileIn = open(filename)
    for row in range(rows):
        lineStr = fileIn.readline()
        for col in range(cols):
            imgVector[0, row * 32 + col] = int(lineStr[col])
    fileIn.close()
    return imgVector

# load the training and testing sets
def loadDataSet():
    # step 1: get the training set
    print("Getting training set...")
    dataSetDir = '/data/temp/digits/'
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')
    numSamples = len(trainingFileList)

    train_x = np.zeros((numSamples, 1024))
    train_y = []
    for i in range(numSamples):
        filename = trainingFileList[i]
        # get train_x
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)
        # get the label from the file name, e.g. "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        train_y.append(label)

    # step 2: get the testing set
    print("Getting testing set...")
    testingFileList = os.listdir(dataSetDir + 'testDigits')
    numSamples = len(testingFileList)
    test_x = np.zeros((numSamples, 1024))
    test_y = []
    for i in range(numSamples):
        filename = testingFileList[i]
        # get test_x
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)
        # get the label from the file name, e.g. "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        test_y.append(label)

    return train_x, train_y, test_x, test_y

# test the handwritten-digit classifier
def testHandWritingClass():
    # step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()

    # step 2: training (kNN is a lazy learner, so there is nothing to train)
    print("step 2: training...")
    pass

    # step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    # step 4: show the result
    print("step 4: show the result...")
    print('The classify accuracy is: %.2f%%' % (accuracy * 100))
In addition, create a knn_test.py file to exercise the kNN implementation:
import knn
knn.testHandWritingClass()
Running it (F5) produces:
>>>
step 1: load data...
Getting training set...
Getting testing set...
step 2: training...
step 3: testing...
step 4: show the result...
The classify accuracy is: 98.84%
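As a closing note, one easy improvement to both scripts (a sketch, not part of the original code): NumPy broadcasting subtracts a vector from every row of a matrix automatically, so the np.tile copy in step 1 is unnecessary:

import numpy as np

def euclideanDistances(newInput, dataSet):
    # broadcasting handles the row replication that np.tile did explicitly
    return np.sqrt(((dataSet - newInput) ** 2).sum(axis=1))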