Building Machine Learning Systems in Python (2)

Building more complex classifiers

So far we have used a very simple model: a single threshold on one feature dimension. There are many other models worth looking at.

What makes up a classification model? We can break it down into three parts:

  • The structure of the model: in the earlier example, a threshold on a single feature.
  • The search procedure: in the earlier example, we tried every combination of feature and threshold.
  • The loss function: it decides which mistakes to penalize and by how much; a simple choice is the training error, and we usually want the loss to be as small as possible.

We can mix and match these parts to get different classifiers. For example, we could look for the threshold that minimizes the training error, but test only three candidate values per feature: the mean, the mean plus one standard deviation, and the mean minus one standard deviation. This is a reasonable compromise, because testing every possible value would be too expensive once we have a lot of data.
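As a rough sketch of what this restricted search could look like (this is not code from the book; the function name, the target argument that marks one class against the rest, and the use of accuracy instead of error are choices made purely for illustration):

import numpy as np

def best_restricted_threshold(features, labels, target):
    # Only three candidate thresholds per feature: the mean and the mean +/- one standard deviation.
    is_target = (labels == target)
    best = (-1.0, None, None)   # (training accuracy, feature index, threshold)
    for fi in range(features.shape[1]):
        column = features[:, fi]
        mu, sigma = column.mean(), column.std()
        for t in (mu - sigma, mu, mu + sigma):
            acc = np.mean((column > t) == is_target)   # maximizing accuracy = minimizing training error
            best = max(best, (acc, fi, t))
    return best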

Alternatively, we could use a different loss function, because one kind of mistake can be much more costly than the other: false negatives and false positives are not the same. A false negative is a test that wrongly comes back negative, for example concluding that a patient does not have a disease when in fact they do (a miss). A false positive is a test that wrongly comes back positive, for example concluding that a patient has a disease when in fact they do not (a false alarm).

You want to minimize the false negatives as much as you can, because they are the more dangerous mistake.
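One way to encode that preference is a loss function that charges false negatives more than false positives; the threshold would then be chosen to minimize this loss rather than the plain training error. A minimal sketch (the cost values 5.0 and 1.0 are arbitrary, not from the book):

import numpy as np

def weighted_loss(predictions, labels, fn_cost=5.0, fp_cost=1.0):
    # predictions and labels are boolean arrays (True = positive)
    false_negatives = np.sum(~predictions & labels)   # missed positives
    false_positives = np.sum(predictions & ~labels)   # false alarms
    return fn_cost * false_negatives + fp_cost * false_positives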

A more complex dataset and a more complex classifier

Next we look at a slightly more complex dataset, and use it to introduce another classification algorithm.

Seeds Dataset

This is an agricultural dataset. It is still small, but already too large to simply plot the way we did with the Iris flowers. It contains measurements of wheat seeds, with seven features:

  • Area(A)
  • Perimeter (P)
  • Compactness (\(C = 4\pi A / P^2\))
  • Length of kernel
  • Width of kernel
  • Asymmetry coefficient
  • Length of kernel groove

There are three wheat varieties: Canadian, Koma, and Rosa. As before, we want to classify the seeds based on their morphology.

Features and Feature engineering

Interestingly, compactness is not really a new measurement: it is derived from area and perimeter. Deriving new combined features like this is often very useful and is known as feature engineering. It is rarely as glamorous as the algorithms themselves, but it often matters more for the final performance: a well-chosen feature with a simple algorithm can beat a fancy algorithm fed a poor feature.

Compactness is also called roundness and is a standard way to characterize shape. For example, if kernel A is twice the size of kernel B but has the same shape, their compactness values are identical. The rounder a kernel is, the closer its compactness is to 1; the more elongated it is, the closer it is to 0.

A feature like this captures roundness directly, without being affected by size.
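A quick check of that claim (made-up numbers: kernel B has the same shape as kernel A but twice the linear size, so four times the area and twice the perimeter):

import numpy as np

def compactness(area, perimeter):
    return 4 * np.pi * area / perimeter ** 2

print(compactness(10.0, 12.0))           # kernel A
print(compactness(4 * 10.0, 2 * 12.0))   # kernel B: same shape, bigger size, same compactness (~0.873)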

Also note that on larger projects you will need feature selection: when there are many features, pick out the ones that actually matter.
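A possible sketch of such a selection step (not from the book; it assumes the features and labels arrays we load further down with load_dataset('seeds'), and k=4 is an arbitrary choice):

from sklearn.feature_selection import SelectKBest, f_classif

# score each of the seven features with a univariate ANOVA F-test and keep the best four
selector = SelectKBest(score_func=f_classif, k=4)
reduced_features = selector.fit_transform(features, labels)
print(reduced_features.shape)   # (number of samples, 4)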

Nearest neighbor classification

On this dataset the simple threshold from before no longer gives good results, so we switch to a nearest neighbor classifier.

If we think of each data point as a point in N-dimensional space, we can compute the distance between any two points:

import numpy as np

def distance(p0, p1):
    # squared Euclidean distance; taking the square root is unnecessary
    # when we only want to compare distances
    return np.sum((p0 - p1)**2)

p0 = np.array([1, 2])
p1 = np.array([0, 0])
print(distance(p0, p1))  # 5

def nn_classify(training_set, training_labels, new_example):
    # 1-nearest-neighbor: return the label of the closest training point
    dists = np.array([distance(t, new_example) for t in training_set])
    nearest = dists.argmin()
    return training_labels[nearest]

def load_dataset(dataset_name):
    '''
    data,labels = load_dataset(dataset_name)

    Load a given dataset

    Returns
    -------
    data : numpy ndarray
    labels : list of str
    '''
    data = []
    labels = []
    with open('{0}.tsv'.format(dataset_name)) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    data = np.array(data)
    labels = np.array(labels)
    return data, labels

def fit_model(k, features, labels):
    '''Learn a k-nn model'''
    # There is no model in k-nn, just a copy of the inputs
    return k, features.copy(), labels.copy()


def plurality(xs):
    '''Find the most common element in a collection'''
    from collections import defaultdict
    counts = defaultdict(int)
    for x in xs:
        counts[x] += 1
    maxv = max(counts.values())
    for k in counts.keys():
        if counts[k] == maxv:
            return k
# here k is how many nearest neighbors vote on the label
def predict(features, model):
    """Apply k-nn model"""
    # predict labels for the testing features;
    # the model holds k plus the training features and training labels
    k, train_feats, labels = model
    results = []
    for f in features:  # features here are the testing features
        label_dist = []
        # compute the distance from f to every training point
        for t, ell in zip(train_feats, labels):
            # use the vector norm (Euclidean distance) as the distance measure;
            # f and t each contain seven numbers, one per feature
            label_dist.append((np.linalg.norm(f - t), ell))
        label_dist.sort(key=lambda d_ell: d_ell[0])
        # keep only the k nearest points
        label_dist = label_dist[:k]
        # the most common label among these k points becomes the prediction
        results.append(plurality([ell for _, ell in label_dist]))
    return np.array(results)
        
def accuracy(features, labels, model):
    # features and labels here are the testing features and testing labels
    preds = predict(features, model)
    return np.mean(preds == labels)

That is the whole kNN model; now let's apply it:

features, labels = load_dataset('seeds')

def cross_validate(features, labels):
    # 10-fold cross-validation; accuracy() returns the fraction of correct
    # predictions, so what we accumulate here is the mean accuracy
    acc = 0.0
    for fold in range(10):
        training = np.ones(len(features), bool)
        training[fold::10] = 0  # samples at positions fold, fold+10, fold+20, ... form the test fold
        testing = ~training
        model = fit_model(1, features[training], labels[training])
        acc += accuracy(features[testing], labels[testing], model)

    return acc / 10.0

mean_accuracy = cross_validate(features, labels)
mean_accuracy

0.89523809523809528

So far we have put all seven features into one vector and measured distance with the norm. But the features live on very different scales, so we can normalize them to a common scale, for example by converting each value to a z-score. The z-score of a value says how many standard deviations it lies away from the mean.

z_score_features = features.copy()
z_score_features -= features.mean(0)           # subtract the per-feature mean
z_score_features /= z_score_features.std(0)    # divide by the per-feature standard deviation
z_score_accuracy = cross_validate(z_score_features, labels)
z_score_accuracy

0.94285714285714284

The normalization step: \(f' = \frac{f - \mu}{\sigma}\)

(Plot: the features before normalizing to z-scores.)

(Plot: the features after normalizing to z-scores.)

We can also do this with sklearn:

from sklearn.cross_validation import KFold
from sklearn.neighbors import KNeighborsClassifier

features, labels = load_dataset('seeds')
classifier = KNeighborsClassifier(n_neighbors=1)
kf = KFold(len(features), n_folds=5, shuffle=True)
means = []
for training, testing in kf:
    # we fit a model for this fold, then apply it to the testing data with 'predict'
    classifier.fit(features[training], labels[training])
    prediction = classifier.predict(features[testing])

    # np.mean on an array of booleans returns the fraction of correct decisions for this fold
    curmean = np.mean(prediction == labels[testing])
    means.append(curmean)
# the result changes from run to run because shuffle=True assigns samples to folds
# at random, instead of always taking every tenth sample starting from position 0
res = np.mean(means)
print '{:.1%}'.format(res)

91.9%

Using leave-one-out:

n = len(features)
correct = 0.0
for ei in range(n):
    # train on every sample except ei, then test on sample ei alone
    training = np.ones(n, bool)
    training[ei] = 0
    testing = ~training
    classifier.fit(features[training], labels[training])
    pred = classifier.predict(features[testing])
    correct += np.sum(pred == labels[testing])

correct = correct / n
print '{:.1%}'.format(correct)

90.5%
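The same leave-one-out evaluation can also be written with sklearn's helpers; a sketch using the same old-style sklearn.cross_validation API as above (newer versions moved these classes into sklearn.model_selection):

from sklearn.cross_validation import LeaveOneOut, cross_val_score

loo = LeaveOneOut(len(features))
scores = cross_val_score(classifier, features, labels, cv=loo)
print '{:.1%}'.format(scores.mean())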

Normalization with sklearn

# z-score 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
classifier = KNeighborsClassifier(n_neighbors=1)
classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])
# The Pipeline constructor takes a list of pairs (str,clf). Each pair corresponds to a step in
# the pipeline: the first element is a string naming the step, while the second element is the
# object that performs the transformation.

The Pipeline wraps the classifier so that the data is normalized before it ever reaches the kNN step.

After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions.
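A quick way to convince ourselves that StandardScaler performs the same z-scoring we did by hand earlier (a small check, not from the original post):

from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(features)
print(np.allclose(scaled, z_score_features))   # True: identical to the manual (f - mean) / std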

kf = KFold(len(features), n_folds=5, shuffle=True)
means = []
for training, testing in kf:
    # we fit a model for this fold, then apply it to the testing data with 'predict'
    classifier.fit(features[training], labels[training])
    prediction = classifier.predict(features[testing])

    # np.mean on an array of booleans returns the fraction of correct decisions for this fold
    curmean = np.mean(prediction == labels[testing])
    means.append(curmean)
res = np.mean(means)
print '{:.1%}'.format(res)

93.8%

While a few dimensions dominated the distances in the original data, after normalization all seven features carry the same weight.
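We can see this directly by comparing the per-feature spread before and after scaling (a quick check, not from the original post):

print(features.std(0))           # raw features: the spreads differ widely from feature to feature
print(z_score_features.std(0))   # after z-scoring: every feature has standard deviation 1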
