
The Label Propagation Algorithm and a Python Implementation

Date: 2018-07-24  Source: web

As is well known, machine learning can be roughly divided into three categories: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning we have plenty of labeled data with which to train a model, hoping the model learns the data distribution well enough to predict on unseen samples. The source of that performance, the training data, is therefore critically important: you need enough training data to cover the sample distribution of the real data, or the learned model is meaningless. Unsupervised learning has no labeled data at all; this is the familiar clustering setting, where samples are grouped into classes using only their own distribution. Semi-supervised learning, as the name suggests, sits between the two: only a small amount of data is labeled, and we try to learn something useful from that small labeled set together with a large unlabeled set.


I. Semi-supervised Learning

Semi-supervised learning comes into play when some of your data has labels and some does not, and typically the vast majority is unlabeled with only a handful of labeled points. A semi-supervised algorithm makes full use of the unlabeled data to capture the underlying distribution of the whole dataset. It rests on three assumptions:

1) Smoothness: similar data points have the same label.

2) Cluster: data points in the same cluster have the same label.

3) Manifold: data points on the same manifold structure have the same label.

For example, suppose we have only two labeled data points. If we train a classifier on them directly, say LR or SVM, we get a decision boundary like the left panel of the original figure (not reproduced here). If the real data is actually distributed as in the right panel, anyone can see at a glance that the classifier from the left panel is hopelessly bad, because the labeled training data is far too sparse to cover the situations we will meet later. But if, as in the right panel, we bring the large amount of unlabeled data (shown in black) into the picture and take a global view, a good algorithm will discover that the data actually forms two rings, each lying on its own circular manifold, and will assign the outer ring to the red class and the inner ring to the blue class. In practice labeled data is expensive and hard to obtain, while unlabeled data is not (a quick crawling script will fetch plenty from the web), so being able to exploit masses of unlabeled data to boost model learning is enormously valuable.

There are many semi-supervised learning algorithms; below we introduce the simplest of them, label propagation. I like simple best, haha.

II. The Label Propagation Algorithm

The core idea of label propagation (LP) is very simple: similar data points should have the same label. The LP algorithm has two main steps: 1) build a similarity matrix; 2) propagate away.

2.1 Building the Similarity Matrix

The LP algorithm is graph-based, so we first need to build a graph. We build one graph over all of the data: each node is a data point, labeled or unlabeled, and the edge between nodes i and j carries their similarity. There are many ways to build this graph. Here we assume the graph is fully connected, with the edge weight between nodes i and j given by

    w_ij = exp(-||x_i - x_j||^2 / α^2)

where α is a hyperparameter.

Another very common construction is the kNN graph: keep only the weights to each node's k nearest neighbours and set everything else to 0, meaning no edge, which yields a sparse similarity matrix.
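Before the full implementation in Section III, here is a minimal NumPy sketch of the two constructions, assuming X is an (N, d) array holding all of the data points; the helper names are illustrative, not part of the original code:

    import numpy as np

    def rbf_weights(X, alpha):
        # fully connected graph: w_ij = exp(-||x_i - x_j||^2 / alpha^2)
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / alpha ** 2)

    def knn_weights(X, k):
        # kNN graph: keep only each node's k nearest neighbours (weight 1/k), no other edges
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        W = np.zeros_like(sq_dists)
        neighbors = np.argsort(sq_dists, axis=1)[:, 1:k + 1]  # column 0 is the point itself
        W[np.arange(X.shape[0])[:, None], neighbors] = 1.0 / k
        return W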

2.2 The LP Algorithm

Label propagation itself is very simple: labels are propagated along the edges between nodes. The larger an edge weight, the more similar the two nodes, and the more easily the label flows across. We define an N×N probability transition matrix P:

    P_ij = w_ij / Σ_k w_ik

where P_ij is the probability of moving from node i to node j. Suppose there are C classes and L labeled samples. We define an L×C label matrix Y_L whose i-th row is the label indicator vector of sample i: if sample i belongs to class j, the j-th entry of that row is 1 and the others are 0. Likewise we give the U unlabeled samples a U×C label matrix Y_U. Stacking them gives an N×C soft label matrix F = [Y_L; Y_U]. "Soft" means we keep the probability of sample i belonging to each class, rather than exclusively assigning it to one class with probability 1; of course, when we finally decide the class of sample i, we take the max, the class with the highest probability. F contains Y_U, which is unknown at the start, so what should its initial value be? It doesn't matter; any value will do.

After much anticipation, here is the simple LP algorithm:

1) Propagate: F = PF

2) Reset the labels of the labeled samples in F: F_L = Y_L

3) Repeat steps 1) and 2) until F converges.

Step 1) multiplies matrix P by matrix F; in this step each node propagates its label to the other nodes with the probabilities given by P. The more similar two nodes are (the closer in Euclidean space), the more easily one node's label takes over the other's; like attracts like. Step 2) is crucial: the labels of the labeled data are fixed in advance and must not be dragged away, so after every propagation step they are reset to their original values. As the labeled data keeps pushing its labels outward, the final class boundary ends up crossing high-density regions and settling in the low-density gaps; in effect the labeled samples of each class carve out their spheres of influence.
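For reference, here is a minimal NumPy sketch of this three-step loop, assuming W is the N×N weight matrix with the L labeled points stacked first and Y_L is the L×C one-hot label matrix (function and variable names are illustrative):

    import numpy as np

    def propagate_labels(W, Y_L, max_iter=500, tol=1e-3):
        L, C = Y_L.shape
        P = W / W.sum(axis=1, keepdims=True)    # row-normalized transition matrix
        F = np.zeros((W.shape[0], C))
        F[:L] = Y_L                             # Y_U may start at any value; zeros are fine
        for _ in range(max_iter):
            F_new = P @ F                       # step 1: propagate, F = PF
            F_new[:L] = Y_L                     # step 2: clamp the labeled rows, F_L = Y_L
            if np.abs(F_new - F).sum() < tol:   # step 3: stop once F has converged
                break
            F = F_new
        return F_new

Hard labels for the unlabeled points are then propagate_labels(W, Y_L)[L:].argmax(axis=1).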

2.3 A Variant of the LP Algorithm

Notice that every iteration computes the full soft label matrix F = [Y_L; Y_U], but Y_L is already known; computing it is wasted work, and step 2) has to restore it anyway. All we care about is Y_U, so can we compute only Y_U? Yes. Partition the matrix P as

    P = | P_LL  P_LU |
        | P_UL  P_UU |

Then the algorithm reduces to a single operation:

    F_U ← P_UU F_U + P_UL Y_L

Iterate this single step until convergence and we are done. Pretty cool, right? Notice that F_U depends not only on the labels of the labeled data and their transition probabilities, but also on the current labels and transition probabilities of the unlabeled data. This is how LP makes additional use of the distribution of the unlabeled data.

The convergence of this algorithm is also easy to prove; see reference [1] for details. In fact it converges to a closed-form solution:

    F_U = (I - P_UU)^(-1) P_UL Y_L

So we could also solve for the final Y_U directly in one shot. In practice, however, matrix inversion costs O(n^3), so when there is a lot of unlabeled data, inverting I - P_UU becomes very expensive, and the iterative algorithm is usually chosen instead.
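Under the same naming assumptions as the sketch above, the closed-form solve takes only a few lines; np.linalg.solve avoids forming the inverse explicitly but is still cubic in U, which is why the iterative version wins when the unlabeled set is large:

    import numpy as np

    def propagate_labels_closed_form(W, Y_L):
        L = Y_L.shape[0]
        P = W / W.sum(axis=1, keepdims=True)
        P_UL, P_UU = P[L:, :L], P[L:, L:]
        # solve (I - P_UU) F_U = P_UL Y_L rather than inverting I - P_UU
        F_U = np.linalg.solve(np.eye(P_UU.shape[0]) - P_UU, P_UL @ Y_L)
        return F_U.argmax(axis=1)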

III. A Python Implementation of the LP Algorithm

I won't ramble about setting up the Python environment; see my earlier posts. The only extra dependencies are the classic numpy and matplotlib. The code includes both graph constructions, RBF and kNN, and generates two toy datasets itself: two elongated bands and two rings. In Section IV we will experiment with a larger dataset; here we first verify the correctness of the code with a simple visualization before pressing on.

Algorithm code:

    #***************************************************************************
    #*
    #* Description: label propagation
    #* Author: Zou Xiaoyi ([email protected])
    #* Date: 2015-10-15
    #* HomePage: http://blog.csdn.net/zouxy09
    #*
    #**************************************************************************
    import numpy as np

    # return the indices of the k nearest neighbors of query within dataSet
    def navie_knn(dataSet, query, k):
        numSamples = dataSet.shape[0]

        ## step 1: calculate squared Euclidean distances
        diff = np.tile(query, (numSamples, 1)) - dataSet
        squaredDiff = diff ** 2
        squaredDist = np.sum(squaredDiff, axis=1)  # sum is performed by row

        ## step 2: sort the distances
        sortedDistIndices = np.argsort(squaredDist)
        if k > len(sortedDistIndices):
            k = len(sortedDistIndices)
        return sortedDistIndices[0:k]

    # build a big graph (normalized weight matrix)
    def buildGraph(MatX, kernel_type, rbf_sigma=None, knn_num_neighbors=None):
        num_samples = MatX.shape[0]
        affinity_matrix = np.zeros((num_samples, num_samples), np.float32)
        if kernel_type == 'rbf':
            if rbf_sigma is None:
                raise ValueError('You should input a sigma of rbf kernel!')
            for i in range(num_samples):
                row_sum = 0.0
                for j in range(num_samples):
                    diff = MatX[i, :] - MatX[j, :]
                    affinity_matrix[i][j] = np.exp(sum(diff ** 2) / (-2.0 * rbf_sigma ** 2))
                    row_sum += affinity_matrix[i][j]
                affinity_matrix[i][:] /= row_sum  # row-normalize so each row sums to 1
        elif kernel_type == 'knn':
            if knn_num_neighbors is None:
                raise ValueError('You should input a k of knn kernel!')
            for i in range(num_samples):
                k_neighbors = navie_knn(MatX, MatX[i, :], knn_num_neighbors)
                affinity_matrix[i][k_neighbors] = 1.0 / knn_num_neighbors
        else:
            raise NameError('Not support kernel type! You can use knn or rbf!')
        return affinity_matrix

    # label propagation
    def labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='rbf', rbf_sigma=1.5,
                         knn_num_neighbors=10, max_iter=500, tol=1e-3):
        # initialize
        num_label_samples = Mat_Label.shape[0]
        num_unlabel_samples = Mat_Unlabel.shape[0]
        num_samples = num_label_samples + num_unlabel_samples
        labels_list = np.unique(labels)
        num_classes = len(labels_list)

        MatX = np.vstack((Mat_Label, Mat_Unlabel))
        clamp_data_label = np.zeros((num_label_samples, num_classes), np.float32)
        for i in range(num_label_samples):
            clamp_data_label[i][labels[i]] = 1.0  # one-hot rows for the labeled data

        label_function = np.zeros((num_samples, num_classes), np.float32)
        label_function[0:num_label_samples] = clamp_data_label
        label_function[num_label_samples:num_samples] = -1  # arbitrary initial Y_U

        # graph construction
        affinity_matrix = buildGraph(MatX, kernel_type, rbf_sigma, knn_num_neighbors)

        # start to propagate
        iter = 0
        pre_label_function = np.zeros((num_samples, num_classes), np.float32)
        changed = np.abs(pre_label_function - label_function).sum()
        while iter < max_iter and changed > tol:
            if iter % 1 == 0:
                print('---> Iteration %d/%d, changed: %f' % (iter, max_iter, changed))
            pre_label_function = label_function
            iter += 1

            # propagation: F = PF
            label_function = np.dot(affinity_matrix, label_function)

            # clamp the labeled data: F_L = Y_L
            label_function[0:num_label_samples] = clamp_data_label

            # check convergence
            changed = np.abs(pre_label_function - label_function).sum()

        # read off the final label of each unlabeled point
        unlabel_data_labels = np.zeros(num_unlabel_samples)
        for i in range(num_unlabel_samples):
            unlabel_data_labels[i] = np.argmax(label_function[i + num_label_samples])

        return unlabel_data_labels

Test code:

    #***************************************************************************
    #*
    #* Description: label propagation
    #* Author: Zou Xiaoyi ([email protected])
    #* Date: 2015-10-15
    #* HomePage: http://blog.csdn.net/zouxy09
    #*
    #**************************************************************************
    import math
    import numpy as np
    from label_propagation import labelPropagation  # the algorithm code above, saved as label_propagation.py

    # show the result: diamonds are the labeled points, circles the unlabeled ones
    def show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels):
        import matplotlib.pyplot as plt

        for i in range(Mat_Label.shape[0]):
            if int(labels[i]) == 0:
                plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dr')
            elif int(labels[i]) == 1:
                plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Db')
            else:
                plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dy')

        for i in range(Mat_Unlabel.shape[0]):
            if int(unlabel_data_labels[i]) == 0:
                plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'or')
            elif int(unlabel_data_labels[i]) == 1:
                plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'ob')
            else:
                plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'oy')

        plt.xlabel('X1'); plt.ylabel('X2')
        plt.xlim(0.0, 12.)
        plt.ylim(0.0, 12.)
        plt.show()

    # two concentric rings, with one labeled point on each ring
    def loadCircleData(num_data):
        center = np.array([5.0, 5.0])
        radiu_inner = 2
        radiu_outer = 4
        num_inner = num_data // 3
        num_outer = num_data - num_inner

        data = []
        theta = 0.0
        for i in range(num_inner):
            pho = (theta % 360) * math.pi / 180
            tmp = np.zeros(2, np.float32)
            tmp[0] = radiu_inner * math.cos(pho) + np.random.rand(1) + center[0]
            tmp[1] = radiu_inner * math.sin(pho) + np.random.rand(1) + center[1]
            data.append(tmp)
            theta += 2

        theta = 0.0
        for i in range(num_outer):
            pho = (theta % 360) * math.pi / 180
            tmp = np.zeros(2, np.float32)
            tmp[0] = radiu_outer * math.cos(pho) + np.random.rand(1) + center[0]
            tmp[1] = radiu_outer * math.sin(pho) + np.random.rand(1) + center[1]
            data.append(tmp)
            theta += 1

        Mat_Label = np.zeros((2, 2), np.float32)
        Mat_Label[0] = center + np.array([-radiu_inner + 0.5, 0])
        Mat_Label[1] = center + np.array([-radiu_outer + 0.5, 0])
        labels = [0, 1]
        Mat_Unlabel = np.vstack(data)
        return Mat_Label, labels, Mat_Unlabel

    # two horizontal bands, with one labeled point in each band
    def loadBandData(num_unlabel_samples):
        Mat_Label = np.array([[5.0, 2.], [5.0, 8.0]])
        labels = [0, 1]
        num_dim = Mat_Label.shape[1]
        Mat_Unlabel = np.zeros((num_unlabel_samples, num_dim), np.float32)
        Mat_Unlabel[:num_unlabel_samples//2, :] = (np.random.rand(num_unlabel_samples//2, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[0]
        Mat_Unlabel[num_unlabel_samples//2:num_unlabel_samples, :] = (np.random.rand(num_unlabel_samples//2, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[1]
        return Mat_Label, labels, Mat_Unlabel

    # main function
    if __name__ == '__main__':
        num_unlabel_samples = 800
        #Mat_Label, labels, Mat_Unlabel = loadBandData(num_unlabel_samples)
        Mat_Label, labels, Mat_Unlabel = loadCircleData(num_unlabel_samples)

        ## Notice: when using 'rbf' as the kernel, the choice of the hyperparameter 'sigma' is very
        ## important! It should be chosen according to your dataset, specifically the distances between
        ## data points. I think it should ensure that each point has about 10 neighbours with a large
        ## enough w_ij. It also influences the speed of convergence. So maybe the 'knn' kernel is better!
        #unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='rbf', rbf_sigma=0.2)
        unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type='knn', knn_num_neighbors=10, max_iter=400)
        show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels)

Everything that needs a comment is commented in the code; if anything is unclear, you are welcome to get in touch. The results at different iteration counts are shown in the figures (not reproduced here):

Isn't it a beautiful propagation process?! Numerically, you can also see the gradual convergence as the iterations proceed; the values evolve as follows:

            ---> Iteration 0/400, changed: 1602.000000

            ---> Iteration 1/400, changed: 6.300182

            ---> Iteration 2/400, changed: 5.129996

            ---> Iteration 3/400, changed: 4.301994

            ---> Iteration 4/400, changed: 3.819295

            ---> Iteration 5/400, changed: 3.501743

            ---> Iteration 6/400, changed: 3.277122

            ---> Iteration 7/400, changed: 3.105952

            ---> Iteration 8/400, changed: 2.967030

            ---> Iteration 9/400, changed: 2.848606

            ---> Iteration 10/400, changed: 2.743997

            ---> Iteration 11/400, changed: 2.649270

            ---> Iteration 12/400, changed: 2.562057

            ---> Iteration 13/400, changed: 2.480885

            ---> Iteration 14/400, changed: 2.404774

            ---> Iteration 15/400, changed: 2.333075

            ---> Iteration 16/400, changed: 2.265301

            ---> Iteration 17/400, changed: 2.201107

            ---> Iteration 18/400, changed: 2.140209

            ---> Iteration 19/400, changed: 2.082354

            ---> Iteration 20/400, changed: 2.027376

            ---> Iteration 21/400, changed: 1.975071

            ---> Iteration 22/400, changed: 1.925286

            ---> Iteration 23/400, changed: 1.877894

            ---> Iteration 24/400, changed: 1.832743

            ---> Iteration 25/400, changed: 1.789721

            ---> Iteration 26/400, changed: 1.748706

            ---> Iteration 27/400, changed: 1.709593

            ---> Iteration 28/400, changed: 1.672284

            ---> Iteration 29/400, changed: 1.636668

            ---> Iteration 30/400, changed: 1.602668

            ---> Iteration 31/400, changed: 1.570200

            ---> Iteration 32/400, changed: 1.539179

            ---> Iteration 33/400, changed: 1.509530

            ---> Iteration 34/400, changed: 1.481182

            ---> Iteration 35/400, changed: 1.454066

            ---> Iteration 36/400, changed: 1.428120

            ---> Iteration 37/400, changed: 1.403283

            ---> Iteration 38/400, changed: 1.379502

            ---> Iteration 39/400, changed: 1.356734

            ---> Iteration 40/400, changed: 1.334906

            ---> Iteration 41/400, changed: 1.313983

            ---> Iteration 42/400, changed: 1.293921

            ---> Iteration 43/400, changed: 1.274681

            ---> Iteration 44/400, changed: 1.256214

            ---> Iteration 45/400, changed: 1.238491

            ---> Iteration 46/400, changed: 1.221474

            ---> Iteration 47/400, changed: 1.205126

            ---> Iteration 48/400, changed: 1.189417

            ---> Iteration 49/400, changed: 1.174316

            ---> Iteration 50/400, changed: 1.159804

            ---> Iteration 51/400, changed: 1.145844

            ---> Iteration 52/400, changed: 1.132414

            ---> Iteration 53/400, changed: 1.119490

            ---> Iteration 54/400, changed: 1.107032

            ---> Iteration 55/400, changed: 1.095054

            ---> Iteration 56/400, changed: 1.083513

            ---> Iteration 57/400, changed: 1.072397

            ---> Iteration 58/400, changed: 1.061671

            ---> Iteration 59/400, changed: 1.051324

            ---> Iteration 60/400, changed: 1.041363

            ---> Iteration 61/400, changed: 1.031742

            ---> Iteration 62/400, changed: 1.022459

            ---> Iteration 63/400, changed: 1.013494

            ---> Iteration 64/400, changed: 1.004836

            ---> Iteration 65/400, changed: 0.996484

            ---> Iteration 66/400, changed: 0.988407

            ---> Iteration 67/400, changed: 0.980592

            ---> Iteration 68/400, changed: 0.973045

            ---> Iteration 69/400, changed: 0.965744

            ---> Iteration 70/400, changed: 0.958682

            ---> Iteration 71/400, changed: 0.951848

            ---> Iteration 72/400, changed: 0.945227

            ---> Iteration 73/400, changed: 0.938820

            ---> Iteration 74/400, changed: 0.932608

            ---> Iteration 75/400, changed: 0.926590

            ---> Iteration 76/400, changed: 0.920765

            ---> Iteration 77/400, changed: 0.915107

            ---> Iteration 78/400, changed: 0.909628

            ---> Iteration 79/400, changed: 0.904309

            ---> Iteration 80/400, changed: 0.899143

            ---> Iteration 81/400, changed: 0.894122

            ---> Iteration 82/400, changed: 0.889259

            ---> Iteration 83/400, changed: 0.884530

            ---> Iteration 84/400, changed: 0.879933

            ---> Iteration 85/400, changed: 0.875464

            ---> Iteration 86/400, changed: 0.871121

            ---> Iteration 87/400, changed: 0.866888

            ---> Iteration 88/400, changed: 0.862773

            ---> Iteration 89/400, changed: 0.858783

            ---> Iteration 90/400, changed: 0.854879

            ---> Iteration 91/400, changed: 0.851084

            ---> Iteration 92/400, changed: 0.847382

            ---> Iteration 93/400, changed: 0.843779

            ---> Iteration 94/400, changed: 0.840274

            ---> Iteration 95/400, changed: 0.836842

            ---> Iteration 96/400, changed: 0.833501

            ---> Iteration 97/400, changed: 0.830240

            ---> Iteration 98/400, changed: 0.827051

            ---> Iteration 99/400, changed: 0.823950

            ---> Iteration 100/400, changed: 0.820906

            ---> Iteration 101/400, changed: 0.817946

            ---> Iteration 102/400, changed: 0.815053

            ---> Iteration 103/400, changed: 0.812217

            ---> Iteration 104/400, changed: 0.809437

            ---> Iteration 105/400, changed: 0.806724

            ---> Iteration 106/400, changed: 0.804076

            ---> Iteration 107/400, changed: 0.801480

            ---> Iteration 108/400, changed: 0.798937

            ---> Iteration 109/400, changed: 0.796448

            ---> Iteration 110/400, changed: 0.794008

            ---> Iteration 111/400, changed: 0.791612

            ---> Iteration 112/400, changed: 0.789282

            ---> Iteration 113/400, changed: 0.786984

            ---> Iteration 114/400, changed: 0.784728

            ---> Iteration 115/400, changed: 0.782516

            ---> Iteration 116/400, changed: 0.780355

            ---> Iteration 117/400, changed: 0.778216

            ---> Iteration 118/400, changed: 0.776139

            ---> Iteration 119/400, changed: 0.774087

            ---> Iteration 120/400, changed: 0.772072

            ---> Iteration 121/400, changed: 0.770085

            ---> Iteration 122/400, changed: 0.768146

            ---> Iteration 123/400, changed: 0.766232

            ---> Iteration 124/400, changed: 0.764356

            ---> Iteration 125/400, changed: 0.762504

            ---> Iteration 126/400, changed: 0.760685

            ---> Iteration 127/400, changed: 0.758889

            ---> Iteration 128/400, changed: 0.757135

            ---> Iteration 129/400, changed: 0.755406

IV. An MPI-Parallel Implementation of the LP Algorithm

Here we test the variant version of LP. From the formula we can see that the second term, P_UL Y_L, does not change across iterations, so it can be computed once up front, avoiding repeated work. Even so, every LP iteration must multiply a U×U matrix P_UU by a U×C matrix F_U. When the unlabeled data and the number of classes are both large, this is slow and consumes a great deal of memory. In addition, building the graph requires the pairwise similarities, an O(n^2) computation that also becomes substantial when the feature dimensionality is high. So we have to consider parallel processing, preferably on a cluster. How do we parallelize?

Parallelizing an algorithm generally takes one of two forms: data parallelism and model parallelism.

Data parallelism is easy to understand: partition the data and let each node process only part of it, for example computing each point's k nearest neighbours during graph construction. With 1000 samples and 20 CPU nodes, distribute the samples evenly, let each CPU node compute kNN for 50 samples, and merge everyone's results at the end (see the sketch below). The speedup is clearly considerable.
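Here is a toy mpi4py sketch of that partitioning, reusing the navie_knn helper from the code above (parallel_knn is an illustrative name): each rank computes the neighbours of its own slice of samples, and allgather merges the slices back in rank order:

    import numpy as np
    import mpi4py.MPI as MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    def parallel_knn(X, k):
        # split the sample range evenly over the ranks
        offsets = np.linspace(0, X.shape[0], size + 1).astype(int)
        local = [navie_knn(X, X[i, :], k) for i in range(offsets[rank], offsets[rank + 1])]
        # every rank receives every slice, concatenated in rank order
        return [nb for part in comm.allgather(local) for nb in part]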

Model parallelism usually arises when the model is too large to fit in a single machine's memory. When training a huge deep neural network, for instance, the network is split up, gradients are computed separately, and a leader node collects everyone's gradients and feeds the updates back (with more refined and efficient engineering schemes in practice, of course). Our LP algorithm also admits model parallelism: if the number of classes C is large, splitting the classes across CPU nodes is effectively model parallelism.

Why split the small matrix F_U rather than the big matrix P_UU? Because P_UU cannot be split into independent blocks, and a basic principle of parallelism is that each piece of work must be independent. F_U depends on all of U, so if P_UU were scattered across nodes, every update of F_U would need to communicate with the other nodes, and that communication is expensive. (In fact, communication is the bottleneck that keeps many parallel systems from reaching linear speedup; linear speedup means that adding n machines makes things n times faster.) Splitting along the classes C, i.e., the columns of F_U, has no such problem, because those computations are independent; only when deciding each sample's final class do we gather all the pieces of F_U and take the max, as sketched below.
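A compact sketch of that model-parallel split, assuming the comm/rank/size globals above, a dense U×U block P_UU, and the precomputed U×C product PY = P_UL Y_L (all names illustrative): each rank iterates only its own class columns, and each sample's final label is the class with the largest gathered score:

    import numpy as np

    def model_parallel_lp(P_UU, PY, n_iter=100):
        C = PY.shape[1]
        off = np.linspace(0, C, size + 1).astype(int)
        cols = slice(off[rank], off[rank + 1])         # this rank's class columns
        F_local = np.zeros((P_UU.shape[0], off[rank + 1] - off[rank]))
        for _ in range(n_iter):
            F_local = P_UU @ F_local + PY[:, cols]     # class columns iterate independently
        # gather each rank's best class and score, then take the overall max per sample
        best = np.vstack(comm.allgather(off[rank] + F_local.argmax(axis=1)))
        score = np.vstack(comm.allgather(F_local.max(axis=1)))
        return best[score.argmax(axis=0), np.arange(P_UU.shape[0])]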

The code below therefore contains the rudiments of both data parallelism and model parallelism. Another point worth mentioning: since this is an iterative algorithm, how do we decide when to stop? Besides testing convergence, we can evaluate the result on test labels every few iterations to see how the overall training is going; this is especially useful for judging whether training has overfit. The code includes this as well.

At last, the code. Feel free to test it on some large datasets, better still if you have an MPI cluster available.

The code below depends on numpy, scipy (whose sparse matrices speed up the computation), and mpi4py. mpi4py itself depends on OpenMPI and Cython; see my earlier blog post for installation.

    #***************************************************************************
    #*
    #* Description: label propagation
    #* Author: Zou Xiaoyi ([email protected])
    #* Date: 2015-10-15
    #* HomePage: http://blog.csdn.net/zouxy09
    #*
    #**************************************************************************
    import pickle
    import numpy as np
    from scipy.sparse import csr_matrix, lil_matrix, eye
    import mpi4py.MPI as MPI

    #
    # Global variables for MPI
    #
    # instance for invoking MPI related functions
    comm = MPI.COMM_WORLD
    # the node rank in the whole community
    comm_rank = comm.Get_rank()
    # the size of the whole community, i.e., the total number of working nodes in the MPI cluster
    comm_size = comm.Get_size()

    # load mnist dataset
    def load_MNIST():
        import gzip
        f = gzip.open('mnist.pkl.gz', 'rb')
        # encoding='latin1' is needed to read this Python 2 era pickle under Python 3
        train, val, test = pickle.load(f, encoding='latin1')
        f.close()
        Mat_Label = train[0]
        labels = train[1]
        Mat_Unlabel = test[0]
        groundtruth = test[1]
        labels_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        return Mat_Label, labels, labels_id, Mat_Unlabel, groundtruth

    # return k neighbors index
    def navie_knn(dataSet, query, k):
        numSamples = dataSet.shape[0]

        ## step 1: calculate squared Euclidean distances
        diff = np.tile(query, (numSamples, 1)) - dataSet
        squaredDiff = diff ** 2
        squaredDist = np.sum(squaredDiff, axis=1)  # sum is performed by row

        ## step 2: sort the distances
        sortedDistIndices = np.argsort(squaredDist)
        if k > len(sortedDistIndices):
            k = len(sortedDistIndices)
        return sortedDistIndices[0:k]

    # build a big graph (normalized weight matrix)
    # sparse U x (U + L) matrix
    def buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors):
        num_unlabel_samples = Mat_Unlabel.shape[0]
        data = []; indices = []; indptr = [0]
        Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
        values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors
        for i in range(num_unlabel_samples):
            k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
            indptr.append(np.int32(indptr[-1]) + knn_num_neighbors)
            indices.extend(k_neighbors)
            data.append(values)
        return csr_matrix((np.hstack(data), indices, indptr))

    # build a big graph (normalized weight matrix)
    # sparse U x (U + L) matrix, with the rows divided over the MPI nodes
    def buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors):
        num_unlabel_samples = Mat_Unlabel.shape[0]
        local_data = []; local_indices = []; local_indptr = [0]
        Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
        values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors
        sample_offset = np.linspace(0, num_unlabel_samples, comm_size + 1).astype('int')
        for i in range(sample_offset[comm_rank], sample_offset[comm_rank+1]):
            k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
            local_indptr.append(np.int32(local_indptr[-1]) + knn_num_neighbors)
            local_indices.extend(k_neighbors)
            local_data.append(values)
        # flatten each rank's list of weight rows, then concatenate across ranks
        data = np.hstack([np.hstack(part) for part in comm.allgather(local_data)])
        indices = np.hstack(comm.allgather(local_indices))
        indptr_tmp = comm.allgather(local_indptr)
        indptr = []
        for i in range(len(indptr_tmp)):
            if i == 0:
                indptr.extend(indptr_tmp[i])
            else:
                last_indptr = indptr[-1]
                del(indptr[-1])
                indptr.extend(indptr_tmp[i] + last_indptr)
        return csr_matrix((data, indices, indptr), dtype=np.float32)

    # label propagation
    def run_label_propagation_sparse(knn_num_neighbors=20, max_iter=100, tol=1e-4, test_per_iter=1):
        # load data and graph
        print('Processor %d/%d loading graph file...' % (comm_rank, comm_size))
        #Mat_Label, labels, Mat_Unlabel, groundtruth = loadFourBandData()
        Mat_Label, labels, labels_id, Mat_Unlabel, unlabel_data_id = load_MNIST()
        if comm_size > len(labels_id):
            raise ValueError('Sorry, the number of processors must not exceed the number of classes')
        #affinity_matrix = buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors)
        affinity_matrix = buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors)

        # get some parameters
        num_classes = len(labels_id)
        num_label_samples = len(labels)
        num_unlabel_samples = Mat_Unlabel.shape[0]
        affinity_matrix_UL = affinity_matrix[:, 0:num_label_samples]
        affinity_matrix_UU = affinity_matrix[:, num_label_samples:num_label_samples+num_unlabel_samples]
        if comm_rank == 0:
            print('Have %d labeled images, %d unlabeled images and %d classes' % (num_label_samples, num_unlabel_samples, num_classes))

        # divide label_function_U and label_function_L over all processors
        class_offset = np.linspace(0, num_classes, comm_size + 1).astype('int')

        # initialize local label_function_U
        local_start_class = class_offset[comm_rank]
        local_num_classes = class_offset[comm_rank+1] - local_start_class
        local_label_function_U = eye(num_unlabel_samples, local_num_classes, 0, np.float32, format='csr')

        # initialize local label_function_L
        local_label_function_L = lil_matrix((num_label_samples, local_num_classes), dtype=np.float32)
        for i in range(num_label_samples):
            class_off = int(labels[i]) - local_start_class
            if class_off >= 0 and class_off < local_num_classes:
                local_label_function_L[i, class_off] = 1.0
        local_label_function_L = local_label_function_L.tocsr()
        # the second term P_UL * Y_L never changes, so compute it once
        local_label_info = affinity_matrix_UL.dot(local_label_function_L)
        print('Processor %d/%d has to process %d classes...' % (comm_rank, comm_size, local_label_function_L.shape[1]))

        # start to propagate
        iter = 1; changed = 100.0
        evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
        while True:
            pre_label_function = local_label_function_U.copy()

            # propagation
            local_label_function_U = affinity_matrix_UU.dot(local_label_function_U) + local_label_info

            # check convergence
            local_changed = abs(pre_label_function - local_label_function_U).sum()
            changed = comm.reduce(local_changed, root=0, op=MPI.SUM)
            status = 'RUN'
            test = False
            if comm_rank == 0:
                if iter % 1 == 0:
                    norm_changed = changed / (num_unlabel_samples * num_classes)
                    print('---> Iteration %d/%d, changed: %f' % (iter, max_iter, norm_changed))
                if iter >= max_iter or changed < tol:
                    status = 'STOP'
                    print('************** Iteration over! ****************')
                if iter % test_per_iter == 0:
                    test = True
            iter += 1
            # rank 0 decides whether to test and whether to stop; broadcast its decision
            test = comm.bcast(test if comm_rank == 0 else None, root=0)
            status = comm.bcast(status if comm_rank == 0 else None, root=0)
            if status == 'STOP':
                break
            if test == True:
                evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
        evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)

    def evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id):
        # get local label with max score
        if comm_rank == 0:
            print('Start to combine local results...')
        local_max_score = np.zeros((num_unlabel_samples, 1), np.float32)
        local_max_label = np.zeros((num_unlabel_samples, 1), np.int32)
        for i in range(num_unlabel_samples):
            local_max_label[i, 0] = np.argmax(local_label_function_U.getrow(i).todense())
            local_max_score[i, 0] = local_label_function_U[i, local_max_label[i, 0]]
            local_max_label[i, 0] += local_start_class

        # gather the results from all the processors
        if comm_rank == 0:
            print('Start to gather results from all processors')
        all_max_label = np.hstack(comm.allgather(local_max_label))
        all_max_score = np.hstack(comm.allgather(local_max_score))

        # get the final label of the unlabeled data
        if comm_rank == 0:
            print('Start to analyse the results...')
            right_predict_count = 0
            for i in range(num_unlabel_samples):
                if i % 1000 == 0:
                    print('***', all_max_score[i])
                max_idx = np.argmax(all_max_score[i])
                max_label = all_max_label[i, max_idx]
                if int(unlabel_data_id[i]) == int(labels_id[max_label]):
                    right_predict_count += 1
            accuracy = float(right_predict_count) * 100.0 / num_unlabel_samples
            print('Have %d samples, accuracy: %.3f%%!' % (num_unlabel_samples, accuracy))

    if __name__ == '__main__':
        run_label_propagation_sparse(knn_num_neighbors=20, max_iter=30)


