基於手寫數字數據庫MNIST，應用最小二乘法模型對數字$0$和其他數字進行分類

1. 目的與要求#

目的： 圖像分類問題，基於手寫數字數據庫 MNIST, 應用最小二乘法模型對數字 $0$ 和其他數字進行分類。

MNIST 數據庫介紹： MNIST 數據庫來自美國國家標準與技術研究所，包含 70000 張 28×28（784 個像素）的手寫數字灰度圖像。圖像數據已經被轉化為 $28 × 28 = 784$ 維的向量形式存儲（$784$ 個特徵）；標籤以 $1$ 維向量形式存儲（每個元素表示圖片對應的數字）。

要求： 首先用最小二乘模型將 MNIST 數據庫中的數字 $0$ 和其他數字進行分類，給出模型在訓練集和測試集上的準確率；然後添加 $5000$ 個隨機特徵重新分類，得出新的準確率；最後對比兩次實驗結果，給出實驗結論。

2. 过程及结果#

2.1 應用最小二乘模型進行分類#

MNIST 數據庫將數據分成 2 組，一組是包含 $60000$ 張圖像的訓練集，另一組是包 $10000$ 張圖像的測試集。

特徵選擇：為了降低維度（維度太高，計算量太大）和噪聲（很多像素點的灰度值都是 $0$ 或接近 $0$，沒有太多信息），提高模型的效率和準確性，將特徵矩陣設置為 $494$ 維的向量，第 $1$ 維是常量 $1$（表示截距或偏置）, 其餘 $493$ 維是至少在 $600$ 張圖片中像素值不為 $0$ 的像素。如果圖像為數字 $0$, 則取 $y= 1$, 否則取 $y= −1$。
最下二乘法的標準模型為：

$Ax=b$，其最小二乘解為：$x=(A^TA)^{-1} A^Tb$
- $A$ 矩陣為特徵矩陣，通過訓練數據來構造
- $b$ 向量是對應的訓練數據數字標籤
這裡不直接求解其最小二乘解，採用 QR 分解法求解最小二乘解，以提高效率。

2.1.1 具體步驟#

從 MINIST 數據庫中加載已經劃分好的訓練數據集和測試數據集。

包括以下函數（列寫函數名稱，函數具體內容見文後完整代碼）：

decode_idx3_ubyte(idx3_ubyte_file):解析idx3文件的通用函數

decode_idx1_ubyte(idx1_ubyte_file):解析idx1文件通用函數

load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):返回訓練數據

load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):返回訓練數據標籤

load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):返回測試數據

load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):返回測試數據標籤
數據預處理：

維度轉換：將原始數據集的標籤和圖像數組進行 reshape 操作，將三維數組轉換為二維數組。

標籤轉換：將標籤轉換為 $1$ 和 $-1$，如果數字為 $0$，則轉換為 $1$，否則為 $-1$。

特徵選擇：找出至少在 $600$ 張圖片中不為 0 的像素位置，提高模型的效率和準確度

將特徵矩陣 $\mathcal {train_images}$ 設置為 $494$ 維的向量，構建新的數據集：

① 找出至少在 $600$ 張圖片中不為 0 的像素位置，並記錄哪些列被選做特徵。

② 創建一個用於存儲訓練圖像特徵的數組 train_image_feature，並將具有足夠非零元素的列複製到該數組的第 j 列。

③ 創建一個用於存儲測試圖像特徵的數組 test_image_feature，並使用與 train_image_feature 相同的索引從 test_images 中選擇相同的列作為特徵。

④ 在訓練集和測試集特徵向量的開頭添加常量 $1$ 作為偏置。

⑤ 返回處理後的訓練和測試數據 train_labels, test_labels, train_images, test_images。

def pretreat(train_labels, test_labels, train_images, test_images):
  """
  數據預處理函數：特徵選擇
  :param 原始數據集
  :return: 處理好的新數據集
  """
  # 將三維數組轉換為二維數組
  train_labels = train_labels.reshape(60000)
  test_labels = test_labels.reshape(10000)
  train_images = train_images.reshape(60000, 784)
  test_images = test_images.reshape(10000, 784)

  #  把標籤轉換為1和-1，數字為0則為1，否則為-1
  # np.where(condition,x,y):如果condition為真，返回x,否則返回y
  train_labels = np.where(train_labels == 0, 1, -1)
  test_labels = np.where(test_labels == 0, 1, -1)

  # 找出至少在600張圖片中不為0的像素位置
  j = 0
  index = []
  train_image_feature = np.zeros([60000, 493])  # 用於存儲訓練圖像的特徵
  for i in range(784):
      non_zero = np.linalg.norm(train_images[:, i], ord=0)  # 計算每一列中非零元素的個數
      if non_zero >= 600:
          train_image_feature[:, j] = train_images[:, i]  # 將具有足夠非零元素的列複製到train_image_feature的第j列
          j = j + 1
          index.append(i)  # 記錄哪些列被選做特徵
  test_image_feature = np.zeros([10000, 493])  # 創建一個零數組，用於存儲測試圖像的特徵
  test_image_feature = test_images[:, index]  # 使用 index 從 test_images 中選擇相同的列作為特徵，並賦值給 test_image_feature

  # 在訓練集和測試集特徵向量開頭添加常量1作為偏置
  train_images = np.insert(train_image_feature, 0, 1, axis=1)  
  test_images = np.insert(test_image_feature, 0, 1, axis=1)  

  return train_labels, test_labels, train_images, test_images

訓練分類模型及模型評估

採用矩陣分解的最小二乘問題求解算法如下：

① 計算矩陣 $A$ 的 $QR$ 分解，$A=QR$

② 計算 $Q^Tb$

③ 求解 $Rx=Q^Tb$，求得 $x$

函數clf_1(X_train, y_train, X_test, y_test):訓練分類模型及模型評估

函數tran1or_0(result):將結果中大於 $0$ 的數轉換為 $1$，小於 $0$ 的轉換為 $0$

函數show_result(labels, result, dataset):計算錯誤率及結果展示

def clf_1(X_train, y_train, X_test, y_test):
    """
    :param X_train: 訓練數據
    :param y_train: 訓練數據標籤
    :param X_test: 測試數據
    :param y_test: 測試數據標籤
    :return: 
    """
    # 開始訓練
    start1 = time.time()
    A = X_train
    A_test = X_test
    b_train = y_train
    b_test = y_test
    print('進行QR分解中...')
    q, r = np.linalg.qr(A)
    print('已完成QR分解')
    print('\n')
    print('對x進行求解中...')
    x = np.linalg.pinv(r).dot(q.T.dot(b_train))
    print('已求得x的解')
    result = A.dot(x)
    print('已完成結果預測')
    tran1or_0(result)
    end1 = time.time()
    # 結果展示分析
    show_result(train_labels, result, 'Train_dataset')
    result_test = A_test.dot(x)
    tran1or_0(result_test)
    show_result(test_labels, result_test, 'Test_dataset')

    print("運行時間:%.2f秒\n" % (end1 - start1))

2.1.2 結果展示#

2.2 添加 5000 個隨機特徵重新分類#

添加隨機特徵

利用隨機特徵法構造 隨機特徵矩陣：隨機矩陣 $R\in R_{5000\times 494}$，$R_{ij}$ 隨機取值 $\pm1$，用如下方法計算作為新特徵，與原來的特徵結合後一共 $5494$ 個特徵。

$\max(0,(R_{x})_j),\ j=1,2,3,\ldots,5000$

添加 $5000$ 個隨機特徵的實際意義是，通過 $\max$ 函數隨機的保留特徵，如果該特徵比 $0$ 大，就保留原來的特徵，如果隨機的是 $-1$ 比 $0$ 小，$\max$ 取 $0$，不保留這個特徵。

因此 $A$ 矩陣與 $R$ 矩陣的轉置相乘，得到了 $60000\times5000$ 的矩陣，拼接到 $A (60000\times494)$ 的後面變得到了新的 $A (60000\times 5494)$ 矩陣，剩下的步驟和上述相同。

2.2.1 具體步驟#

加載數據集與數據預處理同上
訓練分類模型及模型評估

函數clf_2(X_train, y_train, X_test, y_test):

def clf_2(X_train, y_train, X_test, y_test):
    """
    :param X_train: 訓練數據
    :param y_train: 訓練數據標籤
    :param X_test: 測試數據
    :param y_test: 測試數據標籤
    :return: 
    """
    Random_feature = np.random.choice([-1, 1], (5000, 494))
    # 開始訓練
    start2 = time.time()
    A = X_train
    A_test = X_test
    A_random = A.dot(Random_feature.T)
    A_random[A_random < 0] = 0
    A_add = np.hstack([X_train, A_random])
    b_train = y_train
    b_test = y_test
    print('進行QR分解中...')
    q_add, r_add = np.linalg.qr(A_add)
    print('已完成QR分解')
    print('對x進行求解中...')
    x_add = np.linalg.pinv(r_add).dot(q_add.T.dot(b_train))
    print('已求得x的解')
    result_add = A_add.dot(x_add)
    print('已完成結果預測')
    tran1or_0(result_add)
    end2 = time.time()
    # 結果展示分析
    show_result(train_labels, result_add, 'Train_dataset_ADD')
    A_random_test = np.dot(A_test, Random_feature.T)  ##10000*50000
    A_random_test[A_random_test < 0] = 0
    A_add_test = np.hstack([A_test, A_random_test])  ## 10000*5494
    result_add_test = A_add_test.dot(x_add)
    tran1or_0(result_add_test)
    show_result(test_labels, result_add_test, 'Test_dataset_ADD')

    print("運行時間:%.2f秒\n" % (end2 - start2))

2.2.2 結果展示#

3. 結果及分析#

	訓練集分類錯誤率	測試集分類錯誤率	運行時間 (s)
$494$ 個特徵數據集	$1.553%$	$1.580%$	$1.38$
$5494$ 個特徵數據集	$0.218%$	$0.220%$	$145.23$

準確率：增加 $5000$ 個隨機矩陣後，增加了模型的複雜度，模型可以學習到更泛化的特徵，提高模型的性能和泛化能力，從而在數據集上獲得更好的性能。從實驗結果可以看出，模型在訓練集和測試集上準確率都得到了很大的提升。
運行時間分析：增加隨機特徵後，將 $A$ 矩陣擴充，錯誤率顯著降低。但是運行時間很慢，原因是生成的隨機矩陣過大，約 $60000\times 5000=3$ 億個數據。

4. 完整代碼#

import numpy as np
import pandas as pd
import struct
import prettytable as pt
import time

# 訓練集文件
train_images_idx3_ubyte_file = r'C:\Users\路正橋\Documents\train-images.idx3-ubyte'
# 訓練集標籤文件
train_labels_idx1_ubyte_file = r'C:\Users\路正橋\Documents\train-labels.idx1-ubyte'

# 測試集文件
test_images_idx3_ubyte_file = r'C:\Users\路正橋\Documents\t10k-images.idx3-ubyte'
# 測試集標籤文件
test_labels_idx1_ubyte_file = r'C:\Users\路正橋\Documents\t10k-labels.idx1-ubyte'


def decode_idx3_ubyte(idx3_ubyte_file):
    """
    解析idx3文件的通用函數
    :param idx3_ubyte_file: idx3文件路徑
    :return: 數據集
    """
    # 讀取二進制數據
    bin_data = open(idx3_ubyte_file, 'rb').read()

    # 解析文件頭信息，依次為魔數、圖片數量、每張圖片高、每張圖片寬
    offset = 0
    fmt_header = '>iiii'  # 因為數據結構中前4行的數據類型都是32位整型，所以採用i格式，但我們需要讀取前4行數據，所以需要4個i。我們後面會看到標籤集中，只使用2個ii。
    magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)
    print('圖片數量: %d張, 圖片大小: %d*%d\n' % (num_images, num_rows, num_cols))

    # 解析數據集
    image_size = num_rows * num_cols
    offset += struct.calcsize(fmt_header)  # 獲得數據在緩存中的指針位置，從前面介紹的數據結構可以看出，讀取了前4行之後，指針位置（即偏移位置offset）指向0016。
    fmt_image = '>' + str(
        image_size) + 'B'  # 圖像數據像素值的類型為unsigned char型，對應的format格式為B。這裡還加上圖像大小784，是為了讀取784個B格式數據，如果沒有則只會讀取一個值（即一副圖像中的一個像素值）
    images = np.empty((num_images, num_rows, num_cols))
    for i in range(num_images):
        images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))
        # print(images[i])
        offset += struct.calcsize(fmt_image)

    return images


def decode_idx1_ubyte(idx1_ubyte_file):
    """
    解析idx1文件的通用函數
    :param idx1_ubyte_file: idx1文件路徑
    :return: 數據集
    """
    # 讀取二進制數據
    bin_data = open(idx1_ubyte_file, 'rb').read()

    # 解析文件頭信息，依次為魔數和標籤數
    offset = 0
    fmt_header = '>ii'
    magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)
    print('標籤數量: %d張\n' % (num_images))

    # 解析數據集
    offset += struct.calcsize(fmt_header)
    fmt_image = '>B'
    labels = np.empty(num_images)
    for i in range(num_images):
        labels[i] = struct.unpack_from(fmt_image, bin_data, offset)[0]
        offset += struct.calcsize(fmt_image)
    return labels


def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):
    """
    :param idx_ubyte_file: idx文件路徑
    :return: n*row*col維np.array對象，n為圖片數量
    """
    print("訓練圖片數據已解析：")
    return decode_idx3_ubyte(idx_ubyte_file)


def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):
    """
    :param idx_ubyte_file: idx文件路徑
    :return: n*1維np.array對象，n為圖片數量
    """
    print("訓練標籤數據已解析：")
    return decode_idx1_ubyte(idx_ubyte_file)


def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):
    """
    :param idx_ubyte_file: idx文件路徑
    :return: n*row*col維np.array對象，n為圖片數量
    """
    print("測試圖片數據已解析：")
    return decode_idx3_ubyte(idx_ubyte_file)


def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):
    """
    :param idx_ubyte_file: idx文件路徑
    :return: n*1維np.array對象，n為圖片數量
    """
    print("測試標籤數據已解析：")
    return decode_idx1_ubyte(idx_ubyte_file)


def pretreat(train_labels, test_labels, train_images, test_images):
    """
    數據預處理函數：特徵選擇
    :param 原始數據集
    :return: 處理好的新數據集
    """
    # 將三維數組轉換為二維數組
    train_labels = train_labels.reshape(60000)
    test_labels = test_labels.reshape(10000)
    train_images = train_images.reshape(60000, 784)
    test_images = test_images.reshape(10000, 784)

    #  把標籤轉換為1和-1，數字為0則為1，否則為-1
    # np.where(condition,x,y):如果condition為真，返回x,否則返回y
    train_labels = np.where(train_labels == 0, 1, -1)
    test_labels = np.where(test_labels == 0, 1, -1)

    # 找出至少在600張圖片中不為0的像素位置
    j = 0
    index = []
    train_image_feature = np.zeros([60000, 493])  # 用於存儲訓練圖像的特徵
    for i in range(784):
        non_zero = np.linalg.norm(train_images[:, i], ord=0)  # 計算每一列中非零元素的個數
        if non_zero >= 600:
            train_image_feature[:, j] = train_images[:, i]  # 將具有足夠非零元素的列複製到train_image_feature的第j列
            j = j + 1
            index.append(i)  # 記錄哪些列被選做特徵
    test_image_feature = np.zeros([10000, 493])  # 創建一個零數組，用於存儲測試圖像的特徵
    test_image_feature = test_images[:, index]  # 使用 index 從 test_images 中選擇相同的列作為特徵，並賦值給 test_image_feature

    # 在訓練集和測試集特徵向量開頭添加常量1作為偏置
    train_images = np.insert(train_image_feature, 0, 1, axis=1)
    test_images = np.insert(test_image_feature, 0, 1, axis=1)

    return train_labels, test_labels, train_images, test_images


def clf_1(X_train, y_train, X_test, y_test):
    """
    :param X_train: 訓練數據
    :param y_train: 訓練數據標籤
    :param X_test: 測試數據
    :param y_test: 測試數據標籤
    :return: 
    """
    # 開始訓練
    start1 = time.time()
    A = X_train
    A_test = X_test
    b_train = y_train
    b_test = y_test
    print('進行QR分解中...')
    q, r = np.linalg.qr(A)
    print('已完成QR分解')
    print('\n')
    print('對x進行求解中...')
    x = np.linalg.pinv(r).dot(q.T.dot(b_train))
    print('已求得x的解')
    result = A.dot(x)
    print('已完成結果預測')
    tran1or_0(result)
    end1 = time.time()
    # 結果展示分析
    show_result(train_labels, result, 'Train_dataset')
    result_test = A_test.dot(x)
    tran1or_0(result_test)
    show_result(test_labels, result_test, 'Test_dataset')

    print("運行時間:%.2f秒\n" % (end1 - start1))


def clf_2(X_train, y_train, X_test, y_test):
    """
    :param X_train: 訓練數據
    :param y_train: 訓練數據標籤
    :param X_test: 測試數據
    :param y_test: 測試數據標籤
    :return: 
    """
    Random_feature = np.random.choice([-1, 1], (5000, 494))
    # 開始訓練
    start2 = time.time()
    A = X_train
    A_test = X_test
    A_random = A.dot(Random_feature.T)
    A_random[A_random < 0] = 0
    A_add = np.hstack([X_train, A_random])
    b_train = y_train
    b_test = y_test
    print('進行QR分解中...')
    q_add, r_add = np.linalg.qr(A_add)
    print('已完成QR分解')
    print('對x進行求解中...')
    x_add = np.linalg.pinv(r_add).dot(q_add.T.dot(b_train))
    print('已求得x的解')
    result_add = A_add.dot(x_add)
    print('已完成結果預測')
    tran1or_0(result_add)
    end2 = time.time()
    # 結果展示分析
    show_result(train_labels, result_add, 'Train_dataset_ADD')
    A_random_test = np.dot(A_test, Random_feature.T)  ##10000*50000
    A_random_test[A_random_test < 0] = 0
    A_add_test = np.hstack([A_test, A_random_test])  ## 10000*5494
    result_add_test = A_add_test.dot(x_add)
    tran1or_0(result_add_test)
    show_result(test_labels, result_add_test, 'Test_dataset_ADD')

    print("運行時間:%.2f秒\n" % (end2 - start2))


def tran1or_0(result):
    """
    將大於0的數轉換為1，小於0的轉換為0
    :param
    :return:
    """
    result[result > 0] = 1
    result[result < 0] = -1
    result.reshape(len(result), 1)
    return result


def show_result(labels, result, dataset):
    """
    計算錯誤率及結果展示
    :param labels: 真實數據標籤
    :param result: 預測得到的數據標籤
    :param dataset: 數據名稱（train or test）
    :return: 
    """
    TP = 0  # 正類預測為正類
    FN = 0  # 正類預測為負類
    FP = 0  # 負類預測為正類
    TN = 0  # 負類預測為負類
    for i in range(len(labels)):
        if labels[i] == 1 and result[i] == 1:
            TP = TP + 1
        elif labels[i] == 1 and result[i] == -1:
            FN = FN + 1
        elif labels[i] == -1 and result[i] == 1:
            FP = FP + 1
        elif labels[i] == -1 and result[i] == -1:
            TN = TN + 1

    data = {
        dataset: ['y=+1', 'y=-1', 'All'],
        'Predicted y=+1': [TP, FN, TP + FN],
        'Prediected y=-1': [FP, TN, FP + TN],
        'Total': [TP + FP, FN + TN, TP + FP + FN + TN]
    }
    df = pd.DataFrame(data)
    # 設置表格樣式
    styled_df = df.style \
        .set_properties(**{'text-align': 'center', 'border': '1px solid black'}) \
        .set_table_styles([{'selector': 'th', 'props': [('border', '1px solid black')]}]) \
        .set_caption(dataset + '預測結果：')
    # 輸出表格
    display(styled_df)

    error = (FN + FP) / (TP + FP + FN + TN) * 100
    print('錯誤率為', "%.3f" % error, '%')
    print('\n')


if __name__ == '__main__':
    # 加載原始數據
    train_images = load_train_images()
    train_labels = load_train_labels()
    test_images = load_test_images()
    test_labels = load_test_labels()
    # 得到數組形式的數據，此時數組為三維的
    [train_labels, test_labels, train_images, test_images] = pretreat(train_labels, test_labels, train_images,
                                                                      test_images)
    # 訓練模型得出結果
    clf_1(train_images, train_labels, test_images, test_labels)
    # 增加5000個隨機特徵後重新訓練模型，評估模型
    print("增加5000個隨機特徵後的結果：\n")
    clf_2(train_images, train_labels, test_images, test_labels)