1. Purpose and Requirements#
Purpose: Apply the least squares model to an image classification problem on the MNIST handwritten digit database: distinguishing the digit $0$ from all other digits.
Introduction to the MNIST Database: The MNIST database is derived from data collected by the United States National Institute of Standards and Technology (NIST) and contains $70,000$ grayscale images of handwritten digits, each of size $28\times28$ ($784$ pixels). The image data has been flattened into vectors of $28 \times 28 = 784$ dimensions ($784$ features); the labels are stored as a one-dimensional vector (each element is the digit shown in the corresponding image).
Requirements: First, classify the digit $0$ against all other digits in the MNIST database using the least squares model, and report the model's accuracy on the training and testing sets; then add $5000$ random features, reclassify, and obtain the new accuracy; finally, compare the results of the two experiments and draw a conclusion.
2. Process and Results#
2.1 Classification Using the Least Squares Model#
The MNIST database is pre-split into two groups: a training set of $60,000$ images and a testing set of $10,000$ images.
- Feature Selection: To reduce dimensionality (high dimensionality leads to high computational load) and noise (many pixel values are $0$ or close to $0$ and carry little information), and to improve the efficiency and accuracy of the model, each image is represented by a $494$-dimensional feature vector: the first component is a constant $1$ (the intercept, or bias), and the remaining $493$ components are the pixels that are non-zero in at least $600$ training images. If the image shows the digit $0$, set $y = 1$; otherwise set $y = -1$.
- Model: The least squares problem is to minimize $\|Ax - b\|_2^2$ over $x$; since the overdetermined system $Ax = b$ generally has no exact solution, the least squares solution is $x = (A^TA)^{-1}A^Tb$ (assuming $A$ has full column rank).
  - The matrix $A$ is the feature matrix constructed from the training data.
  - The vector $b$ holds the corresponding training-data digit labels.

Here, we do not solve the normal equations directly but instead compute the least squares solution via QR decomposition, which is more efficient and numerically stable.
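As a concrete illustration, here is a minimal sketch of the QR route on a tiny synthetic problem (the data and shapes are made up for the example):

```python
import numpy as np

# Tiny overdetermined system: 6 samples, 3 features (synthetic data for illustration)
rng = np.random.default_rng(0)
A = rng.random((6, 3))
b = rng.random(6)

# QR route: factor A = QR, then solve the triangular system R x = Q^T b
q, r = np.linalg.qr(A)              # reduced QR: q is 6x3, r is 3x3 upper triangular
x_qr = np.linalg.solve(r, q.T @ b)

# Agrees with the normal-equations solution (A^T A)^{-1} A^T b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_qr, x_ne)
```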
2.1.1 Specific Steps#
- Load the pre-split training and testing datasets from the MNIST database.
  This step uses the following functions (function names listed here; see the complete code at the end for details):
  - `decode_idx3_ubyte(idx3_ubyte_file)`: a general function to parse idx3 files.
  - `decode_idx1_ubyte(idx1_ubyte_file)`: a general function to parse idx1 files.
  - `load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file)`: returns the training data.
  - `load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file)`: returns the training data labels.
  - `load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file)`: returns the testing data.
  - `load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file)`: returns the testing data labels.
- Data Preprocessing:
  - Dimensionality Transformation: reshape the label and image arrays of the original dataset, converting the three-dimensional image array into a two-dimensional array (one flattened image per row).
  - Label Transformation: convert the labels to $1$ and $-1$: if the digit is $0$, the label becomes $1$; otherwise it becomes $-1$.
  - Feature Selection: identify pixel positions that are non-zero in at least $600$ training images, to improve the model's efficiency and accuracy. The feature matrix `train_images` then holds a $494$-dimensional feature vector per image, constructed as follows:
    ① Identify the pixel positions that are non-zero in at least $600$ images and record which columns are selected as features.
    ② Create an array `train_image_feature` to store the training image features and copy each selected column into the $j^{th}$ column of this array.
    ③ Create an array `test_image_feature` to store the testing image features, selecting the same columns from `test_images` using the recorded indices.
    ④ Prepend a constant $1$ to the feature vectors of both the training and testing sets as the bias term.
    ⑤ Return the processed training and testing data: `train_labels, test_labels, train_images, test_images`.
def pretreat(train_labels, test_labels, train_images, test_images):
"""
Data preprocessing function: Feature selection
    :param train_labels, test_labels, train_images, test_images: original dataset arrays
:return: Processed new dataset
"""
# Reshape the three-dimensional array into a two-dimensional array
train_labels = train_labels.reshape(60000)
test_labels = test_labels.reshape(10000)
train_images = train_images.reshape(60000, 784)
test_images = test_images.reshape(10000, 784)
# Convert labels to 1 and -1, if the digit is 0 then it is 1, otherwise -1
# np.where(condition,x,y): if condition is true, return x, otherwise return y
train_labels = np.where(train_labels == 0, 1, -1)
test_labels = np.where(test_labels == 0, 1, -1)
# Identify pixel positions that are not 0 in at least 600 images
j = 0
index = []
train_image_feature = np.zeros([60000, 493]) # To store training image features
for i in range(784):
non_zero = np.linalg.norm(train_images[:, i], ord=0) # Count the number of non-zero elements in each column
if non_zero >= 600:
train_image_feature[:, j] = train_images[:, i] # Copy columns with sufficient non-zero elements to the j-th column of train_image_feature
j = j + 1
index.append(i) # Record which columns are selected as features
    test_image_feature = test_images[:, index]  # Select the same columns from test_images as features using index
# Add a constant 1 at the beginning of the feature vectors for both training and testing sets as bias
train_images = np.insert(train_image_feature, 0, 1, axis=1)
test_images = np.insert(test_image_feature, 0, 1, axis=1)
return train_labels, test_labels, train_images, test_images
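For reference, the column-selection loop above can also be written in vectorized form; a minimal sketch (the helper name `select_features` is ours, not part of the report's code):

```python
import numpy as np

def select_features(train_images, test_images, min_nonzero=600):
    """Keep pixel columns that are non-zero in at least `min_nonzero` training
    images, then prepend a constant-1 bias column, mirroring pretreat()."""
    counts = np.count_nonzero(train_images, axis=0)  # non-zero count for each of the 784 columns
    index = np.flatnonzero(counts >= min_nonzero)    # indices of the selected columns
    train_feat = np.insert(train_images[:, index], 0, 1, axis=1)
    test_feat = np.insert(test_images[:, index], 0, 1, axis=1)
    return train_feat, test_feat, index
```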
- Train the classification model and evaluate it.
  The algorithm for solving the least squares problem via QR decomposition is:
  ① Compute the QR decomposition of the matrix $A$: $A = QR$.
  ② Compute $Q^Tb$.
  ③ Solve $Rx = Q^Tb$ to obtain $x$.
  - Function `clf_1(X_train, y_train, X_test, y_test)`: trains the classification model and evaluates it.
  - Function `tran1or_0(result)`: converts entries of the result greater than $0$ to $1$ and entries less than $0$ to $-1$.
  - Function `show_result(labels, result, dataset)`: calculates the error rate and displays the results.
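Since the predictions and labels are both $\pm1$ vectors, the error rate that `show_result` reports is just the fraction of mismatches; a minimal vectorized sketch (the helper name `error_rate` is ours, not part of the report's code):

```python
import numpy as np

def error_rate(labels, result):
    """Percentage of +/-1 predictions that disagree with the true labels."""
    return np.mean(labels != result) * 100
```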
def clf_1(X_train, y_train, X_test, y_test):
"""
:param X_train: Training data
:param y_train: Training data labels
:param X_test: Testing data
:param y_test: Testing data labels
:return:
"""
# Start training
start1 = time.time()
A = X_train
A_test = X_test
    b_train = y_train
print('Performing QR decomposition...')
q, r = np.linalg.qr(A)
print('QR decomposition completed')
print('\n')
print('Solving for x...')
    x = np.linalg.pinv(r).dot(q.T.dot(b_train))  # solve Rx = Q^T b for x
print('Solution for x obtained')
result = A.dot(x)
print('Result prediction completed')
tran1or_0(result)
end1 = time.time()
# Result display analysis
    show_result(y_train, result, 'Train_dataset')
result_test = A_test.dot(x)
tran1or_0(result_test)
    show_result(y_test, result_test, 'Test_dataset')
print("Execution time: %.2f seconds\n" % (end1 - start1))
2.1.2 Result Display#
2.2 Adding 5000 Random Features for Reclassification#
Adding Random Features
Utilizing the random feature method, construct a random matrix $R \in \mathbb{R}^{5000\times494}$, where each entry $R_{ij}$ takes a random value of $\pm1$; its action on the original features yields $5000$ new features, which combined with the original $494$ gives $5494$ features in total.
Concretely, each new feature is $\max(0, (Rx)_i)$: a random $\pm1$-weighted combination of the original features, clipped at zero by the $\max$ function. Combinations that come out positive are retained as-is, while negative combinations are set to $0$, so each new feature is a nonlinear (ReLU-type) function of the original ones.
Thus, the matrix $A$ ($60000\times494$) is multiplied by the transpose of $R$, giving a $60000\times5000$ matrix whose negative entries are set to $0$; this is concatenated with $A$ to form a new matrix of size $60000\times5494$, with the remaining steps the same as above.
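A minimal sketch of this feature map on scaled-down shapes (the sizes are toy stand-ins for the real $60000\times494$ and $5000\times494$ matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                # stands in for the 60000x494 feature matrix
R = rng.choice([-1, 1], size=(5, 4))  # random +/-1 matrix, stands in for the 5000x494 R
A_random = np.maximum(A @ R.T, 0)     # elementwise max(0, AR^T): ReLU of random combinations
A_add = np.hstack([A, A_random])      # original features plus the new random features
print(A_add.shape)                    # (6, 9): 4 original + 5 random features
```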
2.2.1 Specific Steps#
- Load the dataset and preprocess the data as above.
- Train the classification model and evaluate the model.
Function clf_2(X_train, y_train, X_test, y_test):
def clf_2(X_train, y_train, X_test, y_test):
"""
:param X_train: Training data
:param y_train: Training data labels
:param X_test: Testing data
:param y_test: Testing data labels
:return:
"""
Random_feature = np.random.choice([-1, 1], (5000, 494))
# Start training
start2 = time.time()
A = X_train
A_test = X_test
A_random = A.dot(Random_feature.T)
A_random[A_random < 0] = 0
A_add = np.hstack([X_train, A_random])
    b_train = y_train
print('Performing QR decomposition...')
q_add, r_add = np.linalg.qr(A_add)
print('QR decomposition completed')
print('Solving for x...')
    x_add = np.linalg.pinv(r_add).dot(q_add.T.dot(b_train))  # solve R x_add = Q^T b for x_add
print('Solution for x obtained')
result_add = A_add.dot(x_add)
print('Result prediction completed')
tran1or_0(result_add)
end2 = time.time()
# Result display analysis
    show_result(y_train, result_add, 'Train_dataset_ADD')
    A_random_test = np.dot(A_test, Random_feature.T)  # 10000x5000; apply the SAME Random_feature used for training
    A_random_test[A_random_test < 0] = 0
    A_add_test = np.hstack([A_test, A_random_test])  # 10000x5494
    result_add_test = A_add_test.dot(x_add)
    tran1or_0(result_add_test)
    show_result(y_test, result_add_test, 'Test_dataset_ADD')
print("Execution time: %.2f seconds\n" % (end2 - start2))
2.2.2 Result Display#
3. Results and Analysis#
| | Training Set Classification Error Rate | Testing Set Classification Error Rate | Execution Time (s) |
| --- | --- | --- | --- |
| $494$ Feature Dataset | $1.553\%$ | $1.580\%$ | $1.38$ |
| $5494$ Feature Dataset | $0.218\%$ | $0.220\%$ | $145.23$ |
- Accuracy: Adding the $5000$ random ReLU features makes the model nonlinear in the original pixel features and greatly increases its capacity, so it can fit decision boundaries that a purely linear model cannot. The experimental results show that the error rate drops substantially on both the training and testing sets, so both fit and generalization improve.
- Execution Time Analysis: After adding the random features, the matrix $A$ is much larger, and while the error rate decreases markedly, the execution time grows from about $1.4$ s to $145$ s. The random feature block alone contains $60000\times5000 = 3\times10^8$ entries, and the QR decomposition of an $m\times n$ matrix costs $O(mn^2)$ flops, so growing $n$ from $494$ to $5494$ multiplies the cost by roughly $(5494/494)^2 \approx 124$, of the same order as the observed slowdown.
4. Complete Code#
import numpy as np
import pandas as pd
import struct
import prettytable as pt
import time
from IPython.display import display  # show_result() renders styled DataFrames, so run under Jupyter/IPython
# Training set file
train_images_idx3_ubyte_file = r'C:\Users\路正桥\Documents\train-images.idx3-ubyte'
# Training set label file
train_labels_idx1_ubyte_file = r'C:\Users\路正桥\Documents\train-labels.idx1-ubyte'
# Testing set file
test_images_idx3_ubyte_file = r'C:\Users\路正桥\Documents\t10k-images.idx3-ubyte'
# Testing set label file
test_labels_idx1_ubyte_file = r'C:\Users\路正桥\Documents\t10k-labels.idx1-ubyte'
def decode_idx3_ubyte(idx3_ubyte_file):
"""
General function to parse idx3 files
:param idx3_ubyte_file: idx3 file path
:return: Dataset
"""
# Read binary data
bin_data = open(idx3_ubyte_file, 'rb').read()
# Parse file header information, including magic number, number of images, height and width of each image
offset = 0
    fmt_header = '>iiii'  # The header consists of four big-endian 32-bit integers ('i'), hence four i's; the label files below use only two ('>ii')
magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)
print('Number of images: %d, Image size: %d*%d\n' % (num_images, num_rows, num_cols))
# Parse dataset
image_size = num_rows * num_cols
    offset += struct.calcsize(fmt_header)  # Move the read offset past the 16-byte header to the first pixel
    fmt_image = '>' + str(image_size) + 'B'  # Each pixel is an unsigned byte ('B'); read all 784 pixels of one image at once rather than a single value
images = np.empty((num_images, num_rows, num_cols))
for i in range(num_images):
images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))
# print(images[i])
offset += struct.calcsize(fmt_image)
return images
def decode_idx1_ubyte(idx1_ubyte_file):
"""
General function to parse idx1 files
:param idx1_ubyte_file: idx1 file path
:return: Dataset
"""
# Read binary data
bin_data = open(idx1_ubyte_file, 'rb').read()
# Parse file header information, including magic number and number of labels
offset = 0
fmt_header = '>ii'
magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)
print('Number of labels: %d\n' % (num_images))
# Parse dataset
offset += struct.calcsize(fmt_header)
fmt_image = '>B'
labels = np.empty(num_images)
for i in range(num_images):
labels[i] = struct.unpack_from(fmt_image, bin_data, offset)[0]
offset += struct.calcsize(fmt_image)
return labels
def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):
"""
:param idx_ubyte_file: idx file path
:return: n*row*col dimensional np.array object, n is the number of images
"""
print("Training image data has been parsed:")
return decode_idx3_ubyte(idx_ubyte_file)
def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):
"""
:param idx_ubyte_file: idx file path
:return: n*1 dimensional np.array object, n is the number of images
"""
print("Training label data has been parsed:")
return decode_idx1_ubyte(idx_ubyte_file)
def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):
"""
:param idx_ubyte_file: idx file path
:return: n*row*col dimensional np.array object, n is the number of images
"""
print("Testing image data has been parsed:")
return decode_idx3_ubyte(idx_ubyte_file)
def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):
"""
:param idx_ubyte_file: idx file path
:return: n*1 dimensional np.array object, n is the number of images
"""
print("Testing label data has been parsed:")
return decode_idx1_ubyte(idx_ubyte_file)
def pretreat(train_labels, test_labels, train_images, test_images):
"""
Data preprocessing function: Feature selection
    :param train_labels, test_labels, train_images, test_images: original dataset arrays
:return: Processed new dataset
"""
# Reshape the three-dimensional array into a two-dimensional array
train_labels = train_labels.reshape(60000)
test_labels = test_labels.reshape(10000)
train_images = train_images.reshape(60000, 784)
test_images = test_images.reshape(10000, 784)
# Convert labels to 1 and -1, if the digit is 0 then it is 1, otherwise -1
# np.where(condition,x,y): if condition is true, return x, otherwise return y
train_labels = np.where(train_labels == 0, 1, -1)
test_labels = np.where(test_labels == 0, 1, -1)
# Identify pixel positions that are not 0 in at least 600 images
j = 0
index = []
train_image_feature = np.zeros([60000, 493]) # To store training image features
for i in range(784):
non_zero = np.linalg.norm(train_images[:, i], ord=0) # Count the number of non-zero elements in each column
if non_zero >= 600:
train_image_feature[:, j] = train_images[:, i] # Copy columns with sufficient non-zero elements to the j-th column of train_image_feature
j = j + 1
index.append(i) # Record which columns are selected as features
    test_image_feature = test_images[:, index]  # Select the same columns from test_images as features using index
# Add a constant 1 at the beginning of the feature vectors for both training and testing sets as bias
train_images = np.insert(train_image_feature, 0, 1, axis=1)
test_images = np.insert(test_image_feature, 0, 1, axis=1)
return train_labels, test_labels, train_images, test_images
def clf_1(X_train, y_train, X_test, y_test):
"""
:param X_train: Training data
:param y_train: Training data labels
:param X_test: Testing data
:param y_test: Testing data labels
:return:
"""
# Start training
start1 = time.time()
A = X_train
A_test = X_test
    b_train = y_train
print('Performing QR decomposition...')
q, r = np.linalg.qr(A)
print('QR decomposition completed')
print('\n')
print('Solving for x...')
    x = np.linalg.pinv(r).dot(q.T.dot(b_train))  # solve Rx = Q^T b for x
print('Solution for x obtained')
result = A.dot(x)
print('Result prediction completed')
tran1or_0(result)
end1 = time.time()
# Result display analysis
    show_result(y_train, result, 'Train_dataset')
result_test = A_test.dot(x)
tran1or_0(result_test)
    show_result(y_test, result_test, 'Test_dataset')
print("Execution time: %.2f seconds\n" % (end1 - start1))
def clf_2(X_train, y_train, X_test, y_test):
"""
:param X_train: Training data
:param y_train: Training data labels
:param X_test: Testing data
:param y_test: Testing data labels
:return:
"""
Random_feature = np.random.choice([-1, 1], (5000, 494))
# Start training
start2 = time.time()
A = X_train
A_test = X_test
A_random = A.dot(Random_feature.T)
A_random[A_random < 0] = 0
A_add = np.hstack([X_train, A_random])
    b_train = y_train
print('Performing QR decomposition...')
q_add, r_add = np.linalg.qr(A_add)
print('QR decomposition completed')
print('Solving for x...')
    x_add = np.linalg.pinv(r_add).dot(q_add.T.dot(b_train))  # solve R x_add = Q^T b for x_add
print('Solution for x obtained')
result_add = A_add.dot(x_add)
print('Result prediction completed')
tran1or_0(result_add)
end2 = time.time()
# Result display analysis
    show_result(y_train, result_add, 'Train_dataset_ADD')
    A_random_test = np.dot(A_test, Random_feature.T)  # 10000x5000; apply the SAME Random_feature used for training
    A_random_test[A_random_test < 0] = 0
    A_add_test = np.hstack([A_test, A_random_test])  # 10000x5494
    result_add_test = A_add_test.dot(x_add)
    tran1or_0(result_add_test)
    show_result(y_test, result_add_test, 'Test_dataset_ADD')
print("Execution time: %.2f seconds\n" % (end2 - start2))
def tran1or_0(result):
    """
    Threshold predictions in place: entries greater than 0 become 1, entries less than 0 become -1
    :param result: raw predictions A.dot(x)
    :return: the thresholded predictions
    """
    result[result > 0] = 1
    result[result < 0] = -1
    return result
def show_result(labels, result, dataset):
"""
Calculate error rate and display results
:param labels: True data labels
:param result: Predicted data labels
:param dataset: Dataset name (train or test)
:return:
"""
TP = 0 # True positives
FN = 0 # False negatives
FP = 0 # False positives
TN = 0 # True negatives
for i in range(len(labels)):
if labels[i] == 1 and result[i] == 1:
TP = TP + 1
elif labels[i] == 1 and result[i] == -1:
FN = FN + 1
elif labels[i] == -1 and result[i] == 1:
FP = FP + 1
elif labels[i] == -1 and result[i] == -1:
TN = TN + 1
    data = {
        dataset: ['y=+1', 'y=-1', 'All'],
        'Predicted y=+1': [TP, FP, TP + FP],
        'Predicted y=-1': [FN, TN, FN + TN],
        'Total': [TP + FN, FP + TN, TP + FP + FN + TN]
    }
df = pd.DataFrame(data)
# Set table style
styled_df = df.style \
.set_properties(**{'text-align': 'center', 'border': '1px solid black'}) \
.set_table_styles([{'selector': 'th', 'props': [('border', '1px solid black')]}]) \
.set_caption(dataset + ' Prediction Results:')
# Output table
display(styled_df)
error = (FN + FP) / (TP + FP + FN + TN) * 100
print('Error rate is', "%.3f" % error, '%')
print('\n')
if __name__ == '__main__':
# Load original data
train_images = load_train_images()
train_labels = load_train_labels()
test_images = load_test_images()
test_labels = load_test_labels()
    # Preprocess: flatten the images to 2-D, convert labels to +/-1, and select features
[train_labels, test_labels, train_images, test_images] = pretreat(train_labels, test_labels, train_images,
test_images)
# Train the model to obtain results
clf_1(train_images, train_labels, test_images, test_labels)
# Re-train the model after adding 5000 random features, evaluate the model
print("Results after adding 5000 random features:\n")
clf_2(train_images, train_labels, test_images, test_labels)