Machine Learning Pipeline

See NN_Jupyter for the full neural network code.

Importing Libraries

import ast
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

This block imports all the libraries needed for data manipulation, visualization, preprocessing, modeling, and evaluation: pandas for data manipulation, matplotlib for plotting graphs, scikit-learn for data splitting, preprocessing, and metrics, and Keras (through TensorFlow) for building the neural network.

Loading Data

data = pd.read_csv('output7.csv')
data.head()

Here, the code loads the CSV file into a pandas DataFrame and displays the first few rows with data.head().

Data Filtering

# Filter to only rows with labels that contain "Touch" or are "No_Gesture"
filtered_data = data[data['Label'].str.contains('Touch') | (data['Label'] == 'No_Gesture')]
print(filtered_data['Label'].unique())
data = filtered_data

This section filters the dataset to include only the rows where the Label column contains "Touch" or is exactly "No_Gesture", then prints the unique labels that remain after filtering. This section was added once I decided to remove the swipe gestures from the current analysis. Much of the code between the two versions (with swipes vs. without) remained the same, so I set data = filtered_data for my own convenience.
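
Since the "No_Gesture" class can easily outnumber the individual touch gestures, it can also be useful to check the class balance right after filtering. A minimal sketch, assuming the same data DataFrame:

# Count how many samples remain per label after filtering
print(data['Label'].value_counts())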

Data Preprocessing

scaler = StandardScaler()
sensor_columns = [col for col in data.columns if 'Channel' in col]
print(data.columns)

# Each cell is a string "[CurrentValue, CurrentBaseline, Delta]"; keep only the Delta
for col in sensor_columns:
    data[col] = data[col].apply(lambda x: ast.literal_eval(x)[2] if isinstance(x, str) else x[2])

data[sensor_columns] = scaler.fit_transform(data[sensor_columns])

The code identifies the columns related to sensor channels, extracts the third element from each entry (the entries are string representations of lists), and then scales those columns with StandardScaler. The stored format is somewhat of a mistake on my part: I was certain that saving [CurrentValue, CurrentBaseline] per sensor would be useful, and I later added Delta to make the output more human-readable and quickly observable. The resulting format was [CurrentValue, CurrentBaseline, Delta]. While I still believe it could be interesting to learn the baselines as well as the current value (see Pros and Cons for further discussion), in this case I decided to pre-process the data to keep only the changed value.
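
To make the parsing step concrete, here is a small sketch of what a single sensor cell looks like and how the Delta (third element) is extracted; the values are illustrative, not taken from output7.csv:

# A raw cell as stored in the CSV: "[CurrentValue, CurrentBaseline, Delta]" as a string
raw_cell = "[512, 498, 14]"          # illustrative values only
parsed = ast.literal_eval(raw_cell)  # -> [512, 498, 14]
delta = parsed[2]                    # keep only the Delta, as in the loop above
print(delta)                         # 14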

Label Encoding

label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['Label'])

The LabelEncoder is used to encode string labels as integers.
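
The mapping between gesture names and integer codes can be inspected after fitting, which is handy when reading the classification report and confusion matrix later:

# Show which integer each gesture label was assigned to
print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))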

Data Splitting

X = data[sensor_columns]
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, the features (X) and target (y) are defined, and then the dataset is split into training and test sets.
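
Since the split above is purely random, the gesture classes may not appear in the same proportions in the training and test sets. A stratified split is one alternative; a minimal sketch of the call, assuming the same X and y as above:

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)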

Model Definition

model = Sequential()
model.add(Dense(128, input_dim=len(sensor_columns), activation='relu', kernel_regularizer='l2'))
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu', kernel_regularizer='l2'))
model.add(Dense(len(np.unique(y)), activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

This section creates a Sequential neural network model with Dense layers, including L2 regularization and dropout for regularization, and compiles it with the Adam optimizer and accuracy metric. See Model Detail for further discussion.
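
A quick way to sanity-check the architecture is model.summary(), which prints each layer's output shape and parameter count; the exact numbers depend on how many sensor channels and gesture classes are present:

# Print layer-by-layer output shapes and trainable parameter counts
model.summary()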

Training the Model

early_stopping = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_split=0.2, callbacks=[early_stopping])

The model is trained for up to 50 epochs with a batch size of 32, with 20% of the training data held out for validation. The EarlyStopping callback stops training if the validation loss does not improve for 5 consecutive epochs, which helps prevent overfitting.
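
Note that by default EarlyStopping leaves the model with the weights from the last epoch rather than the best one. If that matters, restore_best_weights can be enabled; a sketch of the variant:

# Roll back to the weights from the epoch with the lowest validation loss when stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)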

Evaluation

loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy*100:.2f}%')

The trained model is evaluated on the test set, and the accuracy is printed out.

Plot Training History

plt.plot(history.history['accuracy'], label='Training accuracy')
plt.plot(history.history['val_accuracy'], label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.show()

This plots the training and validation accuracy of the model over the epochs.
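
The same history object also contains the loss curves, which are often the clearer signal for overfitting (training loss keeps falling while validation loss starts rising). A matching sketch:

# Plot training and validation loss over the epochs
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()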

Prediction and Reporting

y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
print(classification_report(y_test, y_pred_classes))

The model makes predictions on the test set, converts probabilities to class labels, and prints out a classification report.
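
The report above indexes the classes by their encoded integers; passing the original gesture names makes it easier to read. A small sketch using the fitted label_encoder:

# Show per-class metrics with the gesture names instead of integer codes
print(classification_report(y_test, y_pred_classes, target_names=label_encoder.classes_))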

Confusion Matrix

cm = confusion_matrix(y_test, y_pred_classes)
plt.figure(figsize=(10,10))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = np.arange(len(np.unique(y)))
plt.xticks(tick_marks, label_encoder.classes_, rotation=45)
plt.yticks(tick_marks, label_encoder.classes_)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

This plots the Confusion Matrix, which helps visualize predictions per class (specifically, how many are correct and incorrect).
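
When the classes have very different sample counts (for example, far more No_Gesture rows than touch rows), a row-normalized version of the same matrix can be easier to compare across classes. A sketch of that variant, reusing cm, tick_marks, and label_encoder from above:

# Normalize each row so the entries show per-class recall rather than raw counts
cm_normalized = cm.astype('float') / cm.sum(axis=1, keepdims=True)
plt.figure(figsize=(10,10))
plt.imshow(cm_normalized, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Normalized Confusion Matrix')
plt.colorbar()
plt.xticks(tick_marks, label_encoder.classes_, rotation=45)
plt.yticks(tick_marks, label_encoder.classes_)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()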