Machine Learning Pipeline
See NN_Jupyter for the full neural network code.
Importing Libraries
import ast
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
This block imports all the libraries needed for data manipulation, visualization, preprocessing, modeling, and evaluation: pandas for data manipulation, matplotlib for plotting graphs, sklearn for data splitting, preprocessing, and metrics, and keras for neural network modeling.
Loading Data
data = pd.read_csv('output7.csv')
data.head()
Here, the code loads a CSV file into a pandas DataFrame and displays the first few rows with data.head().
Data Filtering
# Filter to only rows with labels that contain "Touch" or are "No_Gesture"
filtered_data = data[data['Label'].str.contains('Touch') | (data['Label'] == 'No_Gesture')]
print(filtered_data['Label'].unique())
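# .copy() makes data independent of the original frame, so later column assignments do not raise SettingWithCopyWarning
data = filtered_data.copy()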
This section filters the dataset to include only the rows where the Label column contains "Touch" or is exactly "No_Gesture". It then prints the unique labels that remain after filtering. This section was added once I decided to remove the swipe gestures from the current analysis. Much of the code is the same in the two versions (with and without swipes), so I assigned filtered_data back to data for my own convenience.
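The filter can be tried on a small frame before running it on the real data. This is only a sketch: the label strings below are hypothetical stand-ins for whatever output7.csv contains, and the na=False argument is an addition that the original code does not use. It makes missing labels count as non-matches.

```python
import pandas as pd

# Hypothetical toy labels; the real ones come from output7.csv.
toy = pd.DataFrame({
    "Label": ["Touch_1", "Swipe_Left", "No_Gesture", "Touch_2", None],
})

# na=False treats a missing label as "no match" instead of propagating
# a missing value into the boolean mask.
mask = toy["Label"].str.contains("Touch", na=False) | (toy["Label"] == "No_Gesture")

# .copy() returns an independent DataFrame rather than a view-like slice.
toy_filtered = toy[mask].copy()
kept_labels = toy_filtered["Label"].tolist()
print(kept_labels)
```

Only the rows whose label contains "Touch" or equals "No_Gesture" survive; the swipe row and the row with a missing label are dropped.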
Data Preprocessing
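scaler = StandardScaler()
# Sensor readings live in the columns whose names contain 'Channel'
sensor_columns = [col for col in data.columns if 'Channel' in col]
print(data.columns)
# Each cell holds [CurrentValue, CurrentBaseline, Delta]; keep only Delta (index 2)
for col in sensor_columns:
    data[col] = data[col].apply(lambda x: ast.literal_eval(x)[2] if isinstance(x, str) else x[2])
# Standardize every sensor column to zero mean and unit variance
data[sensor_columns] = scaler.fit_transform(data[sensor_columns])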
The code identifies the columns related to sensor channels, extracts the third element from each entry in those columns (the entries are string representations of lists), and then scales the columns using StandardScaler. The extraction step exists because of something of a mistake on my part: I was certain that saving the data as [CurrentValue, CurrentBaseline] per sensor would be useful, and I then added Delta to make it more human-readable and quickly observable. The resulting output was [CurrentValue, CurrentBaseline, Delta]. While I still believe it could be interesting to learn from the baselines as well as the current values (see Pros and Cons for further discussion), in this case I decided to pre-process the data to consider only Delta, the changed value.
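As an illustration of the extraction step, the sketch below parses one cell in the [CurrentValue, CurrentBaseline, Delta] format and returns the Delta. The helper name extract_delta and the numbers are hypothetical; the logic mirrors the lambda used in the preprocessing code.

```python
import ast

def extract_delta(cell):
    """Return the Delta (third element) from one sensor cell.

    The cell may be a string such as "[512, 500, 12]" (as read from CSV)
    or an already-parsed list.
    """
    values = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return values[2]

raw_cell = "[512, 500, 12]"  # hypothetical reading, not taken from the dataset
delta = extract_delta(raw_cell)
print(delta)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for parsing text read from a file.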
Label Encoding
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['Label'])
The LabelEncoder is used to encode string labels as integers.
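A small round trip shows what the encoder does. The label strings here are hypothetical. LabelEncoder sorts the distinct labels and assigns each its index in that sorted order, and inverse_transform maps the integers back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()

# Hypothetical labels; the real ones come from the 'Label' column.
codes = demo_encoder.fit_transform(["Touch_2", "No_Gesture", "Touch_1", "No_Gesture"])

classes = list(demo_encoder.classes_)                    # sorted distinct labels
decoded = list(demo_encoder.inverse_transform(codes))    # integers back to strings

print(classes)
print(codes.tolist())
```

The classes_ attribute is what the confusion matrix plot later uses for its tick labels.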
Data Splitting
X = data[sensor_columns]
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Here, the features (X) and target (y) are defined, and then the dataset is split into training (80%) and test (20%) sets.
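One possible variation, not what the code above does: fit the scaler on the training split only, so that test-set statistics do not influence preprocessing, and pass stratify so both splits keep the same class proportions. The sketch below uses synthetic data; X_demo, y_demo, and demo_scaler are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic stand-in for the Delta features
y_demo = np.repeat([0, 1], 50)                          # two balanced synthetic classes

# stratify keeps the class ratio identical in the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training data only, then reuse its mean and
# standard deviation to transform the test data.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)
```

Whether this changes the results noticeably depends on the dataset; it mainly matters for keeping the reported test accuracy an honest estimate.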
Model Definition
model = Sequential()
model.add(Dense(128, input_dim=len(sensor_columns), activation='relu', kernel_regularizer='l2'))
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu', kernel_regularizer='l2'))
model.add(Dense(len(np.unique(y)), activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
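This section creates a Sequential neural network with two hidden Dense layers (128 and 64 units, ReLU activation), using L2 weight regularization and dropout to limit overfitting, and a softmax output layer with one unit per class. The model is compiled with the Adam optimizer, sparse categorical cross-entropy loss, and accuracy as the metric. See Model Detail for further discussion.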
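To make the output layer and loss concrete, here is a NumPy sketch of softmax and sparse categorical cross-entropy. It is a simplified illustration, not Keras's implementation (Keras also clips probabilities for numerical safety), and the logits are made-up numbers. "Sparse" means the targets are integer class indices, such as the LabelEncoder output, rather than one-hot vectors.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability assigned to the true class index."""
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.2, 3.0]])  # made-up scores for two samples, three classes
probs = softmax(logits)
loss = sparse_categorical_crossentropy(np.array([0, 2]), probs)
print(probs.round(3), loss)
```

A model that spreads probability evenly over three classes would score a loss of ln(3), which is a useful baseline when reading the training log.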
Training the Model
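# Stop training once validation loss has not improved for 5 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2, callbacks=[early_stopping])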
The model is trained for up to 50 epochs with a batch size of 32, with 20% of the training data held out for validation. Early stopping halts training when the validation loss stops improving, to prevent overfitting.
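The patience rule can be sketched in plain Python. This is a simplified model of the idea behind EarlyStopping(monitor='val_loss', patience=5), not the Keras source; the function name stopping_epoch and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to beat its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6]
stop_at = stopping_epoch(losses)
print(stop_at)
```

In this made-up run the later loss of 0.6 is never reached, because training stops first. Keras's EarlyStopping also accepts restore_best_weights=True, which rolls the model back to the weights from its best epoch instead of keeping the weights from the last one.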
Evaluation
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy*100:.2f}%')
The trained model is evaluated on the test set, and the accuracy is printed out.
Plot Training History
plt.plot(history.history['accuracy'], label='Training accuracy')
plt.plot(history.history['val_accuracy'], label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.show()
This plots the training and validation accuracy of the model over the epochs.
Prediction and Reporting
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
print(classification_report(y_test, y_pred_classes))
The model makes predictions on the test set, converts probabilities to class labels, and prints out a classification report.
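The conversion from probabilities to class labels can be checked by hand on a tiny array. The probabilities and class names below are hypothetical; in the pipeline they would come from model.predict and label_encoder.classes_.

```python
import numpy as np

# Hypothetical softmax outputs for two samples over three classes.
probs_demo = np.array([[0.1, 0.7, 0.2],
                       [0.8, 0.1, 0.1]])

# argmax along axis=1 picks the most probable class index for each row.
pred_classes_demo = np.argmax(probs_demo, axis=1)

# Map the integer indices back to readable names (hypothetical class list).
class_names = np.array(["No_Gesture", "Touch_1", "Touch_2"])
pred_names_demo = class_names[pred_classes_demo].tolist()
print(pred_classes_demo.tolist(), pred_names_demo)
```

classification_report also accepts a target_names argument, so passing label_encoder.classes_ there would print gesture names instead of integer codes in the report.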
Confusion Matrix
cm = confusion_matrix(y_test, y_pred_classes)
plt.figure(figsize=(10,10))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = np.arange(len(np.unique(y)))
plt.xticks(tick_marks, label_encoder.classes_, rotation=45)
plt.yticks(tick_marks, label_encoder.classes_)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
This plots the confusion matrix, which shows, for each true class, how many samples were predicted as each class. Correct predictions lie on the diagonal and misclassifications lie off it.
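A toy example helps when reading the matrix. The label arrays below are made up. In scikit-learn's confusion_matrix, rows are true classes and columns are predicted classes, so dividing the diagonal by the row sums gives per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted class indices for six samples.
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Diagonal = correct predictions; row sum = number of samples of that true class.
per_class_recall = cm_demo.diagonal() / cm_demo.sum(axis=1)
print(cm_demo)
print(per_class_recall)
```

Here class 1 is always recognized, while classes 0 and 2 are each confused with another class half the time.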