Solved: Titanic Dataset | Kaggle Competition

pic credit: Unsplash

The Titanic Kaggle competition is one of the "Getting Started" competitions for data science and machine learning practitioners. It's an open competition, and the dataset is famous enough that even ChatGPT knows it.

I trained several machine learning models on the dataset. This blog walks through the preprocessing steps I used, the models I trained, and how their performance compares. Without further ado, let's begin.

About the competition

The objective of the competition is to build a model that can predict if a passenger onboard the Titanic would have survived given some data points about each passenger.

The model takes into account features such as gender, age, and ticket class, and predicts whether the passenger survived.
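To make that concrete, here is a toy illustration. The passenger values below are made up, and `preprocess` and `clf` are placeholders for the preprocessing steps and any of the classifiers trained later in this post:

import pandas as pd

# A hypothetical passenger, described with the raw columns from the data dictionary below
passenger = pd.DataFrame([{
    'Pclass': 1, 'Sex': 'female', 'Age': 29,
    'SibSp': 0, 'Parch': 0, 'Fare': 80.0, 'Embarked': 'C'
}])

# After the same preprocessing used for training, a fitted classifier
# returns a 0/1 survival prediction:
# clf.predict(preprocess(passenger))  # -> array([1]), where 1 = survived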

Exploring the dataset

As with any machine learning model building, I started with exploratory data analysis (EDA) to get a feel for the dataset. Here's my Kaggle notebook if you want to follow along.

import pandas as pd

# read data
train_data_df = pd.read_csv('/kaggle/input/titanic/train.csv')
# check data
train_data_df.head()

To understand the features provided in the dataset, we can use the data dictionary provided by Kaggle.

Variable | Definition | Key
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

# check shape
train_data_df.shape

Output: (891, 12)

Check relationships among the features

import matplotlib.pyplot as plt

# Create a grid of subplots with 2 rows and 3 columns
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(10, 5))

axes[0][0].hist(train_data_df[train_data_df.Survived == 1]['Pclass'])
axes[0][0].set_title('survived Class')

axes[0][1].hist(train_data_df[train_data_df.Survived != 1]['Pclass'])
axes[0][1].set_title('non-survived Class')

axes[0][2].hist(train_data_df[train_data_df.Survived == 1]['Age'])
axes[0][2].set_title('survived Age')

axes[1][0].hist(train_data_df[train_data_df.Survived != 1]['Age'])
axes[1][0].set_title('non-survived Age')

axes[1][1].hist(train_data_df[train_data_df.Survived == 1]['SibSp'])
axes[1][1].set_title('survived SibSp')

axes[1][2].hist(train_data_df[train_data_df.Survived != 1]['SibSp'])
axes[1][2].set_title('non-survived SibSp')

plt.tight_layout()
plt.show()

Preprocessing

# check for null values
train_data_df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
# impute missing Age values with the median
train_data_df['Age'] = train_data_df['Age'].fillna(train_data_df['Age'].median())
# check for null values
train_data_df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# drop cabin column
train_data_df.drop('Cabin',axis=1,inplace=True)
# drop name column, ticket column
train_data_df.drop(["Name","Ticket"],axis=1,inplace=True)
# drop passenger id
train_data_df.drop(["PassengerId"],axis=1,inplace=True)
# convert age as int
train_data_df['Age'] = train_data_df['Age'].astype(int)
# one hot encode Embarked column and drop it from the original df
encoded_df = pd.get_dummies(train_data_df['Embarked'], prefix='Embarked')
df_encoded = pd.concat([train_data_df, encoded_df], axis=1)
df_encoded = df_encoded.drop('Embarked', axis=1)
# one hot encode Pclass column and drop it from the original df
encoded_df = pd.get_dummies(df_encoded['Pclass'], prefix='Pclass')
df_encoded = pd.concat([df_encoded, encoded_df], axis=1)
df_encoded = df_encoded.drop('Pclass', axis=1)
# one hot encode Sex column and drop it from the original df
encoded_df = pd.get_dummies(df_encoded['Sex'], prefix='Sex')
df_encoded = pd.concat([df_encoded, encoded_df], axis=1)
df_encoded = df_encoded.drop('Sex', axis=1)
df_encoded.head()

# Splitting the data into features and target variable
X = df_encoded.drop('Survived', axis=1)  # Features
y = df_encoded['Survived']  # Target variable
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Below are the different models I trained with their respective accuracy scores. At the end, we have a comparison table.

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# Create a logistic regression model
# lr_model = LogisticRegression()
lr_model = LogisticRegression(penalty='l2', solver='lbfgs', random_state=42)

# Train the model
lr_model.fit(X_train, y_train)

# Predict on the test set
y_pred = lr_model.predict(X_test)

# Calculate accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.7988826815642458

Plotting the ROC Curve

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predict probabilities for the positive class
y_pred_probs = lr_model.predict_proba(X_test)[:, 1]

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

# Calculate the AUC score
auc_score = roc_auc_score(y_test, y_pred_probs)

# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve (AUC = {:.2f})'.format(auc_score))
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

Artificial Neural Network

import tensorflow as tf

# Build the ANN model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_shape=(X_train.shape[1],)),
    tf.keras.layers.BatchNormalization(),  # Add batch normalization layer
    tf.keras.layers.Activation('relu'),  # Apply activation function after batch normalization
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test))
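
The fit call above prints validation accuracy per epoch. To get a single test-set number comparable to the other models, here is a minimal evaluation sketch (the original notebook may have computed this differently):

# Evaluate the ANN on the held-out split; Keras returns [loss, accuracy]
loss, ann_accuracy = model.evaluate(X_test, y_test, verbose=0)
print('ANN test accuracy:', ann_accuracy)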

Decision Tree Classifier

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Train the classifier
dt_classifier.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = dt_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Accuracy: 0.7821229050279329

Random Forest Classifier with AdaBoost

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest classifier and boost it with AdaBoost
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
# Note: from scikit-learn 1.2 onwards, the `base_estimator` argument is called `estimator`
rf_classifier = AdaBoostClassifier(base_estimator=rf_classifier, n_estimators=100, random_state=42)

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Accuracy: 0.8100558659217877

Random Forest with Grid Search

Creating new features from the dataset

# Re-read the raw training data so that Name, Ticket and Cabin are still available
df = pd.read_csv('/kaggle/input/titanic/train.csv')

# Perform feature engineering and preprocessing
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.')
df['CabinDeck'] = df['Cabin'].str[0]
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 30, 50, float('inf')], labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'])
df['FarePerPerson'] = df['Fare'] / (df['SibSp'] + df['Parch'] + 1)
df['TicketPrefix'] = df['Ticket'].str.split().str[0].str.replace(".", "").str.replace("/", "").str.upper()
# Select relevant features and target variable
features = ['Pclass', 'Sex', 'AgeGroup', 'FarePerPerson', 'Embarked', 'Title', 'CabinDeck','TicketPrefix']
target = 'Survived'
X = df[features]
y = df[target]

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform standard scaling on numeric features
# (work on copies so pandas does not raise SettingWithCopyWarning)
X_train = X_train.copy()
X_test = X_test.copy()
numeric_features = ['Pclass', 'FarePerPerson']
numeric_transformer = StandardScaler()
X_train[numeric_features] = numeric_transformer.fit_transform(X_train[numeric_features])
X_test[numeric_features] = numeric_transformer.transform(X_test[numeric_features])
# One-hot encode the categorical features
categorical_features = ['Sex', 'AgeGroup', 'Embarked', 'Title', 'CabinDeck', 'TicketPrefix']


# Combine the training and testing splits so that one-hot encoding produces identical dummy columns for both
combined_data = pd.concat([X_train, X_test])

# Apply one-hot encoding on combined data
combined_data_encoded = pd.get_dummies(combined_data, columns=categorical_features, drop_first=True)

# Split the combined data back into training and testing sets
X_train_encoded = combined_data_encoded[:len(X_train)]
X_test_encoded = combined_data_encoded[len(X_train):]
from sklearn.model_selection import GridSearchCV

# Define the model
model = RandomForestClassifier()

# Define hyperparameters to tune
parameters = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search to find the best model
grid_search = GridSearchCV(model, parameters, cv=5)
grid_search.fit(X_train_encoded, y_train)
# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test_encoded, y_test)
print("Best Model Accuracy:", accuracy)
Best Model Accuracy: 0.8268156424581006
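
GridSearchCV also records the winning hyperparameter combination and its mean cross-validated score. Here is a quick check using standard GridSearchCV attributes (not shown in the original notebook):

# Inspect the winning hyperparameters and the cross-validated accuracy
print('Best parameters:', grid_search.best_params_)
print('Best CV accuracy:', grid_search.best_score_)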

Model Comparison

Model | Accuracy score
Logistic regression | 79%
Artificial neural network | 79%
Decision tree | 78%
Random forest with AdaBoost | 81%
Random forest with new features and GridSearch | 82%

Conclusion

Overall, I found that all of the models land in a similar accuracy range. I experimented with several hyperparameters, such as the number of training epochs for the ANN and the number of trees in the random forests; a sketch of that kind of sweep is shown below.
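
For example (a sketch reusing the earlier train/test split, not the exact notebook code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Manual sweep over the number of trees, comparing accuracy on the held-out split
for n_trees in [50, 100, 200, 400]:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    rf.fit(X_train, y_train)
    print(n_trees, 'trees ->', accuracy_score(y_test, rf.predict(X_test)))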

The accuracy scores are pretty decent, and this is a good dataset to explore, practice on, and try out different machine learning models with.

I hope you found this blog useful. If you did, do leave a like; I'd really appreciate it.

Cheers,

Uday
