# Solved: Titanic Dataset | Kaggle Competition

The Titanic Kaggle Competition is one of the "Getting Started" competitions for data science and machine learning practitioners. It's an open competition and the dataset is quite famous actually. Even ChatGPT knows it.

I trained several machine learning models on the dataset. This blog showcases the preprocessing steps I used, the models I trained and how their performance compares. Without further ado, let's begin.

The objective of the competition is to build a model that can predict if a passenger onboard the Titanic would have survived given some data points about each passenger.

The model would take into consideration data points such as gender, age, ticket class etc. and try to predict if the passenger had a better chance of survival.

# Exploring the dataset

As with any machine learning project, I started with exploratory data analysis (EDA) to get a feel for the dataset. Here's my Kaggle notebook if you wanna follow along.

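``````
# read data
# (the contents of this cell were lost; the file path below is an assumption)
import pandas as pd

train_data_df = pd.read_csv('/kaggle/input/titanic/train.csv')
``````
``````
# check data
# (the contents of this cell were lost; head() is a guess)
train_data_df.head()
``````

To understand the features provided in the dataset, we can use the data dictionary provided by Kaggle.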

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

``````
# check shape
train_data_df.shape
``````

Output: (891, 12)
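
Before plotting anything, a quick numeric check can already hint at which features matter. Here's a minimal sketch on a tiny made-up sample (`sample_df` and its values are hypothetical, only the column names match the Kaggle file). Because `Survived` is 0/1, the mean of each group is that group's survival rate.

```python
import pandas as pd

# Tiny made-up sample using the same column names as the Kaggle training file
sample_df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1],
})

# Survived is 0/1, so the group mean is the share of survivors in that group
rate_by_sex = sample_df.groupby("Sex")["Survived"].mean()
rate_by_class = sample_df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

On the real data you would swap `sample_df` for `train_data_df`.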

Check relationships among the features

``````
# Create a grid of subplots with 2 rows and 3 columns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(10, 5))

# Top row: survivors, bottom row: non-survivors (one column per feature)
axes[0, 0].hist(train_data_df[train_data_df.Survived == 1]['Pclass'])
axes[0, 0].set_title('survived Class')

axes[1, 0].hist(train_data_df[train_data_df.Survived != 1]['Pclass'])
axes[1, 0].set_title('non-survived Class')

axes[0, 1].hist(train_data_df[train_data_df.Survived == 1]['Age'])
axes[0, 1].set_title('survived Age')

axes[1, 1].hist(train_data_df[train_data_df.Survived != 1]['Age'])
axes[1, 1].set_title('non-survived Age')

axes[0, 2].hist(train_data_df[train_data_df.Survived == 1]['SibSp'])
axes[0, 2].set_title('survived SibSp')

axes[1, 2].hist(train_data_df[train_data_df.Survived != 1]['SibSp'])
axes[1, 2].set_title('non-survived SibSp')

plt.tight_layout()
plt.show()
``````

# Preprocessing

``````
# check for null values
train_data_df.isna().sum()
``````
``````
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
``````
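``````
# impute age with median
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
train_data_df['Age'] = train_data_df['Age'].fillna(train_data_df['Age'].median())
``````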
``````
# check for null values
train_data_df.isna().sum()
``````
``````
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
``````
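
To see what median imputation does on a small scale, here's a minimal sketch with a made-up `ages` Series (the values are hypothetical, not taken from the dataset). It also shows the integer cast that comes a bit later in the preprocessing, which truncates any fractional age.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0], name="Age")

# median() skips NaN values by default
median_age = ages.median()

# fillna returns a new Series, so the result is assigned rather than modified in place
filled = ages.fillna(median_age)

# Casting to int drops the fractional part
as_int = filled.astype(int)

print(median_age)
print(filled.tolist())
print(as_int.tolist())
```
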
``````
# drop cabin column
train_data_df.drop('Cabin',axis=1,inplace=True)
``````
``````
# drop name column, ticket column
train_data_df.drop(["Name","Ticket"],axis=1,inplace=True)
``````
``````
# drop passenger id
train_data_df.drop(["PassengerId"],axis=1,inplace=True)
``````
``````
# convert age as int
train_data_df['Age'] = train_data_df['Age'].astype(int)
``````
``````
# one hot encode Embarked column and drop it from the original df
encoded_df = pd.get_dummies(train_data_df['Embarked'], prefix='Embarked')
df_encoded = pd.concat([train_data_df, encoded_df], axis=1)
df_encoded = df_encoded.drop('Embarked', axis=1)
``````
``````
# one hot encode Class column and drop it from the original df
encoded_df = pd.get_dummies(df_encoded['Pclass'], prefix='Pclass')
df_encoded = pd.concat([df_encoded, encoded_df], axis=1)
df_encoded = df_encoded.drop('Pclass', axis=1)
``````
``````
# one hot encode Sex column and drop it from the original df
encoded_df = pd.get_dummies(df_encoded['Sex'], prefix='Sex')
df_encoded = pd.concat([df_encoded, encoded_df], axis=1)
df_encoded = df_encoded.drop('Sex', axis=1)
``````
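
The three encode, concat and drop steps can also be written as a single call. This is a sketch of an alternative, not what the notebook does: `pd.get_dummies` accepts a `columns` argument, encodes those columns and removes the originals in one go. `toy_df` is a hypothetical three-row frame used only for illustration.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the post
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Encode all three categorical columns at once; the original columns are dropped
encoded = pd.get_dummies(toy_df, columns=["Embarked", "Pclass", "Sex"])

print(list(encoded.columns))
```
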
``````
df_encoded.head()
``````
``````
# Splitting the data into features and target variable
X = df_encoded.drop('Survived', axis=1)  # Features
y = df_encoded['Survived']  # Target variable
``````
``````
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
``````
``````
# scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
``````
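
The scaler above is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. One way to make that hard to get wrong is a scikit-learn pipeline. The sketch below uses synthetic data (`X_demo`, `y_demo` and `pipe` are made-up names, not from the notebook) and is only meant to show the pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features: the label depends on the first column
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=30.0, scale=10.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 30.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only,
# then applies the same transformation when scoring on the test data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)

print(test_accuracy)
```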

Below are the different models I trained with their respective accuracy scores. At the end, we have a comparison table.

# Logistic Regression

``````
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a logistic regression model
# lr_model = LogisticRegression()
lr_model = LogisticRegression(penalty='l2', solver='lbfgs', random_state=42)

# Train the model
lr_model.fit(X_train, y_train)

# Predict on the test set
y_pred = lr_model.predict(X_test)

# Calculate accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
``````
``````
Accuracy: 0.7988826815642458
``````

Plotting the ROC Curve

``````
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predict probabilities for the positive class
y_pred_probs = lr_model.predict_proba(X_test)[:, 1]

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

# Calculate the AUC score
auc_score = roc_auc_score(y_test, y_pred_probs)

# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve (AUC = {:.2f})'.format(auc_score))
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
``````

# Artificial Neural Network

``````
import tensorflow as tf

# Build the ANN model
model = tf.keras.Sequential([
    # The input size is the number of feature columns
    tf.keras.layers.Dense(32, input_shape=(X_train.shape[1],)),
    tf.keras.layers.BatchNormalization(),  # Add batch normalization layer
    tf.keras.layers.Activation('relu'),  # Apply activation function after batch normalization
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
``````
``````
# Compile the model
# (the original call was lost; Adam with binary cross-entropy is an assumption for a sigmoid output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
``````
``````
# Train the model
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test))
``````

# Decision Tree Classifier

``````
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Train the classifier
dt_classifier.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = dt_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
``````
``````
Accuracy: 0.7821229050279329
``````

# Random Forest Classifier with Adaboost

``````
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42)

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
``````
``````
Accuracy: 0.8100558659217877
``````

# Random Forest with Grid Search

## Creating new features from the dataset

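``````
# Perform feature engineering and preprocessing
# df is assumed to be a fresh copy of the raw training data,
# since Name, Cabin and Ticket were dropped from train_data_df earlier
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# First character of the cabin number is the deck letter
df['CabinDeck'] = df['Cabin'].str[0]
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 30, 50, float('inf')], labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'])
df['FarePerPerson'] = df['Fare'] / (df['SibSp'] + df['Parch'] + 1)
# First token of the ticket, with "." and "/" removed
df['TicketPrefix'] = df['Ticket'].str.split().str[0].str.replace(".", "", regex=False).str.replace("/", "", regex=False).str.upper()
``````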
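
To make the string features above concrete, here's a small sketch that runs the same extraction logic on a few sample values. The `names`, `cabins` and `tickets` Series are hand-written examples in the dataset's format, and the variable names are made up for this illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])
cabins = pd.Series(["C85", None, "E46"])
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282"])

# Title: the word that sits between a space and a period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Deck: first character of the cabin number (missing cabins stay missing)
decks = cabins.str[0]

# Ticket prefix: first whitespace-separated token, punctuation removed, upper-cased
prefixes = (
    tickets.str.split().str[0]
    .str.replace(".", "", regex=False)
    .str.replace("/", "", regex=False)
    .str.upper()
)

print(titles.tolist())
print(decks.tolist())
print(prefixes.tolist())
```
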
``````
# Select relevant features and target variable
features = ['Pclass', 'Sex', 'AgeGroup', 'FarePerPerson', 'Embarked', 'Title', 'CabinDeck','TicketPrefix']
target = 'Survived'
X = df[features]
y = df[target]

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform standard scaling on numeric features
numeric_features = ['Pclass', 'FarePerPerson']
numeric_transformer = StandardScaler()
X_train[numeric_features] = numeric_transformer.fit_transform(X_train[numeric_features])
X_test[numeric_features] = numeric_transformer.transform(X_test[numeric_features])
``````
``````
# Perform one-hot encoding on categorical features
categorical_features = ['Sex', 'AgeGroup', 'Embarked', 'Title', 'CabinDeck', 'TicketPrefix']
categorical_transformer = OneHotEncoder()
# X_train_encoded = pd.get_dummies(X_train, columns=categorical_features, drop_first=True)
# X_test_encoded = pd.get_dummies(X_test, columns=categorical_features, drop_first=True)

# Combine the training and testing data
combined_data = pd.concat([X_train, X_test])

# Apply one-hot encoding on combined data
combined_data_encoded = pd.get_dummies(combined_data, columns=categorical_features, drop_first=True)

# Split the combined data back into training and testing sets
X_train_encoded = combined_data_encoded[:len(X_train)]
X_test_encoded = combined_data_encoded[len(X_train):]
``````
``````
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the model
model = RandomForestClassifier()

# Define hyperparameters to tune
parameters = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search to find the best model
grid_search = GridSearchCV(model, parameters, cv=5)
grid_search.fit(X_train_encoded, y_train)
``````
``````
# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test_encoded, y_test)
print("Best Model Accuracy:", accuracy)
``````
``````
Best Model Accuracy: 0.8268156424581006
``````
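
Besides the best estimator, a fitted `GridSearchCV` also exposes which hyperparameter combination won and its mean cross-validated score. The sketch below shows this on synthetic data with a deliberately small grid so it runs fast. `X_demo`, `y_demo`, `param_grid` and `search` are made-up names, and the values here say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data as a stand-in for the encoded features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A much smaller grid than the one in the post: 2 x 2 = 4 combinations
param_grid = {"n_estimators": [10, 20], "max_depth": [None, 3]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_demo, y_demo)

# Winning combination and its mean accuracy across the 3 folds
print(search.best_params_)
print(search.best_score_)

# One entry per combination that was tried
n_combinations = len(search.cv_results_["params"])
print(n_combinations)
```

For scale: the grid in this post has 3 × 3 × 3 = 27 combinations, so with `cv=5` that's 135 model fits before the final refit on the full training set.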

# Model Comparison

| Model | Accuracy score |
| --- | --- |
| logistic regression | 79% |
| artificial neural network | 79% |
| decision tree | 78% |
| random forest with AdaBoost | 81% |
| random forest with new features and GridSearch | 82% |

# Conclusion

Overall, I found that all of the models gave similar accuracy. I experimented with several hyperparameters, such as the number of epochs for the ANN and the number of trees in the random forests.
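
To put "similar accuracy" in perspective, the printed scores can be turned back into counts of correctly classified passengers. This is a back-of-the-envelope sketch: it assumes `train_test_split` rounds the 20% test share of 891 rows up to 179, and the names (`n_test`, `printed_scores`, `correct_counts`) are made up here. The ANN is left out because only its rounded score appears in the post.

```python
import math

n_total = 891                       # rows in the training file
n_test = math.ceil(0.2 * n_total)   # assumed size of the held-out split

# Accuracy values exactly as printed in the post
printed_scores = {
    "logistic regression": 0.7988826815642458,
    "decision tree": 0.7821229050279329,
    "random forest": 0.8100558659217877,
    "random forest + grid search": 0.8268156424581006,
}

# accuracy = correct / n_test, so correct = accuracy * n_test
correct_counts = {name: round(acc * n_test) for name, acc in printed_scores.items()}

for name, count in correct_counts.items():
    print(f"{name}: {count} of {n_test} correct")

spread = max(correct_counts.values()) - min(correct_counts.values())
print("Gap between best and worst model:", spread, "passengers")
```

So the gap between the best and the worst of these models comes down to a handful of passengers in the test split.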

The accuracy scores are pretty decent and this is a good dataset to explore, practice and play around with different machine-learning models.

I hope you found this blog useful. Do leave a like if you did, I'd highly appreciate it.

Cheers,

Uday