ML: Feature engineering (Titanic pt.3)

Another technique we previously mentioned that can help improve our predictions is feature engineering. In today's article we will take a look at how to apply it to the Titanic dataset.

This article is part of the Titanic series - a short series on basic ML concepts based on the famous Titanic Kaggle challenge.

We will build on our basic model from part 1, so head there in case you missed it.

age

We will start with a basic example - creating a new feature from the Age column. Our model might benefit from converting the Age data from a continuous numerical type to an ordinal categorical one.

To do so, we will fill the empty Age rows with -0.5 to mark this special case, and we will define cut points and label names to create a new categorical column from the Age data:

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager", "YoungAdult", "Adult", "Senior"]

def process_age(df, cut_points, label_names):
    # Mark missing ages with -0.5 so they fall into the "Missing" bin
    df["Age"] = df["Age"].fillna(-0.5)
    # Bin the continuous ages into ordered categories
    df["AgeCategory"] = pd.cut(df["Age"], cut_points, labels=label_names)
    return df

train = process_age(train, cut_points, label_names)
test = process_age(test, cut_points, label_names)

This results in a new column “AgeCategory” containing the respective label strings. To use it in our model we need to encode it as numbers.

train["AgeCategory"] = pd.factorize(train["AgeCategory"])[0]                    
test["AgeCategory"] = pd.factorize(test["AgeCategory"])[0]                                  

Using this new categorical feature is simple: we just swap the column name in the list of columns used to train our model.

columns = ["Fare", "Sex", "Pclass", "AgeCategory"]                                              

Training the model this way does not improve its performance - on the contrary, it drops by roughly 3%. It seems the detail lost from the continuous numerical data hurts our logistic classifier. We could change and/or tweak the classifier, but we will keep it as it is so the results stay easy to compare.
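For reference, here is a minimal sketch of how such a comparison can be made with scikit-learn's cross_val_score. The evaluate() helper is hypothetical, and it assumes the Sex column was already numerically encoded back in part 1:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical helper: mean cross-validated accuracy of a logistic
# regression trained on the given feature columns
def evaluate(df, columns):
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, df[columns], df["Survived"], cv=10)
    return scores.mean()

print(evaluate(train, ["Fare", "Sex", "Pclass", "Age"]))          # original feature
print(evaluate(train, ["Fare", "Sex", "Pclass", "AgeCategory"]))  # engineered feature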

We will discard the AgeCategory feature and fall back to using the original Age feature.
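The column list thus goes back to its original form; the unused AgeCategory column can simply stay in the dataframe:

columns = ["Fare", "Sex", "Pclass", "Age"]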

family size

Another, even simpler example is creating a completely new feature from the available data. Summing the Sibling/Spouse count (SibSp) and the Parent/Child count (Parch), and adding 1 to count the passenger themselves, gives us a family size. We can try using this in our model to see whether it correlates with survival.

train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
test["FamilySize"] = test["SibSp"] + test["Parch"] + 1

Using this feature in our model does not change its performance at all - but we may keep it for the future, as perhaps some other model will benefit from it.

title

The next example of feature engineering is extracting the title from the Name feature. To do so, we will define a list of titles to be extracted and create a new feature out of them.

def titles_in_name(name: str, titles: list):
    # Return the first title found in the name; order matters here,
    # e.g. "Mrs" must be checked before its substring "Mr"
    for title in titles:
        if title in name:
            return title
    return np.nan

title_list = ["Mrs", "Mr", "Master", "Miss", "Major", "Rev",
              "Dr", "Ms", "Mlle", "Col", "Capt", "Mme", "Countess",
              "Don", "Jonkheer"]
train["Title"] = train["Name"].map(lambda x: titles_in_name(x, title_list))
test["Title"] = test["Name"].map(lambda x: titles_in_name(x, title_list))

To use the extracted data we will group the titles into categories, as some of them represent a similar or even the same concept (e.g. equivalents in different languages - a bit of online searching helped with this).

def categorize_titles(person):
    title = person["Title"]

    # Group rare and foreign-language titles under the common ones
    if title in ["Don", "Major", "Capt", "Jonkheer", "Rev", "Col"]:
        return "Mr"
    elif title in ["Countess", "Mme"]:
        return "Mrs"
    elif title in ["Mlle", "Ms"]:
        return "Miss"
    elif title in ["Dr"]:
        # the Sex column holds lowercase values ("male"/"female")
        if person["Sex"] == "male":
            return "Mr"
        else:
            return "Mrs"
    else:
        return title

train["Title"] = train.apply(categorize_titles, axis=1)
test["Title"] = test.apply(categorize_titles, axis=1)
train["Title"] = pd.factorize(train["Title"])[0]
test["Title"] = pd.factorize(test["Title"])[0]

Using this new feature bumps the performance of our model slightly, by approximately 1%.

Note: Using the Title feature without using the FamilySize feature actually lowers the performance of the model.
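Putting it together, the feature list now includes both engineered features; reusing the hypothetical evaluate() helper sketched earlier, the comparison looks like this:

columns = ["Fare", "Sex", "Pclass", "Age", "FamilySize", "Title"]
print(evaluate(train, columns))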

conclusion

Feature engineering is not simple, but looking at the data thoroughly can result in improved predictions. It can be used to extract information from available data, to add insight into the dataset through human intervention (e.g. categorizing “Don” and “Major” as “Mr”), and even to reduce the number of features to improve the training performance of the model.

As always, the source code is available on GitHub.

Keep learning!
