Have you ever thought about entering a Kaggle competition but didn’t know where to start? Or maybe you visited the site, saw how much was on it, and felt overwhelmed. Don’t worry, we have got you covered. If you don’t know what Kaggle is, read up on it first, because here we are only going to talk about how to get started in one specific machine learning competition on Kaggle: the Titanic machine learning competition. It is designed to be the best possible starting place for you.
We shall talk about:
- Why and how should you do machine learning competitions?
- What is this specific competition about?
- What should you do first?
- How can you improve your score once you have started?
Why should you compete in this competition?
In general, machine learning competitions are a really nice way to play with different techniques and methods, because all of the hard and kind of boring work of data science has already been done for you. The data has been cleaned, the metric has been chosen, there’s a really clear problem, and there’s a nice description of everything in the data, so you don’t have to figure that out yourself. You can just try different methods and see what works.
It’s also a really nice way to familiarize yourself with Kaggle and everything on it. Kaggle hosts competitions, of course, but in the course of a competition you might want some additional data and end up browsing Kaggle Datasets. You can write and run your code directly on Kaggle using Kaggle Notebooks and then submit from one of your notebooks. There is also a genuine community on the platform, which is especially good for new Kagglers: it’s a nice place to ask questions, answer questions, make some friends, and become part of the Kaggle community. And if competitions make you nervous, note that the leaderboard is cleared every three months. So if you’re playing around and you don’t get a great score, that’s fine, it’ll disappear.
What is this specific competition about?
Let’s dive into the challenge. The Titanic was a passenger ship that famously and tragically sank on its maiden voyage, and the majority of the people on board died. The competition asks you to build a model that predicts whether a passenger survived, based on factors that might have influenced survival. You get information on each passenger. For some passengers you are also told whether or not they survived, so you can use them to train your model; for the others you are not. Your job is to predict, for the passengers whose outcome is withheld, whether they survived.
What should you do first?
The first thing you need to do is accept the rules and join the competition. From there, you need to get the data, which is under the Data tab on the competition page. The data is split into two files. The first is the training data, which has all of the features plus the label, that is, whether or not each passenger survived. You’re going to use it for training and for evaluation with cross-validation. The second file has no labels; it is the data set you make predictions for and send back. The predictions you submit for this second file determine your leaderboard position.
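To make the two-file split concrete, here is a minimal sketch using pandas. The column names follow the competition's CSV layout, but the rows are made up for illustration and a few text columns (Name, Ticket, Cabin) are left out to keep it short. In a real run you would read the downloaded files with `pd.read_csv` instead of the inline strings.

```python
import io

import pandas as pd

# Made-up rows in the column layout of the competition's training file.
# With the real download you would call pd.read_csv("train.csv") instead.
TRAIN_CSV = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
1,0,3,male,22,1,0,7.25,S
2,1,1,female,38,1,0,71.28,C
3,1,3,female,,0,0,7.92,S
4,0,3,male,35,0,0,8.05,
"""

# Same layout without the Survived column: these are the rows to predict.
TEST_CSV = """PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
5,3,male,34.5,0,0,7.83,Q
6,2,female,,1,0,13.0,S
"""

train = pd.read_csv(io.StringIO(TRAIN_CSV))
test = pd.read_csv(io.StringIO(TEST_CSV))

# The label is the one column that exists in train but not in test.
only_in_train = [col for col in train.columns if col not in test.columns]

print("train shape:", train.shape)
print("test shape:", test.shape)
print("columns only in train:", only_in_train)
```

The only column that differs between the two frames is `Survived`, which is exactly what your model has to predict for the second file.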
- Once you have the data, the next step is to understand the problem thoroughly. You can read up on the Titanic to understand what happened. Then do some exploratory data analysis: are there missing values? Are there skewed fields? How are you going to deal with these things? This is how you get to know the data set.
- Now, start your modeling. Train your models, tune their hyperparameters to try and get better results, and then ensemble the models (combine their predictions) to get your first predictions.
- Once you are done, upload your predictions and you’ll get a score. You will be on the leaderboard, and your journey starts here.
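The steps above (explore, model, predict, submit) can be sketched end to end as follows. This is only one possible baseline, not a recommended solution: the passengers are randomly generated stand-ins, the helper names `make_fake_passengers` and `prepare` are hypothetical, and the random forest and median age fill are arbitrary first choices. The submission layout (a `PassengerId` column and a `Survived` column) and the accuracy metric match the competition.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)


def make_fake_passengers(n, with_labels, start_id=1):
    """Randomly generated stand-in rows (illustrative, not real Titanic data)."""
    df = pd.DataFrame({
        "PassengerId": np.arange(start_id, start_id + n),
        "Pclass": rng.integers(1, 4, size=n),
        "Sex": rng.choice(["male", "female"], size=n),
        "Age": rng.uniform(1, 70, size=n).round(),
        "Fare": rng.lognormal(mean=3.0, sigma=1.0, size=n).round(2),
    })
    df.loc[df.index % 5 == 0, "Age"] = np.nan  # simulate missing ages
    if with_labels:
        # Arbitrary rule plus noise so the model has something to learn.
        prob = np.where(df["Sex"] == "female", 0.75, 0.2)
        df.insert(1, "Survived", (rng.random(n) < prob).astype(int))
    return df


train = make_fake_passengers(200, with_labels=True)
test = make_fake_passengers(50, with_labels=False, start_id=201)

# Exploratory checks: missing values per column and skew of a numeric field.
missing = train.isna().sum()
fare_skew = train["Fare"].skew()
print(missing[missing > 0])
print("Fare skew:", fare_skew)


def prepare(df, age_fill):
    """Turn raw columns into a numeric feature table."""
    X = df[["Pclass", "Age", "Fare"]].copy()
    X["Age"] = X["Age"].fillna(age_fill)
    X["IsFemale"] = (df["Sex"] == "female").astype(int)
    return X


age_median = train["Age"].median()  # computed on the training data only
X_train, y_train = prepare(train, age_median), train["Survived"]
X_test = prepare(test, age_median)

# Estimate accuracy with 5-fold cross-validation before submitting anything.
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy:", cv_scores.mean())

# Fit on all labelled rows and predict the unlabelled ones.
model.fit(X_train, y_train)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
# Passing a path, e.g. submission.to_csv("submission.csv", index=False),
# would write the file to upload; here the CSV text is kept in memory.
csv_text = submission.to_csv(index=False)
```

Checking the cross-validation score first gives you a rough idea of what to expect before you spend a submission on the leaderboard.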
How can you improve your score once you have started?
This is the thing everybody wants to know: how do I improve my score on the Kaggle leaderboard?
The first thing you should do is learn about the data. As we said earlier, you need to understand the problem. The Titanic is a good case for this because the event happened in the past and no new data is being produced about it, so you can turn to historical sources to develop your understanding. You can then use that understanding to guide your experimentation, for example:
- Create new features (feature engineering) based on what you know about the data.
- Try different types of pre-processing: if you tried one method of filling in (imputing) missing values, try a different method and see if that changes your results.
- Try different types of machine learning models: a random forest, a support vector machine, a regression model, or many other types, and then ensemble them. The majority of Kaggle competitions are won by some sort of ensemble that combines multiple different models.
- Finally, and this is probably the most effective way to improve your score: learn from other folks who are doing the same competition. They are also just starting out and getting their bearings, so you can learn a lot from each other. People share lots of helpful code and ask and answer questions in the forums, and you can build your understanding as part of the community. That is how you become a full-fledged Kaggler and climb the rankings.
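The first three ideas in the list above can be tried together in a few lines of scikit-learn. The sketch below is illustrative only: the data is randomly generated, `FamilySize` (siblings/spouses plus parents/children plus the passenger) is just one commonly used engineered feature, logistic regression stands in for the "regression model", and the hypothetical helper `build_ensemble` combines the three models by majority vote.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n = 300

# Randomly generated stand-in data (not the real competition file).
train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "IsFemale": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 70, size=n).round(),
    "SibSp": rng.integers(0, 4, size=n),
    "Parch": rng.integers(0, 3, size=n),
})
train.loc[rng.random(n) < 0.2, "Age"] = np.nan  # simulate missing ages
survival_prob = 0.2 + 0.5 * train["IsFemale"] + 0.1 * (train["Pclass"] == 1)
train["Survived"] = (rng.random(n) < survival_prob).astype(int)

# Feature engineering: combine two raw columns into one new feature.
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1

features = ["Pclass", "IsFemale", "Age", "SibSp", "Parch", "FamilySize"]
X, y = train[features], train["Survived"]


def build_ensemble(impute_strategy):
    """Impute, scale, then let three different models vote on each passenger."""
    members = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ]
    voter = VotingClassifier(estimators=members, voting="hard")
    return make_pipeline(
        SimpleImputer(strategy=impute_strategy), StandardScaler(), voter
    )


# Compare two ways of filling in missing ages using cross-validated accuracy.
results = {}
for strategy in ["mean", "median"]:
    scores = cross_val_score(
        build_ensemble(strategy), X, y, cv=5, scoring="accuracy"
    )
    results[strategy] = scores.mean()

print(results)
```

Keeping the imputer inside the pipeline means the fill value is recomputed on each training fold, so the cross-validation comparison between strategies stays fair.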
Final Words
So these are the basic things you need to know to get started on any Kaggle competition. You can experiment and research to push the score of your models toward the top. Remember to build careful predictions with several different models and then ensemble them, because that is usually the best way to get a higher score on Kaggle.