A first experience with Kaggle competitions

April 10, 2018    Kaggle NLP Machine learning

I recently completed my first Kaggle competition, and thought I'd share some thoughts on the experience. I have been aware of the Kaggle platform for quite some time, but hadn't found the time to jump into a competition until this year. Initially I thought I would just take a look at how the platform and competitions worked, try my hand at a few models, and not invest too much time or effort. However, things turned out quite differently.

Ultimately I ended up putting quite a lot of effort into the challenge, and spent several evenings over the past few weeks getting deep into machine learning techniques, models, theory, and code.

I finished in the top 3% of this competition, and I'm both happy and proud of that result.

The challenge was to correctly classify a set of Wikipedia comments against six toxicity labels: toxic, severe toxic, obscene, threat, insult, and identity hate. A comment could fall into multiple categories, or none at all.
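
For concreteness, here is roughly how the training data is laid out (a minimal sketch; the column names follow the competition's train.csv, and the file path is an assumption):

```python
import pandas as pd

# Each row is one comment, with a binary column per label; a comment can
# carry any subset of the six labels, including none at all.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("train.csv")   # assumed path to the competition data
X_text = train["comment_text"]     # raw comment strings
y = train[LABELS].values           # (n_samples, 6) matrix of 0/1 targets
```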

I soon realised that the Kaggle community is quite generous in providing guidance and examples in the form of kernels (code notebooks). Many people create these to give an introduction to a particular technique or model type.

The models I ended up using in my final ensemble included:

  • Long short-term memory (LSTM) recurrent neural networks (RNNs), using pre-trained word embeddings
  • Gated recurrent unit (GRU) RNNs, using the same embeddings
  • Logistic regression
  • Naive Bayes support vector machine (SVM)
  • Convolutional neural networks (CNN)
  • Gradient boosting

To support hyperparameter optimisation and bagging, I used the Keras framework; a minimal sketch of one of the recurrent models follows below.
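
To make the recurrent models above concrete, here is a minimal Keras sketch of a GRU variant. The vocabulary size, sequence length, and layer sizes are illustrative rather than the values I actually used, and in practice the embedding weights would be initialised from pre-trained word vectors:

```python
from keras.layers import (Input, Embedding, Bidirectional, GRU,
                          GlobalMaxPooling1D, Dense)
from keras.models import Model

MAX_WORDS = 50000   # vocabulary size (illustrative)
MAX_LEN = 200       # padded sequence length (illustrative)
EMBED_DIM = 300     # embedding dimension (illustrative)

# Input: padded sequences of word indices, one row per comment.
inp = Input(shape=(MAX_LEN,))
# Embedding layer; in practice initialised from pre-trained word vectors.
x = Embedding(MAX_WORDS, EMBED_DIM)(inp)
# A bidirectional GRU reads each comment left-to-right and right-to-left.
x = Bidirectional(GRU(64, return_sequences=True))(x)
# Max-pool over time to get a fixed-size representation of the comment.
x = GlobalMaxPooling1D()(x)
# Six independent sigmoid outputs: one probability per toxicity label.
out = Dense(6, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam")
```

The sigmoid outputs with binary cross-entropy are what make this a multi-label model: each of the six labels is predicted independently, so a comment can score highly on several at once.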

Sometimes in natural language processing (NLP) problems, preprocessing the text data can help. In this particular case, however, I found that some traditional preprocessing approaches can reduce the amount of information that can be derived from words or sequences of words. Ultimately, the only preprocessing I applied was to (a) translate non-English comments into English, and (b) correct the spelling of commonly misspelt offensive words.
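
As a rough illustration of step (b), the spelling correction can be done with a simple pattern-to-replacement mapping. The patterns below are hypothetical stand-ins, not my actual list:

```python
import re

# Hypothetical stand-in patterns; the real mapping targeted deliberately
# obfuscated spellings of offensive words (repeated letters, swaps, etc.).
CORRECTIONS = {
    r"\bidi+ot+\b": "idiot",
    r"\bstu+pi+d\b": "stupid",
}

def normalise(comment):
    """Lower-case a comment and collapse known misspellings."""
    comment = comment.lower()
    for pattern, replacement in CORRECTIONS.items():
        comment = re.sub(pattern, replacement, comment)
    return comment

print(normalise("You are an IDIIOTT"))  # -> "you are an idiot"
```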

The winners' write-up is very interesting. I highly recommend a read if you participated in this competition, or are considering a similar NLP problem.

One mistake I made early on was not preparing a strict hold-out set for validation. I had not anticipated becoming so heavily involved in this challenge, so at the beginning I was not sufficiently disciplined in my cross-validation technique. That certainly cost me some places on the leaderboard, and I've learnt from the experience.
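
For anyone starting out, the discipline I was missing amounts to just a few lines with scikit-learn (a sketch; the file path and the 10% split size are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")  # assumed path to the competition data

# Carve off a fixed hold-out set *before* any modelling, and leave it
# untouched while tuning.
dev, holdout = train_test_split(train, test_size=0.1, random_state=42)

# All cross-validation and hyperparameter search happens on `dev`;
# `holdout` is scored exactly once, just before the final submission.
```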

Ultimately, I enjoyed this competition immensely. The steep learning curve with some of these models and techniques often meant a deep dive into the academic papers that proposed them, sometimes published only a matter of months earlier. Then there was the endless testing and troubleshooting (e.g. getting CUDA and cuDNN to run on Fedora), and, at the end of it all, a nerve-wracking upload of the submission you believe best predicts comment toxicity.

But I think the most important thing is, as I said on the Kaggle forums:

I’ve learned more from practical, hands-on, hacking away at a concrete problem than I would have in six weeks of classes.

Now I'm looking forward to the next competition! If you're also interested in participating, let me know and we can discuss teaming up!