I recently completed my first Kaggle competition and thought I'd share some thoughts on the experience. I had been aware of the Kaggle platform for quite some time, but hadn't found the time to enter a competition until this year. Initially I thought I would just see how the competition and platform worked, try my hand at a few models, and not invest too much time or effort. However, things turned out quite differently.
Ultimately I put quite a lot of effort into the challenge, and spent several evenings over the past few weeks digging deep into machine learning techniques, models, theory, and code.
I finished in the top 3% of this competition, and I'm both happy and proud of that result.
The challenge was to classify a set of Wikipedia comments against six toxicity labels (toxic, severe toxic, insult, obscene, identity hate, and threat). This was a multi-label problem: a comment could carry several labels at once, or none at all.
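To make the multi-label setup concrete, here is a minimal sketch using a TF-IDF baseline with one binary classifier per label. The comments, labels, and model choice are illustrative assumptions, not the competition's actual data or my ensemble.

```python
# Multi-label sketch: each comment gets an independent 0/1 target per label,
# so a row of y can contain several 1s, or none at all.
# Toy data below is illustrative, not the competition's files.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

LABELS = ["toxic", "severe_toxic", "insult", "obscene", "identity_hate", "threat"]

comments = [
    "thanks for the helpful edit",
    "you are a vile disgusting idiot",
    "I will hurt you and your kind",
    "could you review my changes please",
]
# One binary column per label, in LABELS order.
y = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

X = TfidfVectorizer().fit_transform(comments)
# OneVsRestClassifier fits one logistic regression per label column.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
probs = clf.predict_proba(X)  # shape: (n_comments, n_labels)
print(probs.shape)
```

The competition scored per-label probabilities (mean column-wise ROC AUC), which is why a probabilistic output per label, rather than a single class prediction, is the natural shape for a submission.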
I soon realised that the Kaggle community is quite generous in providing guidance and examples in the form of kernels (code notebooks). Many people create these to give an introduction to a particular technique or model type.
The models I ended up using in my final ensemble included:
Sometimes in NLP (Natural Language Processing) problems, preprocessing the text can help. In this particular case, however, I found that some traditional preprocessing steps actually discard information carried by words or sequences of words. Ultimately, the only preprocessing I applied was to a) translate non-English comments into English, and b) correct the spelling of commonly misspelt offensive words.
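The spelling-correction step can be sketched as a simple word-level substitution applied before tokenisation. The mapping below is purely illustrative (the real list would be built by hand from frequent obfuscations in the training data), and the translation step is omitted here.

```python
# Sketch of light-touch preprocessing: fix deliberate misspellings of
# offensive words without otherwise altering the text. The mapping is
# a hypothetical example, not the actual list used.
import re

CORRECTIONS = {
    "stupidd": "stupid",
    "id1ot": "idiot",
    "h8": "hate",
}

def normalise(comment: str) -> str:
    # Replace each word-like token if it appears in the correction map;
    # leave everything else (case, punctuation, spacing) untouched.
    def fix(match: re.Match) -> str:
        word = match.group(0)
        return CORRECTIONS.get(word.lower(), word)
    return re.sub(r"\w+", fix, comment)

print(normalise("You stupidd id1ot, I h8 this"))
# → "You stupid idiot, I hate this"
```

Keeping the rest of the text intact matters here: aggressive steps like stemming or stop-word removal can erase exactly the signals (elongations, capitalisation, unusual tokens) that distinguish toxic comments.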
The winners' write-up is very interesting. I highly recommend a read if you participated in this competition, or are considering a similar NLP problem.
One mistake I made early on was not preparing a strict hold-out set for validation. I hadn't anticipated becoming so heavily involved in the challenge, so at the beginning I wasn't sufficiently disciplined about cross-validation. That certainly cost me some places on the leaderboard, and I've learnt from the experience.
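The discipline I skipped can be sketched in a few lines: carve off a hold-out set once, before any modelling, and never use it for tuning decisions. The data and variable names below are stand-ins, not the competition's files.

```python
# Sketch of a strict hold-out split. The features and labels are
# synthetic stand-ins for comment features and one toxicity label.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))     # stand-in for vectorised comments
y = rng.integers(0, 2, size=1000)   # stand-in for one binary label

# A fixed random_state keeps the same rows held out across experiments,
# and stratify preserves the label balance in both splits.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Tune and cross-validate on (X_dev, y_dev) only; score against the
# hold-out set once, just before submitting.
print(X_dev.shape, X_holdout.shape)
```

The point is that every model-selection decision made while peeking at the same validation data slowly overfits to it; an untouched hold-out set is the only honest local estimate of leaderboard performance.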
Ultimately I enjoyed this competition immensely. The steep learning curve with some of these models and techniques often meant a deep dive into the academic papers that proposed them, sometimes only a matter of months earlier. Then there was endless testing and troubleshooting (e.g. getting CUDA and cuDNN to run on Fedora), and finally a nerve-wracking upload of the submission you believe best predicts comment toxicity.
But I think the most important thing is, as I said on the Kaggle forums:
I’ve learned more from practical, hands-on, hacking away at a concrete problem than I would have in six weeks of classes.
Now I'm looking forward to the next competition! If you're also interested in participating, let me know and we can discuss teaming up!