Solving a language identification problem with machine learning

(and deploying the solution to the AWS cloud)


Introduction

Artificial intelligence (AI) and machine learning (ML) are two of the most popular phrases today. They are touted as being the keys to unlocking the future, and solving the world's problems. But sometimes it's hard for people to understand exactly what kinds of problems these techniques can solve, how a computer can "learn", and what challenges exist in developing these solutions.

Traditional computer programs are very good at following instructions; that is, they do exactly what their programmers instruct them to do. A programmer tells a computer: if X is true, then do Y. And it will do it. A programmer can take a list of a million numbers and tell a computer to add them up, and it will do it in seconds.

This is what we might refer to as machine teaching, where we teach the computer exactly what we want it to do. However, what I am going to demonstrate here is machine learning: the process by which we have the computer learn for itself how to perform a task, through the observation of many data points.

Problem description

The types of problems typically solved with machine learning share a common characteristic: they are easy for a person to do, but hard to capture as an explicit set of rules for a computer to follow.

To demonstrate a practical example of how machine learning can be used to solve these types of problems, I have chosen the relatively simple task of language identification. I will use machine learning to build a model which identifies whether place names in New Zealand have a Maori origin or not. This is a task which is easy for any person to do (well, any Kiwi at least), but we will let the algorithm decide how to build the rules to complete it. Following my guidance, you should be able to address a similar problem in under an hour.

In case you're not interested in reading about the details of how to build the model, you can test it directly here!

Building the model

For this task, I'm going to use logistic regression. This is a technique which is both venerable (dating back to 1958) and easy to understand. I could have chosen to use more complex models (such as a neural network) to solve this problem, but it's not necessary to use a sledgehammer to crack a nut.

The objective of logistic regression is to create a model which takes an input variable (e.g. a place name) and predicts a binary output variable (e.g. whether or not it is a Maori name). One benefit of using logistic regression is that it copes reasonably well with unbalanced data, so we don't have to worry too much about the proportion of positive and negative cases in our dataset.
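As a rough sketch of what is going on under the hood, logistic regression computes a weighted sum of the input features and squashes it through the sigmoid function to produce a probability between 0 and 1. The feature values and weights below are purely illustrative, not anything learned from our data:

import numpy as np

def sigmoid(z):
    # Map any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative only: a weighted sum of hypothetical features for a name,
# plus an intercept, passed through the sigmoid to give P(Maori name)
features = np.array([0.3, 1.2, -0.4])   # hypothetical feature values
weights = np.array([0.8, 1.5, -0.2])    # hypothetical learned weights
intercept = -0.5

probability = sigmoid(np.dot(weights, features) + intercept)
print(round(probability, 3))            # a value between 0 and 1

During training, the algorithm searches for the weights that make these predicted probabilities match the labels in the training data as closely as possible.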

That brings us to the next task: collecting the dataset. Anyone who has done any kind of data analytics will know that often the majority of their time is spent collecting, cleaning, formatting, and otherwise wrangling data. Fortunately, we don't have to do much work here. I just grabbed a list of towns in New Zealand and assigned each a flag of 1 if it is a Maori name, or 0 if it is not. I removed descriptive words such as "north", "upper", and "beach", as these can be attached to places in any language.
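For reference, the prepared file (names1.txt) is just a tab-separated list of names and flags. The rows below are illustrative of the format rather than the actual file contents:

Rotorua	1
Hamilton	0
Taupo	1
Christchurch	0
Whangarei	1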

Enough explanation! Let's jump into the Python code. We will start by importing the key modules we will need.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

Now we can load the data we've prepared into a pandas dataframe, remove any duplicate records, and remove any spaces in the text.

data = pd.read_csv('names1.txt',sep=r'\t', engine='python', names=['Town','Maori'])
data = data.drop_duplicates()
data['Town'] = data['Town'].str.replace(' ','')

Split the dataset into train and test sets. Setting aside a portion of the data (in this case, 20%) allows us to independently validate the accuracy of our model by testing it against place names which weren't used to train the model.

train, test = train_test_split(data, test_size=0.2,stratify=data['Maori'])
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

Here's the secret sauce. We will use a TF-IDF (Term Frequency-Inverse Document Frequency) statistic to represent how common a feature is in a name in comparison to the entire corpus of names. But how do we define what the features are? For this task, I'm going to split the word into all sequential combinations of 1, 2, and 3 letters. These are also known as n-grams. For example, the word 'auckland' would be split into the set {'a', 'au', 'auc', 'u', 'uc', 'uck',...}. Note I have also set the min_df option to exclude any n-gram which does not appear in at least two words in the dataset.

char_vectoriser = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    min_df = 2,
    ngram_range=(1, 3),
    max_features=100000)

We've just prepared our vectoriser! Now we fit it over our training data, to learn about the occurrences of the features.

char_vectoriser.fit(train['Town'])
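As a quick sanity check, the vectoriser's build_analyzer method exposes the tokenisation step, so we can see exactly which character n-grams it extracts from a given name (note that the min_df filtering only applies when the vocabulary is built during fitting, not here):

analyse = char_vectoriser.build_analyzer()
print(analyse('auckland'))
# e.g. ['a', 'u', 'c', ..., 'au', 'uc', ..., 'auc', 'uck', ...]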

And we can use the document frequencies learned by the fit to transform both our training and test datasets to a document-term matrix.

xtrain = char_vectoriser.transform(train['Town'])
xtest = char_vectoriser.transform(test['Town'])
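It's worth having a quick look at the shape of the resulting matrices. Each row corresponds to a place name and each column to one of the n-gram features retained by the vectoriser; the exact counts will depend on your dataset:

print(xtrain.shape, xtest.shape)           # (rows = names, columns = n-gram features)
print(len(char_vectoriser.vocabulary_))    # number of n-grams kept after min_df filtering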

Now we create the logistic regression classifier and fit it against our training document-term matrix and the target labels.

classifier = LogisticRegression(C=5, solver='liblinear',dual=True)
classifier.fit(xtrain, train['Maori'])
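Before evaluating properly, we can poke the fitted classifier with a couple of names to see it in action. The names below are just illustrative inputs; predict_proba returns, for each name, the probability of each class (not Maori, Maori):

sample = char_vectoriser.transform(['Rotorua', 'Wellington'])
print(classifier.predict(sample))          # predicted class for each name
print(classifier.predict_proba(sample))    # [P(not Maori), P(Maori)] for each name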

We can also evaluate our model using cross-validation to assess how it will generalise.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, xtrain, train['Maori'], cv=4, scoring='accuracy')
print("CV accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

This gives an output like CV accuracy: 0.97 (+/- 0.02), indicating that our model should generalise quite well.

To wrap up, we prepare a container for evaluating our predictions against the test dataset, and then call our trained classifier to predict values.

final_results = pd.DataFrame.from_dict({'Town': test['Town']})
final_results['Maori'] = classifier.predict(xtest)

Let's check how our model did on the held-out test dataset.

print('%0.3f' % (1-abs(final_results['Maori'] - test['Maori']).sum()/test['Maori'].size))

This returns 0.991, indicating that our model has correctly predicted 99% of the names in our test dataset! That's a good model: it completes the task with a high level of accuracy. If we were worried about the last percent, we could improve the model by training it on a larger dataset.
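If you want a little more insight than a single accuracy number, scikit-learn's confusion_matrix and classification_report show where the remaining errors fall, broken down by class:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(test['Maori'], final_results['Maori']))
print(classification_report(test['Maori'], final_results['Maori']))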

Deploying to AWS Lambda

I'm going to demonstrate how a machine learning model can be deployed to AWS Lambda. If you haven't used this service before, I strongly encourage you to check it out. The serverless computing paradigm has been something of a revelation for me. It has completely changed the way I think about developing and deploying functions and applications, by completely abstracting away server management, scaling, and availability.

To start with, we will take our model and pickle it. This serialises the model into a binary format, which will be included in our Lambda function and then unpickled (deserialised) when we need to call the model.

import pickle
pickle.dump(classifier,open('maori_classifier.pickle', 'wb'), protocol=2)
pickle.dump(char_vectoriser,open('maori_char_vectoriser.pickle', 'wb'), protocol=2)
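It's worth confirming that the pickled artefacts load and predict correctly before bundling them into the deployment package. A minimal round-trip check:

with open('maori_classifier.pickle', 'rb') as f:
    loaded_classifier = pickle.load(f)
with open('maori_char_vectoriser.pickle', 'rb') as f:
    loaded_vectoriser = pickle.load(f)

# Should return the same prediction as the in-memory model
print(loaded_classifier.predict(loaded_vectoriser.transform(['Whakapapa'])))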

While the Lambda environment includes a basic Python installation, it does not include the scikit-learn module, which we will need to evaluate new place names submitted to our function. So we will need to include this (along with its numpy and scipy dependencies) in our package. We can't just grab the modules we are using locally, since they need to be compiled for the environment in which they will run. In this case, the AWS Lambda execution environment is based on the Amazon Linux AMI, so you will need to fire up an instance of that (either locally or with AWS EC2), install scikit-learn, and save the built packages. There is a more detailed guide to doing this written by Ryan Brown at the Serverless Code site. He has also created a useful script for automating this process.

That done, we will turn our attention to the code that needs to run in the Lambda environment. We want to define a lambda_handler function, which will process the event our request passes. Our request will look something like this:

{
  "httpMethod": "GET",
  "queryStringParameters": {
    "PlaceName": "Whakapapa"
  }
}

import os
import json
import pickle
import ctypes

# Load the compiled shared libraries (numpy/scipy/scikit-learn) bundled in lib/
for d, _, files in os.walk('lib'):
    for f in files:
        if f.endswith('.a'):
            continue
        ctypes.cdll.LoadLibrary(os.path.join(d, f))

import sklearn

def lambda_handler(event, context):

    name = event.get('queryStringParameters')['PlaceName']

    # Deserialise the model and vectoriser we pickled earlier
    classifier = pickle.load(open('maori_classifier.pickle', 'rb'))
    char_vectoriser = pickle.load(open('maori_char_vectoriser.pickle', 'rb'))
    result_a = classifier.predict(char_vectoriser.transform([name]))[0]
    result_b = classifier.predict_proba(char_vectoriser.transform([name]))[0]

    if result_a == 0:
        string_resp = ('The model predicts that ' + name + ' is not a Maori name, '
                       'with probability ' + '{:.1%}'.format(result_b[0]) + '.')
    else:
        string_resp = ('The model predicts that ' + name + ' is a Maori name, '
                       'with probability ' + '{:.1%}'.format(result_b[1]) + '.')

    response = {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json", "Access-Control-Allow-Origin": "*"},
        "body": json.dumps({
            "result_a": int(result_a),
            "result_b": result_b.tolist(),
            "string_resp": string_resp
        })
    }

    return response

Now take that code, zip it up with the packages we prepared previously, and upload it to a new AWS Lambda function. Test your code with a sample JSON payload like the one above.

There is one final step to make your function available. Add a trigger in the form of an API Gateway, and enable CORS if necessary.
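Once the API Gateway trigger is in place, you can query the model over HTTP. Here is a minimal check using the requests library; the URL and resource path below are placeholders, so substitute the invoke URL that API Gateway gives you:

import requests

# Hypothetical invoke URL; replace with your own API Gateway endpoint
url = 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/classify'
resp = requests.get(url, params={'PlaceName': 'Whakapapa'})
print(resp.json()['string_resp'])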

And there we have it! Our model has been made available for anyone to evaluate any text they submit. Try it out for yourself here.


If you have any questions or comments on this post, please don't hesitate to contact me.