SineReLU — An Alternative to the ReLU Activation Function

Wilder Rodrigues
8 min read · Jun 4, 2018
SineReLU, or a blurry version of it.

Interlude

It’s been almost a year since I came up with this new activation function. You probably haven’t heard about it. Actually, I’m pretty sure you haven’t. But why not? Well, it’s not mainstream, and it’s not being sponsored by a big corporation or by academia either. Perhaps the question now is: is it relevant?

I could take another ten minutes explaining why it is relevant. However, to save some time for those in a hurry, feel free to jump to roughly the second half of the story. Nevertheless, if you want to know how it all started, be my guest and let’s go through this journey together.

IBM Watson AI XPRIZE Competition

Let’s get started with the boring part of this. It was somewhere in 2016 when I heard about the IBM Watson AI XPRIZE competition. To be honest, I got a bit excited. It was not about the money, but about the opportunity to make a difference and help people in need. You know there are lots of NGOs/NPOs out there, but we keep wondering why there is still poverty. So, when I saw news on the Internet about the competition, I immediately thought that I should be able to do something to help some people around the world.

That’s touching!

But what did I do or come up with? Well, after spending about 4 years studying Artificial Intelligence (i.e. AI) through books by Jeff Hawkins, Daniel Jurafsky, Peter Norvig and Kevin Warwick, and seventeen courses on Coursera by Andrew Ng, Geoffrey Hinton, Rajesh Rao and Adrienne Fairhall, I thought I could be of some help.

Putting it all together, I followed 27 weeks of courses with Professor Andrew. After each module, I heard him say that I now knew more than a fair number of people walking around in Silicon Valley. However, that’s not what triggered me. What really got my brain into something different was when he said: “but you don’t need to come up with a new activation function, for example. There are researchers working on it.”. At that moment, in my mind, I had: challenge accepted.

So, once the IBM Watson AI XPRIZE competition started, I also started thinking about what to submit as a plan. I then thought that creating a better activation function could help small companies, without a huge amount of resources, to train models in a better way, increasing accuracy and reducing loss. It wouldn’t end hunger on the planet, but it could help small companies around the globe achieve something with a small dataset.

The whole thing started in 2016. Around mid 2017, I still didn’t have anything. It was a bit frustrating and difficult to tackle. By that time, I knew most of the existing activation functions and their problems. I also knew about the regularisation and optimisation mechanisms that help during training. Unfortunately, they didn’t shed any light on my pool of ideas. Then, just two weeks before the submission deadline, something came to my mind: something Professor Adrienne Fairhall had said about the fact that we don’t know how the brain works, not in its entirety, and that perhaps some uncertainty could help to understand it.

Well, I got into this idea and simply thought that instead of having a ReLU, where everything negative gets clamped to zero, we could have uncertainty. The output could be just over or just below zero. But most importantly, and unlike the ReLU, it should be differentiable everywhere.

The SineReLU was born!

It would really take only five minutes to explain the whole intuition behind it. Perhaps, given the name, you already have some idea of how it was conceived.

But before we get into the details, let’s look at what is wrong with some of the most common activation functions out there.

Sigmoid and Tanh

There is not much to say here, not if you have read the astonishing story by Andrej Karpathy on understanding back-propagation. What I will do here is sum up what he explained in his post.

To start with, let’s talk about vanishing and exploding gradients. Look at the text below:

When the weights are too large, the output of the matrix multiply can have a very large range, which pushes every element of the output vector z towards the extremes, making them almost binary: either 1 or 0. In both cases, z*(1-z), which is the local gradient of the sigmoid non-linearity, becomes zero — it vanishes, making the whole gradient zero. And what happens to the rest of the gradient? Or better, what happens when one multiplies by zero? Bam! Weights are not updated, neurons are dead. Large weights can also cause the opposite problem: instead of vanishing, the gradients explode, and the weights get updated to huge numbers.

And that’s all because of something like this:

Source: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
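
To make the vanishing part concrete, here is a tiny NumPy sketch of my own (not code from Karpathy’s post) that evaluates the local gradient z*(1-z) for moderate and for large pre-activations:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Moderate pre-activations: the local gradient is healthy.
    z = sigmoid(np.array([-0.5, 0.0, 0.5]))
    print(z * (1 - z))   # roughly [0.235, 0.25, 0.235]

    # Large pre-activations (e.g. caused by large weights): z saturates
    # near 0 or 1, so the local gradient z * (1 - z) all but vanishes.
    z = sigmoid(np.array([-20.0, 20.0]))
    print(z * (1 - z))   # roughly [2e-9, 2e-9]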

ReLU

The Rectified Linear Unit, or ReLU, is simple and elegant. It’s a sparse function, which also makes it easier to embed in hardware. It does not require complex mathematical operations, as you might infer from the graph below. However, in its simplicity, the ReLU can cause an irreversible problem.

It causes brain damage!

The thing is, when initialising the weights of either a fully connected neural network or a convolutional neural network, a portion of the resulting pre-activations can be less than or equal to zero. When that happens, the ReLU will flatten those values to zero. And again, what happens when one multiplies by zero? Bam, neurons are dead!

Just to make things clearer, that’s the graph of the ReLU function:

Source: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

So, whatever is equal to zero is pretty much dead, and it’s irreversible.

If you want to get an idea of the impact of this, just run the ReLU function on some initialised weights and you will see that, most of the time, the dead neurons (affected by the dying ReLU effect) represent approximately 40% of the neuron population.
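
If you want to try that yourself, here is a rough sketch (my own illustration, with made-up layer sizes): it pushes a batch of random inputs through a randomly initialised layer and counts how many outputs the ReLU clamps to zero. Depending on the initialisation, the fraction typically lands somewhere around 40–50%.

    import numpy as np

    np.random.seed(42)
    X = np.random.randn(1000, 784)          # fake input batch
    W = np.random.randn(784, 128) * 0.01    # randomly initialised weights
    b = np.zeros(128)

    Z = X.dot(W) + b                        # pre-activations
    A = np.maximum(0, Z)                    # ReLU

    dead_fraction = np.mean(A == 0)
    print('Fraction of outputs clamped to zero: {:.1%}'.format(dead_fraction))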

But, what happens during back-propagation? Well, nothing cool!

During the forward pass, the neurons are dead. When the network starts computing the back-propagation, the weights won’t get updated due to the dying effect. Thus, this problem causes irreversible “brain” damage to the network’s neurons.

Here is a plot of the derivative, calculated during the back-propagation, of the ReLU activation function:

Source: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

But how to fix the Dying ReLU issue?

Some other functions in the “rectifier family” try to solve this problem by making themselves dense, compared to the sparsity of the ReLU. This has been shown to be a better approach, but two things come into play when you have more density: computation and parameter tuning.

Training the same model several times just to tune an extra learning parameter is not the most cost-effective approach for a company. In that sense, simply applying the ReLU seems to be a better option.

If you want to know more about those other activation functions, please google for:

  • Leaky-ReLU (sketched right after this list);
  • PReLU;
  • RReLU; and
  • S-Shaped ReLU
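
Just to give a feeling for what these denser variants do, here is a minimal sketch of the Leaky ReLU (my own illustration, not code from any of those papers). Note the extra parameter alpha, which is exactly the kind of knob you end up having to tune:

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def leaky_relu(z, alpha=0.01):
        # Negative inputs are scaled by alpha instead of being clamped to zero,
        # so their gradient (alpha) never vanishes completely.
        return np.where(z > 0, z, alpha * z)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z))         # [0.    0.    0.    0.5   2.  ]
    print(leaky_relu(z))   # [-0.02  -0.005  0.     0.5    2.   ]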

Let’s have a look at the newest cousin of the ReLU activation function…

The SineReLU

The idea is pretty simple and easy to explain: the SineReLU is a differentiable function. So, no matter what, no neuron will die! But what does it look like? Have a look below:

SineReLU.

In the graph depicted above, you might see some oscillations in the line next to the X axis. That’s a sinusoidal wave: for negative inputs the function oscillates around zero instead of being clamped to it.

But wait! If it gets through zero, how is the function differentiable? Well, look at the formula below:

SineReLU piece-wise equation: f(Z) = Z, for Z > 0; f(Z) = ε(sin(Z) − cos(Z)), for Z ≤ 0.

So, whenever there is a value that is less than or equal to zero, we apply the equation:

ε(sin(Z) - cos(Z))

The ε works as a hyperparameter, used to control the amplitude of the wave. There are some default values to be used, but we will get to that once I tell you where the function can be found.

An important aspect of this function is its differentiability. And why is that? Well, for a given Z = 0, the output of the function will be -1 (not taking into account the epsilon parameter).

SineReLU forward pass.
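
As a minimal NumPy sketch of that piece-wise definition (my own illustration; the actual Keras implementation lives in keras-contrib and may differ in its details):

    import numpy as np

    def sine_relu(z, epsilon=0.0025):
        # Positive side: identity, exactly like the ReLU.
        # Negative side: a small sinusoidal wave instead of a flat zero,
        # with epsilon controlling the amplitude.
        return np.where(z > 0, z, epsilon * (np.sin(z) - np.cos(z)))

    z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    print(sine_relu(z))   # negative inputs give small non-zero values; sine_relu(0) == -epsilon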

It seems to behave better than the ReLU. But wait, let’s look a bit further.

The Derivative

Nowadays, with the use of meta-frameworks like Keras, there is an increasing number of Machine Learning enthusiasts who don’t know what is going on under the hood of those frameworks. This assumption is not based on any article, but chiefly on talking to people at meetups and conferences and finding out how much they know about the inner workings of deep learning and the algorithms involved.

Here I have already tried to explain a few concepts about the forward pass and back-propagation. But now, to focus more on the benefits of the SineReLU: what happens with its derivative?
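
Working out the derivative by hand from the piece-wise definition (my own working, not taken from the keras-contrib code): the positive branch has derivative 1, and the negative branch ε(sin(Z) − cos(Z)) differentiates to ε(cos(Z) + sin(Z)). A small sketch:

    import numpy as np

    def sine_relu_grad(z, epsilon=0.0025):
        # d/dz [z] = 1 on the positive side;
        # d/dz [epsilon * (sin(z) - cos(z))] = epsilon * (cos(z) + sin(z)) on the negative side.
        return np.where(z > 0, 1.0, epsilon * (np.cos(z) + np.sin(z)))

    z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    print(sine_relu_grad(z))
    # Unlike the ReLU, the negative side is not identically zero: it oscillates,
    # so gradients keep flowing for negative inputs.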

SineReLU Derivative: an image speaks more than a thousand words.

Based on the graph above, it seems clear that the SineReLU is a differentiable function. But the next question is: does it work?

Yes, it does. Actually, it performs better than the ReLU most of the time (remember Fairhall’s uncertainty?).

Metrics

Of course, just saying it performs better most of the time doesn’t prove anything if the numbers are not shown.

The SineReLU function has been tested on several datasets, from the classic ones (e.g. MNIST, IMDb) to some more exotic ones (e.g. the Kaggle Toxicity and Statoil/C-CORE Iceberg Classifier challenges). The metrics in the tables below are based on MNIST classification, with the following hyperparameters (a rough sketch of such a model follows the results table below):

  • Layers: 5
  • Dropout: 20%, 30%, 40% and 50%, applied from the 2nd convolutional layer up to the fully connected layer.
  • Epochs: 40
  • Optimiser: Adam
  • SineReLU epsilon: 0.0025 (CNN layer); 0.025 (Dense layer).
SineReLU performance compared to the ReLU activation function.
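
To make that configuration a bit more concrete, here is a rough sketch of a model along those lines. It is only an approximation: the exact architecture isn’t spelled out in this post, and it assumes the layer is importable as keras_contrib.layers.SineReLU with an epsilon argument (check the repository for the exact path and signature).

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
    from keras_contrib.layers import SineReLU   # assumed import path

    model = Sequential([
        Conv2D(32, (3, 3), input_shape=(28, 28, 1)),
        SineReLU(epsilon=0.0025),       # CNN epsilon from the list above
        Conv2D(64, (3, 3)),
        SineReLU(epsilon=0.0025),
        Dropout(0.2),                   # dropout grows towards the dense layers
        MaxPooling2D((2, 2)),
        Conv2D(128, (3, 3)),
        SineReLU(epsilon=0.0025),
        Dropout(0.3),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(256),
        SineReLU(epsilon=0.025),        # Dense epsilon from the list above
        Dropout(0.4),
        Dense(128),
        SineReLU(epsilon=0.025),
        Dropout(0.5),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # model.fit(x_train, y_train, epochs=40, validation_data=(x_test, y_test))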

More Computation

Since the function computes a sine and a cosine on the negative side rather than simply clamping to zero, it does more work and is therefore slower than the ReLU. For instance, running the MNIST dataset with a LeNet-5 on a MacBook Pro, the ReLU takes about 3h45min in total, whilst the SineReLU takes about 4h15min.

Where can I find it?

The results presented were good enough to bring the SineReLU to the Keras meta-framework, where it can be used by anyone.

However, due to its lack of popularity (it’s still pretty young), the SineReLU function is only available via the keras-contrib GitHub repository. If you, reading this post, start using it, I’m sure it will make its way into the main Keras codebase.
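
If you want to give it a try, something along these lines should work; note that both the install command and the import path below are assumptions based on the keras-contrib repository layout, so double-check them against the README:

    # Assumed install command; check the keras-contrib README for the current one:
    #   pip install git+https://www.github.com/keras-team/keras-contrib.git

    from keras_contrib.layers import SineReLU

    # Used like any other Keras layer; epsilon controls the wave amplitude.
    layer = SineReLU(epsilon=0.0025)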
