Motivation

In this article, I will explore an information-theoretic quantity - mutual information - to perform feature selection for models of continuous variables, such as a sequence of price returns. Mutual information offers several advantages, including:

  1. Model-independent feature selection: Many feature selection techniques are tied to a particular class of models. For instance, Boruta, Gini importance, or mean decrease in impurity rely on tree-based models to rank the importance of a set of features. While there is obvious value in selecting features whose predictive power is well leveraged by a particular class of models, this approach may lead to dropping informative features if the end model is not of the same class as the one used for feature selection. In this context, mutual information offers a completely model-agnostic approach to feature selection.

  2. Full-distribution feature selection: When modeling a continuous variable, most feature selection techniques end up selecting features that do better at predicting the mean of the target variable. If we are trying to forecast the return of a given price series, for instance, it is as important to predict its mean as it is to predict its uncertainty (volatility), which allows a better-sized allocation to that particular strategy or forecast. This means that we often benefit from selecting features that carry information about different aspects of the distribution of our target, not only its mean. Mutual information does exactly that. Even if we are working with classification models, which output buy and sell signals according to some rules, it is extremely important to have a way to model the confidence of these signals. For instance, (De Prado, 2018) describes using meta-labeling and training a surrogate model to predict the confidence in the buy and sell signals derived from the primary model. The best features for this surrogate model may be different from those of the primary model.

Two different kinds of relationships

Let's conduct a controlled experiment using two different scenarios:

  1. Variable $X$ (cause) partially determines the mean of variable $Y$ (effect)
  2. Variable $X$ (cause) partially determines the scale (standard deviation) of variable $Y$ (effect)

For the sake of simplicity, let's restrict ourselves to normal distributions here.

Let's define a function that generates these samples:

Note: All the code shown in this article has been written for clarity rather than efficiency

import numpy as np

def normalize(sample):
	return (sample-np.mean(sample))/np.std(sample)


def generate_samples(N, dependence_coef, which_param):
	### Cause
	samples_cause = np.random.normal(loc=0.0, scale=1.0, size=N)

	### Effect
	inherent_noise = np.random.normal(loc=0.5, scale=1.0, size=N)
	param = normalize((1-dependence_coef)*inherent_noise + dependence_coef*samples_cause)

	samples_effect = list()
	for i in range(0, N):
		if which_param == 1: # relationship on the mean
			mean = param[i]
			scale = 1.0
		elif which_param == 2: # relationship on the scale
			mean = np.random.normal(loc=np.random.normal(loc=0, scale=1.0)) # mean unrelated to X by construction
			scale = np.exp(param[i])

		samples_effect.append(np.random.normal(loc=mean, scale=scale))

	return [normalize(samples_cause), normalize(np.array(samples_effect))]

Note that we are allowing a varying level of "causation" between $X$ and $Y$, determined by dependence_coef, which varies between 0 (no relation) and 1 (strong relation). Even a strong relation does not imply a perfect correlation between the variables: each observation of $Y$ is drawn from a univariate normal distribution whose parameter - either the mean or the scale - depends on the corresponding value of $X$. This ensures a more complex, non-trivial stochastic relation between the two random variables.
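
As a quick check of this last point (a minimal sketch; the value of roughly 0.7 is only an approximation implied by the construction above), even a dependence coefficient of 1.0 does not yield a perfect correlation, because each observation of $Y$ still carries its own noise:

# Even with dependence_coef=1.0 the correlation stays well below 1,
# since Y is still drawn with unit scale around the X-dependent mean
(cause_full, effect_full) = generate_samples(N=100000, dependence_coef=1.0, which_param=1)
print(np.round(np.corrcoef(cause_full, effect_full)[0, 1], 2)) # roughly 0.7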

Let's generate these samples, assuming a dependence coefficient of 0.5:

N = 10000
dependence_coef = 0.5

# Relationship on the mean
(cause_mean, effect_mean) = generate_samples(N=N, dependence_coef=dependence_coef, which_param=1)

# Relationship on the scale
(cause_scale, effect_scale) = generate_samples(N=N, dependence_coef=dependence_coef, which_param=2)

And now let's plot the samples:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), constrained_layout=True)
axes[0].plot(cause_mean, effect_mean, '.', color=(0.8,0.5,0.5,0.2))
axes[0].set_xlabel("Cause - X")
axes[0].set_ylabel("Effect - Y")
axes[0].set_title("Relation on the mean")
axes[1].plot(cause_scale, effect_scale, '.', color=(0.8,0.5,0.5,0.2))
axes[1].set_xlabel("Cause - X")
axes[1].set_ylabel("Effect - Y")
axes[1].set_title("Relation on the scale")
_=plt.suptitle("Dependence coefficient = "+str(dependence_coef))

As can be seen above, on the left-hand side $X$ is a good feature for predicting the mean of $Y$, but it is unrelated to its variance, which is purely stochastic by construction. On the right-hand side, conversely, $X$ is unrelated to the mean of $Y$ but moderately related to its scale.

Let's check other dependence coefficients:

Note: The way I constructed the "Relation on the mean" above results in an essentially linear relationship between $X$ and $Y$, which could be quantified by their covariance. However, covariance and correlation fail when the relationship between $X$ and the mean of $Y$ is nonlinear, whereas mutual information does not. The "Relation on the scale" construction goes further: there is no relationship at all, linear or nonlinear, between $X$ and the mean of $Y$. Despite that, and as I will demonstrate below, mutual information is still able to quantify the strength of this dependence.
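
As a quick illustration of this point (a minimal sketch using only the samples generated above and numpy's corrcoef), the Pearson correlation is sizeable for the relation on the mean but essentially vanishes for the relation on the scale, even though the dependence is there by construction:

# Pearson correlation picks up the relation on the mean...
print(np.round(np.corrcoef(cause_mean, effect_mean)[0, 1], 2))

# ...but is blind to the relation on the scale, despite the dependence
print(np.round(np.corrcoef(cause_scale, effect_scale)[0, 1], 2))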

Entropy, Joint Entropy and Mutual Information

I'm going to quickly review the definition of entropy, joint entropy, and mutual information. A more detailed, yet simple, introduction to the subject can be found on Wikipedia, for instance. For a more complete description, (Cover, 1999) is a great reference.

For a discrete random variable $X$, we can define the Shannon entropy as

$$S(X) = - \displaystyle\sum_{i=1}^N p(x_i) \mathrm{log}_2 [ p(x_i) ], $$

and the joint entropy of $X$ and $Y$ as

$$S(X,Y) = - \displaystyle\sum_{i=1}^{N_x} \displaystyle\sum_{j=1}^{N_y} p(x_i, y_j) \mathrm{log}_2 [ p(x_i, y_j) ].$$

Note that, if $X$ and $Y$ are independent, $S(X,Y) = S(X) + S(Y)$. The mutual information can then be defined as

$$I(X,Y) = S(X) + S(Y) - S(X,Y),$$

thus measuring the "amount" of information that is shared between $X$ and $Y$. As such, if $X$ and $Y$ are independent, $I(X,Y)=0$. An equivalent definition of mutual information can be written in terms of the conditional entropy:

$$I(X,Y) = S(X) - S(X|Y) = S(Y) - S(Y|X).$$

We can then understand the mutual information as the information that we get about $X$ by knowing only $Y$, or vice versa.

Note: The choice of base-2 logarithm above is somewhat arbitrary, changing only the units in which we measure the "quantity" of information. Base 2 allows this quantity to be expressed in bits.
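
As a quick sanity check of these definitions on a fully discrete example (a minimal sketch, using nothing beyond numpy), consider two independent fair coins: each has an entropy of 1 bit, their joint entropy is 2 bits, and their mutual information is therefore 0 bits:

# Joint distribution of two independent fair coins: p(x, y) = 1/4 everywhere
p_xy = np.full((2, 2), 0.25)
p_x = p_xy.sum(axis=1) # marginal of X: [0.5, 0.5]
p_y = p_xy.sum(axis=0) # marginal of Y: [0.5, 0.5]

S_X = -np.sum(p_x*np.log2(p_x))    # 1 bit
S_Y = -np.sum(p_y*np.log2(p_y))    # 1 bit
S_XY = -np.sum(p_xy*np.log2(p_xy)) # 2 bits
print(S_X, S_Y, S_XY, S_X + S_Y - S_XY) # mutual information: 0 bits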

Sample estimation

While the theoretical definitions are simple and straightforward, robustly estimating these quantities from sample data is often tricky, especially when dealing with continuous random variables. Given that we are using synthetic data here, we can produce samples as large as we want, obtaining good statistical convergence if so desired. We therefore take the simplest possible approach: histogramming the observations and applying the above definitions directly.

Let's define some functions:

def calculate_entropy(X):
	# 1) Histograms the samples (number of bins grows as the cube root of the sample size)
	nbins = int(len(X)**(1/3))
	p = np.histogram(X, bins=nbins, density=False)[0]
	p = p/np.sum(p)+1e-6 # small offset to avoid log(0) in empty bins
	# 2) Calculates the entropy
	entropy = -np.sum(p*np.log2(p))

	return entropy

def calculate_joint_entropy(X, Y):
	# 1) Histograms the samples on a 2D grid
	nbins = int(len(X)**(1/3))
	p = np.histogram2d(X, Y, bins=nbins, density=False)[0]
	p = p/np.sum(p)+1e-6 # small offset to avoid log(0) in empty bins
	# 2) Calculates the joint entropy
	entropy = -np.sum(p*np.log2(p))

	return entropy

def calculate_mutual_information(X, Y):
	S_X = calculate_entropy(X)
	S_Y = calculate_entropy(Y)
	S_XY = calculate_joint_entropy(X, Y)
	I = S_X+S_Y-S_XY
	return I
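
Before tackling a subtlety of this estimator (discussed next), we can already apply it to the samples generated earlier. A quick usage sketch, keeping in mind that the raw values depend on the number of bins:

# Histogram-based mutual information estimates on the two synthetic samples
print(np.round(calculate_mutual_information(X=cause_mean, Y=effect_mean), 3))
print(np.round(calculate_mutual_information(X=cause_scale, Y=effect_scale), 3))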

Dealing with discretization scaling

From the point of view of estimating entropy and joint entropy, the discretization procedure above introduces a dependence on the binning - a scaling of the estimates - that needs to be controlled for.

To deal with this issue, besides estimating the mutual information for the original sample $(x_i, y_i)$ - $I_{\mathrm{sample}}$ - we are going to estimate the mutual information on a number of datasets in which the observations of one of the original variables, say $Y$, are randomly permuted - $I_{\mathrm{perm}}^j$. We can then define the mutual information score

$$s_I = \frac{I_{\mathrm{sample}} - \mathrm{mean}(I_{\mathrm{perm}}^j)}{\mathrm{sd}(I_{\mathrm{perm}}^j)}.$$

With this normalization, we arrive at a quantity that is insensitive to the scaling issues arising from the discretization of $X$ and $Y$. In essence, $s_I$ measures the confidence, in number of standard deviations, that the relation between $X$ and $Y$ is not random.

Some code:

def calculate_mutual_information_score(X, Y, n_perm):
	# Mutual information on original samples
	I = calculate_mutual_information(X=X, Y=Y)

	# Mutual information on randomly shuffled data
	I_perm = list()
	ind = np.arange(len(Y))
	for i in range(0, n_perm):
		np.random.shuffle(ind)
		Y_shuffled = Y[ind]
		I_perm.append(calculate_mutual_information(X=X, Y=Y_shuffled))

	# Calculates the mutual information score
	mi_score = (I-np.mean(I_perm))/np.std(I_perm)

	return mi_score

Let's now estimate the mutual information score in a few scenarios.

Let's begin with a small dependence between $X$ and $Y$ (here, $X$ related to the mean of $Y$):

(cause, effect) = generate_samples(N=100000, dependence_coef=0.05, which_param=1)

n_perm = 100
mi_score = calculate_mutual_information_score(X=cause, Y=effect, n_perm=n_perm)
print(np.round(mi_score,1))
0.9

and a slightly stronger relation:

(cause, effect) = generate_samples(N=100000, dependence_coef=0.20, which_param=1)

n_perm = 100
mi_score = calculate_mutual_information_score(X=cause, Y=effect, n_perm=n_perm)
print(np.round(mi_score,1))
67.9

Let's do a more systematic experiment by slowly increasing the dependence coefficient from 0 to 1 and looking at the mutual information score for the two cases we have been considering - the mean and the scale relation:

dependence_coefs = np.linspace(0, 1, 20)

mi_scores_mean = list()
mi_scores_scale = list()

for coef in dependence_coefs:

	### On mean
	samples = generate_samples(N=100000, dependence_coef=coef, which_param=1)
	samples_cause = normalize(samples[0])
	samples_effect = normalize(samples[1])
	mi_scores_mean.append(calculate_mutual_information_score(X=samples_cause, Y=samples_effect, n_perm=50))
	
	### On scale
	samples = generate_samples(N=100000, dependence_coef=coef, which_param=2)
	samples_cause = normalize(samples[0])
	samples_effect = normalize(samples[1])
	mi_scores_scale.append(calculate_mutual_information_score(X=samples_cause, Y=samples_effect, n_perm=50))

And now plotting the results:

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].plot(dependence_coefs, mi_scores_mean, '-', color=(0.8,0.5,0.5,1.0))
axes[0].set_xlabel("Dependence coeficient")
axes[0].set_ylabel("Mutual information score")
axes[0].set_title("Relation on the mean")
axes[1].plot(dependence_coefs, mi_scores_scale, '-', color=(0.8,0.5,0.5,1.0))
axes[1].set_xlabel("Dependence coeficient")
axes[1].set_ylabel("Mutual information score")
axes[1].set_title("Relation on the scale")
plt.show()

We correctly infer the increasing strength of the relationship between $X$ and $Y$. Note that we arrive at such large values of the mutual information score (or, equivalently, such high confidence in the non-randomness of the relationship) because of the large number of points in our samples (100,000). If we re-run the experiment with 10,000 points, we obtain noticeably smaller scores.
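
The sample-size effect can also be checked directly for a single dependence coefficient. A minimal sketch reusing the functions defined above (the exact scores will fluctuate from run to run):

# Mutual information score versus sample size, for a fixed dependence on the mean
for N in [1000, 10000, 100000]:
	(cause, effect) = generate_samples(N=N, dependence_coef=0.20, which_param=1)
	mi_score = calculate_mutual_information_score(X=cause, Y=effect, n_perm=100)
	print(N, np.round(mi_score, 1))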

Dealing with sample fluctuations

Any statistical estimation is subject to sample fluctuations. Even if our data-generating process is stationary, which is the case here, different samples will lead to different mutual information scores.

The problem is more severe if the data-generating process is not stationary, which is often the case with financial time series, where regime or structural shifts can change the relation between variables over time. The best we can do is to investigate the non-uniformity of our sample using empirical techniques, like resampling, and analyze the inter-sample variations.

While bootstrapping and bagging, for instance, would be appropriate for the experiment we are conducting here, they are not appropriate for financial time series, because they do not maintain the time ordering of the observations. The premise is that, while regime shifts may occur, a given regime shows some level of persistence.

With this in mind, we are going to consider sequential resampling:

def calculate_resample_inds(N, n_groups):
    # Splits the N time-ordered observations into n_groups contiguous blocks;
    # column i of the returned matrix holds the indices of block i
    inds = np.arange(N)
    n_per_group = int(np.floor(N/n_groups))
    resample_inds = np.reshape(inds[0:n_groups*n_per_group], (-1, n_groups), order='F')
    return resample_inds
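
To make the grouping explicit, here is what the resampling indices look like on a toy example; with the reshape above, each column of the returned matrix corresponds to one contiguous block of consecutive observations:

print(calculate_resample_inds(N=10, n_groups=2))
# [[0 5]
#  [1 6]
#  [2 7]
#  [3 8]
#  [4 9]]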

And let's generate the samples and compute the resampled scores:

n_groups = 10
dependence_coefs = np.linspace(0, 1, 20)

mi_scores_mean = list()
mi_scores_scale = list()

# Resampling indices
inds = calculate_resample_inds(N=100000, n_groups=n_groups)

for coef in dependence_coefs:
    
    ### On mean
    # Full sample
    samples = generate_samples(N=100000, dependence_coef=coef, which_param=1)
    samples_cause = normalize(samples[0])
    samples_effect = normalize(samples[1])
    # Resampling
    vals = [calculate_mutual_information_score(X=samples_cause[inds[:,i]], Y=samples_effect[inds[:,i]], n_perm=50) for i in range(0, n_groups)]
    mi_scores_mean.append(vals)
    
    ### On scale
    # Full sample
    samples = generate_samples(N=100000, dependence_coef=coef, which_param=2)
    samples_cause = normalize(samples[0])
    samples_effect = normalize(samples[1])
    # Resampling
    vals = [calculate_mutual_information_score(X=samples_cause[inds[:,i]], Y=samples_effect[inds[:,i]], n_perm=50) for i in range(0, n_groups)]
    mi_scores_scale.append(vals)
    
mi_scores_mean = np.array(mi_scores_mean)
mi_scores_scale = np.array(mi_scores_scale)

And now plotting the results:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].boxplot(mi_scores_mean.T)
axes[0].set_xticklabels(np.round(dependence_coefs,2), rotation=90)
axes[0].set_xlabel("Dependence coeficient")
axes[0].set_ylabel("Mutual information score")
axes[0].set_title("Relation on the mean")
axes[1].boxplot(mi_scores_scale.T)
axes[1].set_xticklabels(np.round(dependence_coefs,2), rotation=90)
axes[1].set_xlabel("Dependence coeficient")
axes[1].set_ylabel("Mutual information score")
axes[1].set_title("Relation on the scale")
plt.show()

This resampling partially quantifies sample fluctuations and inhomogeneities. However, while in this experiment we can generate samples of arbitrary size, in real datasets we may be limited to a small number of observations. In that case, excessive resampling, especially sequential resampling like the one used here, may lead to groups that are too small to provide useful information.

Conclusions

The idea behind this exposition was to demonstrate the strengths of mutual information-based feature selection. One big advantage is that we can infer relationships that are not captured by other feature selection techniques. In the context of trading, it is tremendously important to be able to predict not only the direction of the next price movement but also the volatility, or uncertainty, of this forecast. This can be done, for example, by developing full probabilistic regression models or, in the context of classification models, by using the meta-labeling technique described in (De Prado, 2018). Mutual information-based feature selection can then be used to construct these models.

References:

  1. De Prado, M. L. (2018). Advances in financial machine learning. John Wiley & Sons.
  2. Cover, T. M. (1999). Elements of information theory. John Wiley & Sons.