Monochrome (Black & white) plots in matplotlib

Posted on Wed 10 August 2016 in Notebooks

While writing my thesis, I was annoyed that there weren't any good default options for producing monochrome plots, since I couldn't count on all printed copies being in color. I therefore wanted plots that would work without relying on color or even greyscale.

Right now this notebook describes how to set up and use line plots and bar plots. If you need other types of plots, do not hesitate to contact me, and I'll see what I can do.

In [1]:
from sklearn import datasets
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib

Line and marker styles

To get an idea of which line styles and markers are available, we can inspect the lines and markers modules.

In [2]:
from matplotlib import lines, markers
In [3]:
lines.lineStyles.keys()
dict_keys(['', ' ', '--', ':', 'None', '-', '-.'])
In [4]:
markers.MarkerStyle.markers.keys()
dict_keys([0, 1, '*', 3, 4, 5, 6, 7, '8', 'None', 'd', 'h', 'D', 'v', None, '^', ',', '>', 'x', '<', 's', 'p', '', '2', '4', ' ', '_', 'o', '+', 'H', '|', 2, '1', '3', '.'])

Cycle through line and marker styles

First we are going to create a cycler object, that we will use to cycle through different styles. Using this object we can have a new line-style every time we plot a new line, and don't have to manually ensure that our lines are monochrome and different.

Cycler objects can be composed of several cycler objects and will iterate over all permutations of its components forever. Let us create a cycler object that cycles through several line and marker styles all with the color black.

In [5]:
from cycler import cycler

# Create cycler object. Use any styling from above you please
monochrome = (cycler('color', ['k']) * cycler('linestyle', ['-', '--', ':', '-.']) * cycler('marker', ['^', ',', '.']))

# Print examples of output from cycler object.
# A cycler object, when called, returns an iterator that cycles over its items indefinitely (like itertools.cycle)
print("number of items in monochrome:", len(monochrome))
for i, item in zip(range(15), monochrome()):
    print(i, item)
number of items in monochrome: 12
0 {'color': 'k', 'linestyle': '-', 'marker': '^'}
1 {'color': 'k', 'linestyle': '-', 'marker': ','}
2 {'color': 'k', 'linestyle': '-', 'marker': '.'}
3 {'color': 'k', 'linestyle': '--', 'marker': '^'}
4 {'color': 'k', 'linestyle': '--', 'marker': ','}
5 {'color': 'k', 'linestyle': '--', 'marker': '.'}
6 {'color': 'k', 'linestyle': ':', 'marker': '^'}
7 {'color': 'k', 'linestyle': ':', 'marker': ','}
8 {'color': 'k', 'linestyle': ':', 'marker': '.'}
9 {'color': 'k', 'linestyle': '-.', 'marker': '^'}
10 {'color': 'k', 'linestyle': '-.', 'marker': ','}
11 {'color': 'k', 'linestyle': '-.', 'marker': '.'}
12 {'color': 'k', 'linestyle': '-', 'marker': '^'}
13 {'color': 'k', 'linestyle': '-', 'marker': ','}
14 {'color': 'k', 'linestyle': '-', 'marker': '.'}
In [6]:
# IPython can also pretty-print our cycler object
monochrome

Create monochrome figure and axes object

Most people learn matplotlib through pyplot, the command-style functions that make matplotlib work like MATLAB. There is also the more direct approach of manipulating matplotlib objects directly. In my experience this is more powerful, and as far as I can tell we can't make monochrome plots without using the object-oriented interface, so in this tutorial I will use it as much as possible.

It is, however, a more cumbersome interface, and the people behind matplotlib appear to recommend mixing both, so I will too.
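For readers who have not met the object-oriented interface before, here is a minimal side-by-side sketch (not part of the original notebook) of the two ways of drawing the same line:

```python
import matplotlib.pyplot as plt
import numpy as np

# pyplot state-machine style: an implicit "current" figure and axes
plt.figure()
plt.plot(np.arange(10))
plt.title('pyplot interface')

# object-oriented style: explicit figure and axes objects we can pass around
fig, ax = plt.subplots(1, 1)
ax.plot(np.arange(10))
ax.set_title('object-oriented interface')
```

The explicit `fig` and `ax` handles are what let us style a single plot without touching global state.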

First, let us take a look at an empty plot:

In [7]:
plt.plot()  # draw an empty set of axes

If we draw a number of lines, we can see that the default behavior of matplotlib is to give them different colors:

In [8]:
for i in range(1,5):
    plt.plot(np.arange(10), np.arange(10)*i)

Add a custom cycler

Let us add the monochrome cycler as the prop_cycle for the next plot. This plot we will generate using the object-oriented approach. The subplots function (notice the s) returns a figure object and as many axes objects as we ask for. I find this the easiest way to get both of these objects, even for plots with only one axes.

In [9]:
fig, ax = plt.subplots(1, 1)
ax.set_prop_cycle(monochrome)
for i in range(1, 5):
    ax.plot(np.arange(10), np.arange(10)*i)

Set a grid and clear the axis for a prettier plot

In [10]:
fig, ax = plt.subplots(1, 1)
ax.set_prop_cycle(monochrome)
ax.grid(True)
for spine in ax.spines.values():
    spine.set_visible(False)
for i in range(1, 5):
    ax.plot(np.arange(10), np.arange(10)*i)

Override styles for current script

Writing all the ax.grid and spine code for every figure is tedious. We can tell matplotlib to apply a particular style to all new figures.

All styles are saved in a dictionary in plt.rcParams. We can override its values manually for a single script and will do this now. You can also save your styles manually to a .mplstyle-file and load them at will. See Customizing plots with stylesheets.

You can load custom and builtin styles at will using the plt.style.use function. You can even load and combine several styles.
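As a short sketch of that (the custom file name below is hypothetical, everything else is stock matplotlib):

```python
import matplotlib.pyplot as plt

# Styles that ship with matplotlib
print(sorted(plt.style.available)[:3])

# Apply a builtin style to all subsequent figures
plt.style.use('grayscale')

# Several styles can be combined; later entries override earlier ones.
# 'my_thesis.mplstyle' is a hypothetical local style file.
# plt.style.use(['grayscale', 'my_thesis.mplstyle'])
```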

Below we will just override entries in the rcParams dictionary manually, so that this notebook is not dependent on external files.

In [11]:
# Overriding styles for current script
plt.rcParams['axes.grid'] = True
plt.rcParams['axes.prop_cycle'] = monochrome
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.bottom'] = False
plt.rcParams['axes.spines.left'] = False

Bar plots

In [12]:
fig, ax = plt.subplots(1, 1)

for x in range(1, 5):
    ax.bar(x, np.random.randint(2, 10))

Now there are 3 problems with this barplot:

  1. The bars are colored
  2. The bars cannot be distinguished
  3. The grid is above the bars (will become a big problem when 1 is solved)

We will color all the bars white and keep the black border. To distinguish the bars using only monochrome colors, we will paint them with hatches (repeating patterns). To place the bars in front of the grid, we will set their zorder to something high.

More on hatches can be found in the matplotlib documentation.

In [13]:
fig, ax = plt.subplots(1,1)

bar_cycle = (cycler('hatch', ['///', '--', '...','\///', 'xxx', '\\\\']) * cycler('color', 'w')*cycler('zorder', [10]))
styles = bar_cycle()

for x in range(1, 5):
    ax.bar(x, np.random.randint(2, 10), **next(styles))

Further reading / sources


Algorithms' sensitivity to a single salient dimension

Posted on Fri 23 January 2015 in Notebooks

Sensitivity to a single salient dimension

How different classifiers manage to sort through noise in multidimensional data

In this experiment I will test different machine learning algorithms' sensitivity to data where only one dimension is salient and the rest are pure noise. The experiment tests variations of saliency against a number of dimensions of random noise to see which algorithms are good at sorting out the noise.
For the experiments performed here, there will be a 1-1 mapping between the target class in $y$ and the value of the first dimension of a datapoint in $x$. For example, for all datapoints belonging to class 1, the first dimension will have the value 1, while if the datapoint belongs to class 0, the first dimension will have the value 0.

In [10]:
#Configure matplotlib
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

#comment out this line to produce plots in a separate window
%matplotlib inline


First the target vectors $y$ and $y_4$ are randomly populated, with $y\in\{0,1\}$ and $y_4\in\{0,1,2,3\}$. Then, for each value in $y$ and $y_4$, a datapoint is generated consisting of the value of the target class followed by 100 random values. This way the first column of the data matrix is equal to the target vector. Later this column will be manipulated linearly.

In [11]:
#initialize data

def generate_data():
    ''' Populates the data matrices and target vectors with data and releases them into the global namespace.
        Running this function will reset the values of x, y, x4, y4 and r '''
    global x, y, x4, y4, r
    y = np.random.randint(2, size=300)
    y4 = np.random.randint(4, size=300)
    r = np.random.rand(100, 300)
    x = np.vstack((y, r)).T
    x4 = np.vstack((y4, r*4)).T

generate_data()

# note that x and y are global variables. If you manipulate them, the latest manipulation of x and y
# will be used to generate plots.
split = 200
max_dim = x.shape[1]
m = 1
y_cor = range(m, max_dim)

print 'y is equal to 1st column of x:  \t', list(y) == list(x[:,0])
print 'y4 is equal to 1st column of x4:\t', list(y4) == list(x4[:,0])
print '\nChecking that none of the randomized data match the class values'
print 'min:\t', r.min(), 'max:\t',r.max(), '\tThese should never be [0,1], if so please rerun.'
y is equal to 1st column of x:  	True
y4 is equal to 1st column of x4:	True
min:	5.75157395193e-05 max:	0.999952558364 	These should never be [0,1]

Visualizing the data

This section will plot parts of the data to give the reader a better understanding of its shape.

In [12]:
plt.subplot(121)
plt.title('First 2 dimensions of dataset')
plt.plot(x[np.where(y==0)][:,1], x[np.where(y==0)][:,0], 'o', label='Class 0')
plt.plot(x[np.where(y==1)][:,1], x[np.where(y==1)][:,0], 'o', label='Class 1')
plt.ylim(-0.1, 1.1) #expand y-axis for better viewing
plt.legend()

plt.subplot(122, projection='3d')
plt.title('First 3 dimensions of dataset')
plt.plot(x[np.where(y==0)][:,2], x[np.where(y==0)][:,1], x[np.where(y==0)][:,0], 'o', label='Class 0')
plt.plot(x[np.where(y==1)][:,2], x[np.where(y==1)][:,1], x[np.where(y==1)][:,0], 'o', label='Class 1')
plt.legend()

A clear separation between the classes is revealed when visualized. This clear separation between the 2 classes remains no matter how many noisy dimensions we add to the dataset, so in theory it is reasonable to expect any linear classifier to find a hyperplane that separates the 2 classes.
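As a quick sanity check of that claim (a sketch with illustrative names, not part of the original notebook), a plain Perceptron separates freshly generated data of this shape even with 100 noise dimensions, since the weight vector $(1, 0, \dots, 0)$ with threshold 0.5 is a perfect separator and the perceptron convergence theorem applies:

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.RandomState(0)
y_demo = rng.randint(2, size=300)                   # class labels in {0, 1}
x_demo = np.vstack((y_demo, rng.rand(100, 300))).T  # 1 salient dim + 100 noise dims

# Train on the first 200 points, score on the held-out 100
clf = Perceptron()
clf.fit(x_demo[:200], y_demo[:200])
acc = clf.score(x_demo[200:], y_demo[200:])
```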

In [13]:
#Initialize classifiers

from sklearn.neighbors import KNeighborsClassifier as NN
from sklearn.svm import SVC as SVM
from sklearn.naive_bayes import MultinomialNB as NB
from sklearn.lda import LDA
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.linear_model import Perceptron as PT

classifiers = [NN(), NN(n_neighbors=2), SVM(), NB(), DT(), RF(), PT()]
titles = ['NN, k=5', 'NN, k=2', 'SVM', 'Naive B', 'D-Tree', 'R-forest', 'Perceptron']

# uncomment the following to add LDA
#classifiers = [NN(), NN(n_neighbors=2), SVM(), NB(), DT(), RF(), PT(), LDA()]
#titles = ['NN, k=5', 'NN, k=2', 'SVM', 'Naive B', 'D-Tree', 'R-forest', 'Perceptron', 'LDA']
#m, y_cor = 2, range(m, max_dim)
In [14]:
# define functions
def run(x, y):
    '''Runs the main experiment: fit each classifier on the training split and
       score it against varying dimensional subsets of a given dataset'''
    global score
    score = []
    for i, clf in enumerate(classifiers):
        score.append([])
        for j in range(m, max_dim):
            clf.fit(x[:split,:j], y[:split])
            score[i].append(clf.score(x[split:,:j], y[split:]))

def do_plot():
    ''' Generates the basic plot of results 
        Note:   Score is a global variable. The latest score calculated from run()
                will always be used to draw a plot '''
    for i, label in enumerate(titles):
        plt.plot(y_cor, score[i], label=label)
    plt.xlabel('Number of dimensions')
    plt.legend()

def double_plot():
    ''' Runs the experiment for 2 classes and 4 classes and draws the appropriate plots.
        Note:   x and y are global variables. The latest manipulation of these is
                always used to run the experiment. If you need the 'original' x and y
                you need to rerun generate_data() and use new randomized data '''
    plt.subplot(121)
    plt.title('Two classes')
    run(x, y)
    do_plot()
    plt.plot([0,100], [0.5,0.5], 'k--') #add baseline
    plt.subplot(122)
    plt.title('Four classes')
    run(x4, y4)
    do_plot()
    plt.plot([0,100], [0.25,0.25], 'k--') #add baseline

Experiment 1

Test all classifiers against 2-class and 4-class datasets for 1 through 100 dimensions. Notice that Naive Bayes fails when the dataset is literally equal to the targets, but adding just a little bit of noise makes it work much better.

For 4 classes, Naive Bayes again starts off poorly, but while the other algorithms quickly succumb to the noisy dimensions, Naive Bayes seems to improve up until ~20 dimensions, and though its performance starts to decline, it is still the best performer from there on out. If you are running the experiment with LDA, then the test will not be done for 1 dimension, and Naive Bayes' weakest point won't show.

The Decision Tree, however, is quick to find that one dimension explains everything, and has no trouble throughout either experiment. The random forest has some trees where the salient dimension has been cut off, so more noise and randomness is added to its results.

In [15]:
double_plot()

Experiment 2

In this experiment the first column of the data matrix is linearly manipulated in order to "hide" the values that map to the classes better amongst the noise. For the 2-class experiment the value 0.25 now maps to class 0 and 0.75 maps to class 1. For the 4-class experiment the value:class mapping is now 1:0, 1.5:1, 2:2, 2.5:3. This does not change the fact that there is a clear boundary between the classes. It just means the distance between the two planes seen in the visualization section is getting narrower.

In [16]:
x[:,0] = (x[:,0]/2)+0.25
x4[:,0] = (x4[:,0]/2)+1
double_plot()


This experiment is quite sensitive to the randomness in the data. For two classes, the SVM is generally the strongest until around the 40-dimension mark, where Naive Bayes takes over. In the higher-dimensional region, NN often manages to overtake SVM, though this is somewhat dependent on the random data. That is still surprising, given that NN is usually the poster child for the curse of dimensionality. It is not that easy to hide a linear explanation from the Decision Tree, which clearly outperforms everything here. Its tendency to overfit is really helping.

Baseline test (random only)

In this final experiment the only salient dimension in the observation data is removed to show the reader that this attains baseline results. Also notice that, depending on the data, the baseline for some of the algorithms can be as high as 60% accuracy. Keep this in mind when reviewing the results above.

In [17]:
x = x[:,1:]
x4 = x4[:,1:]
double_plot()