Principal component analysis of a trajectory

Here we compute the principal component analysis of a trajectory.

Last updated: February 2020

Minimum version of MDAnalysis: 0.17.0

Packages required:

Optional packages for visualisation:

[1]:
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD
from MDAnalysis.analysis import pca, align
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nglview as nv
import warnings
# suppress some MDAnalysis warnings about writing PDB files
warnings.filterwarnings('ignore')
%matplotlib inline

Loading files

The test files we will be working with here feature adenylate kinase (AdK), a phosophotransferase enzyme. ([BDPW09]) The trajectory DCD samples a transition from a closed to an open conformation.

[2]:
u = mda.Universe(PSF, DCD)

Principal component analysis

Principal component analysis (PCA) is a statistical technique that decomposes a system of observations into linearly uncorrelated variables called principal components. These components are ordered so that the first principal component accounts for the largest variance in the data, and each following component accounts for lower and lower variance. PCA is often applied to molecular dynamics trajectories to extract the large-scale conformational motions or “essential dynamics” of a protein. The frame-by-frame conformational fluctuation can be considered a linear combination of the essential dynamics yielded by the PCA.

In MDAnalysis, the method is as follows:

  1. Optionally align each frame in your trajectory to the first frame.

  2. Construct a 3N x 3N covariance for the N atoms in your trajectory. Optionally, you can provide a mean; otherwise the covariance is to the averaged structure over the trajectory.

  3. Diagonalise the covariance matrix. The eigenvectors are the principal components, and their eigenvalues are the associated variance.

  4. Sort the eigenvalues so that the principal components are ordered by variance.

Note

Principal component analysis algorithms are deterministic, but the solutions are not unique. For example, you could easily change the sign of an eigenvector without altering the PCA. Different algorithms are likely to produce different answers, due to variations in implementation. MDAnalysis may not return the same values as another package.

Warning

For best results, your trajectory should be aligned on your atom group selection before you run the analysis. Setting align=True will not give correct results in the PCA.

[3]:
aligner = align.AlignTraj(u, u, select='backbone',
                          in_memory=True).run()

You can choose how many principal components to save from the analysis with n_components. The default value is None, which saves all of them. You can also pass a mean reference structure to be used in calculating the covariance matrix. With the default value of None, the covariance uses the mean coordinates of the trajectory.

[4]:
pc = pca.PCA(u, select='backbone',
             align=False, mean=None,
             n_components=None).run()

The principal components are saved in pc.p_components. If you kept all the components, you should have an array of shape \((n_{atoms}\times3, n_{atoms}\times3)\).

[5]:
backbone = u.select_atoms('backbone')
n_bb = len(backbone)
print('There are {} backbone atoms in the analysis'.format(n_bb))
print(pc.p_components.shape)
There are 855 backbone atoms in the analysis
(2565, 2565)

The variance of each principal component is in pc.variance. For example, to get the variance explained by the first principal component:

[6]:
pc.variance[0]
[6]:
4203.190260100008

This variance is somewhat meaningless by itself. It is much more intuitive to consider the variance of a principal component as a percentage of the total variance in the data. MDAnalysis also tracks the percentage cumulative variance in pc.cumulated_variance. As shown below, the first principal component contains 90.3% the total trajectory variance. The first three components combined account for 96.4% of the total variance.

[7]:
print(pc.cumulated_variance[0])
print(pc.cumulated_variance[2])
0.9033944836257074
0.9642450284513093
[8]:
plt.plot(pc.cumulated_variance[:10])
plt.xlabel('Principal component')
plt.ylabel('Cumulative variance');
../../../_images/examples_analysis_reduced_dimensions_pca_15_0.png

Visualising projections into a reduced dimensional space

The pc.transform() method transforms a given atom group into weights \(\mathbf{w}_i\) over each principal component \(i\).

\[\mathbf{w}_i(t) = (\mathbf{r}(t)-\mathbf{\overline{r}}) \cdot \mathbf{u}_i\]

\(\mathbf{r}(t)\) are the atom group coordinates at time \(t\), \(\mathbf{\overline{r}}\) are the mean coordinates used in the PCA, and \(\mathbf{u}_i\) is the \(i\)th principal component eigenvector \(\mathbf{u}\).

While the given atom group must have the same number of atoms that the principal components were calculated over, it does not have to be the same group.

Again, passing n_components=None will tranform your atom group over every component. Below, we limit the output to projections over 5 principal components only.

[9]:
transformed = pc.transform(backbone, n_components=5)
transformed.shape
[9]:
(98, 5)

The output has the shape (n_frames, n_components). For easier analysis and plotting we can turn the array into a DataFrame.

[10]:
df = pd.DataFrame(transformed,
                  columns=['PC{}'.format(i+1) for i in range(5)])
df['Time (ps)'] = df.index * u.trajectory.dt
df.head()
[10]:
PC1 PC2 PC3 PC4 PC5 Time (ps)
0 118.408403 29.088239 15.746624 -8.344273 -1.778052 0.0
1 115.561875 26.786799 14.652497 -6.621619 -0.629777 1.0
2 112.675611 25.038767 12.920275 -4.324424 -0.160540 2.0
3 110.341464 24.306985 11.427097 -3.891525 -0.173275 3.0
4 107.584299 23.464155 11.612103 -2.293161 -1.500821 4.0

There are several ways we can visualise the data. Using the Seaborn’s PairGrid tool is the quickest and easiest way, if you have seaborn already installed.

Note

You will need to install the data visualisation library Seaborn for this function.

[11]:
import seaborn as sns

g = sns.PairGrid(df, hue='Time (ps)',
                 palette=sns.color_palette('Oranges_d',
                                           n_colors=len(df)))
g.map(plt.scatter, marker='.')
[11]:
<seaborn.axisgrid.PairGrid at 0x12b94bac8>
../../../_images/examples_analysis_reduced_dimensions_pca_21_1.png

Another way to investigate the essential motions of the trajectory is to project the original trajectory onto each of the principal components, to visualise the motion of the principal component. The product of the weights \(\mathbf{w}_i(t)\) for principal component \(i\) with the eigenvector \(\mathbf{u}_i\) describes fluctuations around the mean on that axis, so the projected trajectory \(\mathbf{r}_i(t)\) is simply the fluctuations added onto the mean positions \(\mathbf{\overline{r}}\).

\[\mathbf{r}_i(t) = \mathbf{w}_i(t) \times \mathbf{u}_i + \mathbf{\overline{r}}\]

Below, we generate the projected coordinates of the first principal component. The mean positions are stored at pc.mean.

[12]:
pc1 = pc.p_components[:, 0]
trans1 = transformed[:, 0]
projected = np.outer(trans1, pc1) + pc.mean
coordinates = projected.reshape(len(trans1), -1, 3)

We can create a new universe from this to visualise the movement over the first principal component.

[13]:
proj1 = mda.Merge(backbone)
proj1.load_new(coordinates, order="fac")
[13]:
<Universe with 855 atoms>
[14]:
view = nv.show_mdanalysis(proj1.atoms)
view

If you have nglview installed, you can view the trajectory in the notebook. Otherwise, you can write the trajectory out to a file and use another program such as VMD. Below, we create a movie of the component.

[15]:
from nglview.contrib.movie import MovieMaker
movie = MovieMaker(view, output='pc1.gif', in_memory=True)
movie.make()

pc1 gif

References

[1] Oliver Beckstein, Elizabeth J. Denning, Juan R. Perilla, and Thomas B. Woolf. Zipping and Unzipping of Adenylate Kinase: Atomistic Insights into the Ensemble of Open↔Closed Transitions. Journal of Molecular Biology, 394(1):160–176, November 2009. 00107. URL: https://linkinghub.elsevier.com/retrieve/pii/S0022283609011164, doi:10.1016/j.jmb.2009.09.009.

[2] Richard J. Gowers, Max Linke, Jonathan Barnoud, Tyler J. E. Reddy, Manuel N. Melo, Sean L. Seyler, Jan Domański, David L. Dotson, Sébastien Buchoux, Ian M. Kenney, and Oliver Beckstein. MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. Proceedings of the 15th Python in Science Conference, pages 98–105, 2016. 00152. URL: https://conference.scipy.org/proceedings/scipy2016/oliver_beckstein.html, doi:10.25080/Majora-629e541a-00e.

[3] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10):2319–2327, July 2011. 00778. URL: http://doi.wiley.com/10.1002/jcc.21787, doi:10.1002/jcc.21787.