CredoDB

A Structural Interactomics Database For Drug Discovery.

CIFStore: Document Store for mmCIF Files

CIFStore is an application for creating and managing a database of CIF files without hassle. It uses MongoDB, an open-source, non-relational database system. mmCIF files from the PDB are difficult to store in a classical relational database because the schema is not easy to keep up to date and the data is often quite messy. MongoDB, as a document-oriented database, stores whole documents (obviously) instead of tables with columns. The great advantage here is that it does not require a predefined schema with table and column definitions - in addition, CIF files can easily be converted into JSON documents, the format MongoDB works with.

So what does CIFStore do? The application parses CIF files into JSON documents and inserts them into the MongoDB database. By default, a new collection is created for each directory, and indexes are created automatically if they are defined in the configuration. The example configuration is ready to create collections for PDB structure and chemical component CIF files.
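To make the CIF-to-JSON idea concrete, here is a minimal sketch of how simple `_category.attribute value` items could be folded into a nested document. It is not CIFStore's actual parser: the function name is made up, the CIF fragment is an invented example, and `loop_` blocks and multi-line values are deliberately skipped.

```python
import json
import shlex


def cif_items_to_document(cif_text):
    """Convert simple `_category.attribute value` lines into a nested dict.

    Toy converter: data_ headers, loop_ blocks and multi-line
    (semicolon-delimited) values are ignored; a real mmCIF parser
    has to handle all of them.
    """
    document = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip headers, loop rows, comments and blank lines
        tokens = shlex.split(line)  # keeps quoted values such as 'X-RAY DIFFRACTION' together
        if len(tokens) != 2:
            continue  # e.g. column names inside a loop_ block
        name, value = tokens
        category, _, attribute = name[1:].partition(".")
        document.setdefault(category, {})[attribute] = value
    return document


# Invented fragment, for illustration only.
example = """data_1ABC
_entry.id 1ABC
_exptl.method 'X-RAY DIFFRACTION'
"""

document = cif_items_to_document(example)
print(json.dumps(document, indent=2))
```

The category becomes the top-level key and each attribute a nested key, which is why queries can later use dotted names that mirror the CIF format.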

The structures collection containing mmCIF PDB structures can be created with:

Create a MongoDB collection with CIFStore
$ python cifstore.py db add --collection=structures --strict --debug --clean

Querying the database with Python

MongoDB has drivers for a range of programming languages. The default language is JavaScript, which should make it very easy to access CIFStore directly from within a web application.

Accessing CIFStore in Python is straightforward with PyMongo (the driver).

Launching queries is also very easy. The following query ‘finds’ the entry 2P33 and returns the comp_id of all non-polymers. Note how the queried (nested) attributes directly correspond to the CIF format (e.g. pdbx_entity_nonpoly.comp_id).
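A query of this kind could look roughly like the sketch below. The connection URI, the database name `cifstore` and the exact document layout (one key per CIF category, with `entry.id` holding the PDB code) are assumptions made for illustration; only `MongoClient` and `find_one` are real PyMongo calls.

```python
def get_collection(uri="mongodb://localhost:27017", database="cifstore",
                   collection="structures"):
    """Return a PyMongo collection handle (URI and names are assumed defaults)."""
    from pymongo import MongoClient  # imported lazily so the sketch loads without PyMongo

    return MongoClient(uri)[database][collection]


def nonpolymer_query(pdb_id):
    """Build the filter and projection for the non-polymer comp_ids of one entry."""
    query = {"entry.id": pdb_id.upper()}                       # assumed location of the PDB code
    projection = {"_id": 0, "pdbx_entity_nonpoly.comp_id": 1}  # return only the comp_ids
    return query, projection


def find_nonpolymers(collection, pdb_id):
    """Run the query on a PyMongo collection (or any object offering find_one)."""
    query, projection = nonpolymer_query(pdb_id)
    return collection.find_one(query, projection)


# Usage against a running MongoDB instance:
# structures = get_collection()
# print(find_nonpolymers(structures, "2P33"))
```

Keeping the filter and projection in a separate function makes them easy to inspect or reuse without a database connection.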

The next query finds all entries that contain the non-polymer STI (Imatinib) and returns UniProt accessions of all the polymer entities in the structure (if any).
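As a hedged sketch, such a query could be put together as follows. It assumes that the documents keep the mmCIF `struct_ref` category with its `db_name` and `pdbx_db_accession` items, and that UniProt references are marked with `db_name` "UNP"; the helper names and the sample document are invented.

```python
def ligand_query(comp_id):
    """Build the filter and projection for entries containing a given non-polymer."""
    query = {"pdbx_entity_nonpoly.comp_id": comp_id.upper()}
    projection = {
        "_id": 0,
        "entry.id": 1,
        "struct_ref.db_name": 1,
        "struct_ref.pdbx_db_accession": 1,
    }
    return query, projection


def uniprot_accessions(document):
    """Collect the UniProt accessions from one returned document, if any."""
    refs = document.get("struct_ref", [])
    if isinstance(refs, dict):  # tolerate a single reference stored as one sub-document
        refs = [refs]
    return sorted({
        ref["pdbx_db_accession"]
        for ref in refs
        if ref.get("db_name") == "UNP" and "pdbx_db_accession" in ref
    })


# Usage with a PyMongo collection called `structures` (assumed to exist):
# query, projection = ligand_query("STI")
# for doc in structures.find(query, projection):
#     print(doc["entry"]["id"], uniprot_accessions(doc))

# Invented document, for illustration only.
sample = {
    "entry": {"id": "XXXX"},
    "struct_ref": [
        {"db_name": "UNP", "pdbx_db_accession": "P00000"},
        {"db_name": "PDB", "pdbx_db_accession": "XXXX"},
    ],
}
print(uniprot_accessions(sample))
```

Entries without any UniProt cross-reference simply yield an empty list, which covers the "if any" case.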

Much more sophisticated queries are possible as well, including MapReduce. Please refer to the MongoDB/PyMongo documentation for more information.

Obtaining CIFStore

CIFStore is released under the MIT license and a development version can be downloaded from Bitbucket.

Mission Accomplished: All PDB Entries in CREDO Now

I finally added the last batch of structures to the database last week. This is also a very good opportunity to introduce the new blog. CREDO now contains 75,373 PDB entries with 86,257 biological assemblies. In terms of structural interactions, it holds 222,858 protein-protein interfaces, 92,046 ligand binding sites (at least 7 heavy atoms and in contact with at least 7 residues) and finally 11,545 protein-oligonucleotide grooves. All these interactions are based on a whopping 904,730,195 interatomic contacts.

All entities in CREDO are also mapped to ChEMBL where possible. The database contains 2,020 unique ChEMBL drug targets, and 18 unique protein therapeutics. 11,059 chemical components could be mapped to ChEMBL as well.

During the last couple of weeks I also completely refactored the various database generation scripts into a single command-line application called credovi, which is also capable of generating contact data in various formats (.csv, .xlsx, .json) and PyMOL visualisation scripts for in-house data. I am now spending more time on credimus, the CREDO web interface - but more on that in a future blog post.

Speeding Up Fingerprint Similarity Searches in the OpenEye PostgreSQL Extension With GIST Indexes

PostgreSQL has a unique feature, namely support for GIST (Generalized Search Tree) indexes, which, unlike other index types, are completely agnostic about the data type or the queries being used. The GIST API requires only seven functions to define a new index for a data type, and implementing those for binary fingerprints was pretty straightforward, at least after spending a whole weekend on it. Prior to this, however, I created a new data type for OpenEye fingerprints in PostgreSQL and added all the other functions (and operators) necessary for screening, i.e. input/output methods, fingerprint-generating functions (Path, Circular, Tree, MACCS166) and all GraphSim TK similarity metrics (Cosine, Dice, Euclidean, Manhattan, Tanimoto, Tversky).
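For readers unfamiliar with fingerprint screening, the sketch below shows two of the metrics mentioned above on binary fingerprints held as Python integers, plus one common way a tree-style index can prune a subtree. It is an illustration only: it uses neither the OpenEye toolkits nor the extension's code, and the bound shown is a generic technique, not necessarily what the extension implements.

```python
def popcount(bits):
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(bits).count("1")


def tanimoto(a, b):
    """Tanimoto similarity: shared bits divided by bits set in either fingerprint."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def dice(a, b):
    """Dice similarity: twice the shared bits divided by the total bits set."""
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 0.0


def tanimoto_upper_bound(query, node_union):
    """Upper bound on the Tanimoto similarity of `query` to any fingerprint
    below an index node whose key is the bitwise OR of its children."""
    query_bits = popcount(query)
    return popcount(query & node_union) / query_bits if query_bits else 0.0


a, b = 0b001011, 0b000110
print(tanimoto(a, b), dice(a, b))

# A node covering a and b can be skipped if the bound is below the search threshold.
query = 0b110010
print(tanimoto_upper_bound(query, a | b))
```

The bound holds because a child can share at most the bits the query shares with the node's union, while the union of query and child has at least as many bits as the query itself.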

KNN-GIST Operators for the RDKit PostgreSQL Cartridge

The latest version of PostgreSQL (9.1) introduced an extension to GIST called KNN-GIST. The great advantage of KNN-GIST is that the rows to be returned are already sorted in the required order and the N-nearest neighbours of a query are already known (when used with LIMIT). I have now written the necessary code to add KNN-GIST ORDER BY operators to the RDKit PostgreSQL cartridge (currently only for the bfp data type). The two queries below together with their plans show the difference. In the first, the query is executed without KNN-GIST and the query plan shows that the rows returned by the index have to be sorted first in order to get the 10 most similar.

ChEMBLdb Integration

One of my goals for CREDO is to implement an (almost) seamless transition between the structural interactions in CREDO and the activity data from ChEMBLdb. Today I finished an extension for credoscript that extends the ORM to include the ChEMBL database schema and a lot of other goodies. The ChEMBL extension is not stand-alone and will work on neither MySQL nor Oracle. There is, however, pychembl, which works with the former. Here are a couple of code examples:

Protein-ligand Complexes and Buried Surface Areas

The buried accessible surface area of a protein-ligand complex is known to be linked to thermodynamic parameters. Olsson et al., for example, investigated the correlation between the change in solvation and thermodynamic properties of protein-ligand complexes. One observation was that synthetic ligands, compared to biological ligands, gain affinity mostly through more favourable entropy changes upon binding, i.e. burial of apolar surface area (hydrophobic interactions). Endogenous ligands, on the other hand, normally gain more affinity (which can vary a lot) through polar interactions, whereas synthetic, exogenous ligands are much more limited in this respect because of severe ADME constraints. The relationship between binding and buried surface area is also very relevant in the context of the disruption of protein-protein interactions, where synthetic, comparatively small molecules have to compete with large polypeptide chains that form many polar interactions.
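For reference, one common way to define the buried surface area is as the solvent-accessible surface area (SASA) that disappears when the two partners form a complex. The symbols below are chosen for illustration and are not taken from the post or from the cited study, which may use a different convention (for example, per-partner or apolar-only areas).

```latex
\Delta\mathrm{SASA}_{\text{buried}}
  = \mathrm{SASA}_{\text{protein}}
  + \mathrm{SASA}_{\text{ligand}}
  - \mathrm{SASA}_{\text{complex}}
```

Here the first two terms are computed for the isolated protein and ligand in their bound conformations, and the last term for the complex as a whole.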

Source Code for PostgreSQL Eigen Extension Now Publicly Available

The source code for the PostgreSQL Eigen extension is now publicly available on Bitbucket. The extension requires Eigen 3.x and PostgreSQL 9.x. It is at a fairly early stage and does not come with documentation, but if you are familiar with extending PostgreSQL and the Eigen template library then the code should be pretty self-explanatory and easy to extend. The source code of the pgeigen extension is released under the MIT license.

ChEMBL10 Database Schema for PostgreSQL

I have exported the schema of our PostgreSQL ChEMBL10 database, which can be viewed or downloaded from here. There is also a complete binary dump that I want to put into the public domain once I have found a suitable host - if you know one, please let me know.

Poor Man’s LINGO With the PostgreSQL pg_trgm Extension

LINGO is a method used in cheminformatics to compare molecules with the help of their canonical SMILES strings. The algorithm works by fragmenting SMILES strings into overlapping substrings of a defined size. The resulting LINGO profile can then be compared with others to determine chemical similarity or even physicochemical properties. Interestingly, there is an extension for PostgreSQL called pg_trgm that provides functions and operators for determining the similarity of text based on trigram matching. More importantly, it also comes with different index operator classes, including the latest KNN-GIST, to speed up similarity searches. This approach is not as sophisticated as other methods that, for example, set all ring numbers in a SMILES string to 1, but the results are nevertheless very promising. The clear advantage is that cheminformatics routines are not required and that it is extremely fast - the query shown above takes less than 200ms to return the top 20 hits out of more than 650,000 SMILES strings on a slow database server.
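The fragmentation step and a trigram-style comparison can be sketched in a few lines of Python. This is a simplified illustration, not the published LINGO similarity and not pg_trgm's implementation: the function names are made up, the default length of 4 is the value usually quoted for LINGO, and pg_trgm additionally normalises and pads its input (it ignores non-alphanumeric characters, for instance), so its scores will differ.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Return the multiset of overlapping substrings of length q."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def qgram_similarity(smiles_a, smiles_b, q=3):
    """Jaccard similarity of the two sets of q-grams (q=3 gives plain trigrams)."""
    grams_a = set(lingos(smiles_a, q))
    grams_b = set(lingos(smiles_b, q))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(lingos("c1ccccc1"))                # LINGO profile of benzene
print(qgram_similarity("CCCO", "CCCN"))  # the two strings share one of three trigrams
```

Because only string operations are involved, the same comparison can be pushed into the database and accelerated by an index, which is what makes the pg_trgm route attractive.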

The hits from the query using the ChEMBL database are shown below using the new SVG functionality in Open Babel.