For Data Engineers

Tanimoto and Chemical Similarity in FeatureBase

NOTE: A revised version of this post has been published as of 6/20/23. It expands on the ideas contained here for application in text similarity functions for large language models.

As we test FeatureBase against various types of data (e.g., the Billion Taxi Ride Dataset), we decided to turn our next endeavor into a dual-purpose test. Specifically, we decided to test FeatureBase against the Chemical Similarity for ChemBL dataset while also testing FeatureBase'’s ability to run algorithms in its native core.

The notion of chemical similarity (or molecular similarity) is one of the most important concepts in cheminformatics. It plays an important role in modern approaches to predicting the properties of chemical compounds, designing chemicals with a predefined set of properties and, especially, in conducting drug design studies by screening large databases containing structures of available or potentially available chemicals. One commonly used algorithm to calculate these measures of similarity is the Tanimoto algorithm. The resulting Tanimoto coefficient is fingerprint-based, encoding each molecule to a fingerprint “bit” position, with each bit recording the presence (“1”) or absence (“0”) of a fragment in the molecule. Given the binary format, we determined it would be a perfect fit to test against FeatureBase.

In this post, we’ll look at the problem of finding chemical similarity and how, using FeatureBase, we can achieve performance that is 3x faster than existing state-of-the-art systems (see MongoDB, below) and 10x faster than our initial naive FeatureBaseimplementation. As solutions to the chemical similarity search problem have been attempted by other individuals using different technologies (tested on MongoDB by Matt Swan, or on [PostgreSQL] (http://blog.rguha.net/?p=1261) by Rajarshi Guha), we knew we would be able to compare our results to existing solutions.

Overview of Tanimoto

The Tanimoto algorithm states that A and B are sets of fingerprint “bits” within the fingerprints of molecule A and molecule B. AB is defined as the set of common bits of fingerprints of both molecule A and B. The resulting Tanimoto coefficient (or T(A,B)) ranges from 0, when the fingerprints have no bits in common, to 1, when the fingerprints are identical. Thus,

T(A,B) = (A ∩ B)/(A + B - A ∩ B)

The chemical similarity problem then becomes, Given molecule A, find all formulas that have a Tanimoto coefficient greater than a given threshold. The greater the value of a set threshold, the more similar the molecules are.

FeatureBase

To model data in FeatureBase, one must choose what the columns represent and what the rows represent. We chose to let each column represent a bit position ID within the digital fingerprint and each row represent a molecule ID.

FeatureBase stores information as a series of bits, and in this case one can use RDKit in Python to convert molecules from their SMILES encoding to Morgan fingerprints with 4,096 bits size, which are arrays of “on” bit positions. As a benchmarking measure, we also followed Matt Swain’s instructions to import the same data set into MongoDB with the same Morgan fingerprint size.

We were able create a data model in FeatureBase like so:

It took 21 minutes to export chembl_id and SMILES from SDF to csv, and took ~3 minutes to import 1,678,343 rows into FeatureBase on a MacBook Pro with a 2.8 GHz 2-core Intel Core i7 processor, memory of 16 GB 1600 MHz DDR3, single host cluster.

Naive Approach

The first approach we took after successfully loading the data into FeatureBase was rather naive. We created a Python script that used a FeatureBase client and sent requests to fetch the following values,

  • Count of A
  • Intersection A ∩ B
  • Count of B

The last two requests are batched requests for 100,000 rows. We then calculated the Tanimoto coefficient on the client.

While this approach is simple, its performance is suboptimal due to the network latency required in sending all these requests. Using this setup, FeatureBase didn’t compare favorably with MongoDB at all. While we beat MongoDB at lower thresholds, our performance didn’t improve much at higher thresholds. MongoDB has an advantage in that its aggregation framework can perform all calculation tasks at the server.

SCHEDULE A DEMO