A Critique of

Elisabeth Gregori-Puigjané, Rut Garriga-Sust, Jordi Mestres

Indexing Molecules with Chemical Graph Identifiers

Journal of Computational Chemistry, 32, 2011, 2638-46

 

In the September 2011 issue of JCC (open access here), Mestres and his coworkers published a paper on a novel method to uniquely identify and index chemical structures based on an invariant of the structure graph.

The method involves the computation of a floating-point number from graph elements, without any analysis of precision or numerical stability - but this fundamental design issue is not the point of this page. Rather, this page discusses how the paper presents examples obtained with competing methods.
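As a brief aside on why such an analysis matters, here is a minimal sketch. It assumes nothing about the actual formula in the paper: the three contribution values are invented, and the only point is that plain double-precision accumulation can depend on the order in which terms (for example, atoms) are visited.

```python
import math

# Hypothetical per-atom contributions; NOT the formula used in the paper.
contributions = [0.1, 0.2, 0.3]


def accumulate(values):
    """Add the values left to right in double precision."""
    total = 0.0
    for value in values:
        total += value
    return total


forward = accumulate(contributions)
backward = accumulate(list(reversed(contributions)))

# The same terms, visited in a different order, give different doubles.
print(forward == backward)

# An exactly rounded summation removes the order dependence for this case.
print(math.fsum(contributions) == math.fsum(reversed(contributions)))
```

If an identifier is derived from such a sum, an atom renumbering can in principle change its last bits unless the accumulation is made order-independent or the result is rounded with a justified tolerance.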

To showcase results obtained by their software, the authors use a standard test data set from the old NCI cancer database (ref [33] in the paper, which is still available in the same version here). In the paper, the authors' software is compared with two popular structure key generators: Cactvs hashcodes (ref [23]) by Xemistry GmbH and NIST InChI keys (ref [25]). Both are indeed important: InChIKeys are, for example, used extensively in Wikipedia articles and online databases, while Cactvs hashcodes are the primary key for PubChem and are also used in the in-house registration systems of large pharma companies.

A demonstration of critical flaws in these established algorithms that are not present in a new approach would therefore be a major selling point of a paper. The comparison is especially easy because both institutions provide free access to their software. NIST allows anybody to download the full source code. From Xemistry, there is a free academic version of the Cactvs toolkit, which academic groups like that of Mestres are very welcome to use, even for benchmarks and direct algorithmic comparisons. So, the authors had an opportunity to publish results of three software packages side by side, and indeed took that path. They obtained copies of the NIST and Xemistry software, and went to work.

In the paper, Mestres et al. prominently identify exactly three points or examples where in their opinion the Cactvs software fails or lacks features, while their approach supposedly yields superior results for these.

The problem is that none of these problems exist, neither in current software, nor in any software release over the past years, and of course such problems definitely do not exist in the software version used by the authors of the paper.

All these supposed problems were simply invented by the authors.

Since the authors properly identified the exact test data used in the paper, and academic/evaluation versions of the Cactvs toolkit are freely available for download, it is straightforward for anybody interested to verify this claim.

Here we go:

Problem 1: Claim on manuscript page 9, journal page 2646

"Just as a final remark, neither CACTVS nor InchIKey could generate the hash code for the biggest fragment in a chemical graph and thus a comparison scenario for graph ensembles could not be established."

This is simply untrue. Cactvs hashcodes of molecular ensembles are computed by mixing the hashcodes of fragments, so there is no ensemble hashcode without a full set of molecule/fragment hashcodes. This has even been described in detail in the cited reference. Of course the toolkit provides means to select a canonical primary fragment - both by explicitly scripted methods, and by a built-in default algorithm. Below is some sample code, interactively entered into a standard toolkit interpreter:

cactvs>ens create {C[N+](C)(C)C.[Cl-]}
ens0
cactvs>ens get ens0 E_HASHISY
FC94C3FDFFE0A30F
cactvs>ens get ens0 M_HASHISY
DEE19057637289A9 227553AA9C922AA6
cactvs>mol get ens0 1 M_HASHISY
DEE19057637289A9
cactvs>mol get ens0 2 M_HASHISY
227553AA9C922AA6
cactvs>mol get ens0 primary M_HASHISY
DEE19057637289A9
cactvs>ens create {C[N+](C)(C)C} 
ens1
cactvs>ens get ens1 E_HASHISY
DEE19057637289A9
cactvs>ens get ens1 M_HASHISY
DEE19057637289A9

This short sequence of commands computes the ensemble and molecule hashcodes of tetramethylammonium chloride, and selectively retrieves the fragment hashcode of the canonical primary fragment. That fragment is the tetramethylammonium part, as verified by displaying the ensemble and molecule hashcodes of that fragment created without the counterion. In selecting the primary fragment, fragment size is the most important criterion - others are automatically used for tie-breaking if necessary. There is an extensive manual for the toolkit which includes a section on handling molecular fragments, so this is not a hidden, undocumented feature.
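The relationship between the values in the session above can be checked independently of the toolkit. The sketch below is only an observation about this one example: the printed ensemble hashcode is consistent with a bitwise XOR of the two fragment hashcodes. Whether XOR is the general mixing rule is an assumption here; the cited reference is the authority. The `pick_primary` helper and its tie-break on the hashcode string are hypothetical placeholders, since the text only states that size comes first.

```python
from functools import reduce
from operator import xor


def mix_fragment_hashes(fragment_hashes):
    """XOR 64-bit hex hashcodes (assumed mixing rule, see lead-in).

    The result is independent of fragment order, and a single fragment
    maps to itself.
    """
    mixed = reduce(xor, (int(h, 16) for h in fragment_hashes), 0)
    return format(mixed, "016X")


def pick_primary(fragments):
    """Pick from (heavy_atom_count, hashcode) pairs; largest fragment wins.

    The secondary key (the hashcode string) is a made-up tie-break.
    """
    return max(fragments, key=lambda frag: (frag[0], frag[1]))[1]


# Hashcodes copied from the interpreter session above.
cation = "DEE19057637289A9"    # tetramethylammonium, 5 heavy atoms
chloride = "227553AA9C922AA6"  # chloride, 1 heavy atom

print(mix_fragment_hashes([cation, chloride]))  # FC94C3FDFFE0A30F
print(mix_fragment_hashes([cation]))            # DEE19057637289A9
print(pick_primary([(5, cation), (1, chloride)]))
```

The single-fragment case matters for the argument: the ensemble hashcode of the bare cation equals its molecule hashcode, exactly as in the `ens1` part of the session.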

Problem 2: Example in Figure 6a, manuscript page 8, journal page 2645

In order to follow this example, you will need the original records of NSC 224964 and NSC 266047. These are provided here for convenience as byte-exact copies of the records in question from the original datafile. Another important piece of information is that the NCI data file is supposed to be hydrogen-complete - so there should be no hydrogen addition, even if some structures in that data set are arguably in a questionable hydrogenation/protonation state. This is explicitly spelled out in the technical notes accompanying the file.

As can be seen from the published hashcode values, this was indeed handled correctly by the authors, at least in computing the Cactvs hashcodes. However, both depictions in the figure are incorrect. The table below has the original file data in the top row, and the second row shows the structure as drawn in the journal figure.

Figure 6a

                                 NSC 224964          NSC 266047
Structure in file (correct)      [image]             [image]
Drawn in journal (incorrect)     [image]             [image]
Hashcode                         94F1F58CB40BA98B    94F1F58CB40BA98B (identical)

It is important to recognize that the original structures do not differ in hydrogen count or total charge, but only in the placement of localized formal charges within a single pi system. The published Cactvs hashcode is correct. It is indeed identical for both structures, and was properly computed for the structure form without the spurious extra hydrogens. This can be reproduced easily:

cactvs>molfile read nsc224964.sdf 
ens0
cactvs>ens new ens0 E_HASHY
94F1F58CB40BA98B
cactvs>molfile read nsc266047.sdf
ens1
cactvs>ens new ens1 E_HASHY
94F1F58CB40BA98B

It is also correct that the InChI encoder yields different InChIKeys for these structures. However, on closer inspection, this does not really make sense. Fundamentally, the two structure forms with different charge patterns are no different from writing a nitro group alternatively in ionic or pentavalent form. Nobody would claim that these are different structures, and thus computing a common Cactvs hashcode for nitro group variants as well as for this structure pair is the right thing to do. It is noteworthy that in general the InChI software strives to ignore such charge distribution pattern differences, too, as shown here for the example of nitromethane:

cactvs>ens create {C[N+]([O-])=O}
ens0
cactvs>ens get ens0 E_HASHY
90F32D28E6B5D9E6
cactvs>ens get ens0 E_STDINCHIKEY
InChIKey=LYGJENNIWJXYER-UHFFFAOYSA-N
cactvs>ens create {CN(=O)=O}
ens1
cactvs>ens get ens1 E_HASHY
90F32D28E6B5D9E6
cactvs>ens get ens1 E_STDINCHIKEY
InChIKey=LYGJENNIWJXYER-UHFFFAOYSA-N

The only reasonable conclusion is therefore that for the structures in Figure 6a, the InChI processor failed - together with the authors' program. Among all tested software packages, Cactvs is actually the only one which got it right. Calling this an example of a "collision", as the authors do, is a bizarre statement.

Problem 3: Example in Figure 6d, manuscript page 8, journal page 2645

The final example is the most egregious misrepresentation.

Figure 6d

In order to analyze this example, we once more need the original records, this time for NSC 40295 and NSC 100317. The authors claim that they obtained different hashcodes for these two records which obviously encode the same compound, just with different atom ordering, and different 3D coordinates.

To get to the point directly, let's check this:

cactvs>molfile read nsc40295.sdf
ens0
cactvs>ens new ens0 E_HASHISY
88D0A4468C580DA3
cactvs>molfile read nsc100317.sdf
ens1
cactvs>ens new ens1 E_HASHISY
88D0A4468C580DA3

The hashcode for the first compound looks familiar - this is the same as in the paper. But the hashcode for the structure in the second file, which in the paper is claimed to be different for a structure with an identical 3D configuration - a really severe algorithmic fault indeed if verified - turns out to be identical after all, and is not the one found in the paper. There is no verifiable bug in the Cactvs software here, and there really is no way to reproduce the published problem with the original data file as referenced in the paper.

Unless someone, in an underhand operation not documented at all, tampers with the file in such a way that the records end up with a syntactically illegal Molfile header line, carrying an explicit dimension code which is neither 2D, 3D, nor 0D, nor anything else the software knows about. In that case, it was possible to trick the software into a false classification of the input dimensionality, and if the 3D data in the file is read as 2D, ignoring the z coordinate column, one ends up with a different stereochemistry and from that with a different stereo hashcode.
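To make the header issue concrete, here is a minimal sketch of how a reader could extract and validate the dimension code. It assumes the usual Molfile layout, in which the second header line holds two characters of user initials, eight of program name, ten of date and time, and then the two-character dimensional code in columns 21-22. The set of accepted codes and the function names are illustrative only; this is not the Cactvs reader, and how that reader falls back on an unknown code is not shown.

```python
KNOWN_DIMENSION_CODES = {"0D", "2D", "3D"}  # illustrative set of accepted codes


def dimension_code(molblock):
    """Return columns 21-22 of the second header line ('' if missing)."""
    lines = molblock.splitlines()
    if len(lines) < 2:
        return ""
    return lines[1][20:22]


def classify(molblock):
    """Return the dimension code if it is known, else 'unrecognized'."""
    code = dimension_code(molblock).upper()
    return code if code in KNOWN_DIMENSION_CODES else "unrecognized"


def make_header(dim):
    """Build a three-line header with a placeholder program name and date."""
    second_line = "  " + "PROGNAME" + "0101110000" + dim
    return "\n".join(["title", second_line, "comment"])


print(classify(make_header("3D")))  # 3D
print(classify(make_header("XX")))  # unrecognized
```

A reader that inspects this field has to decide what to do with an illegal value. A reader that skips the header never faces that decision, and so cannot be tripped up by a tampered code, but it also cannot honor a legitimate 2D declaration.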

The authors' input routine and the standard InChI Molfile reader are much more limited in their ability to interpret the finer points of the SDF format than the reader in the Cactvs toolkit. They simply skip the header, without any attempt at analysis, and always read all three coordinate columns.
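The effect of the two reading strategies on stereochemistry can be illustrated with a small geometric sketch. The coordinates are invented, the atom lines follow the fixed-width V2000 convention (x, y and z in columns 1-10, 11-20 and 21-30), and the signed volume of three bond vectors stands in for a real stereo perception routine, which none of the programs discussed here is claimed to implement this way.

```python
def atom_line(x, y, z, symbol):
    """Format a V2000-style atom line with fixed-width coordinates."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0  0  0  0  0  0"


def parse_xyz(line, read_z=True):
    """Read coordinates; with read_z=False the z column is ignored (2D)."""
    x = float(line[0:10])
    y = float(line[10:20])
    z = float(line[20:30]) if read_z else 0.0
    return (x, y, z)


def signed_volume(center, a, b, c):
    """Triple product of the bond vectors from center to a, b and c."""
    u, v, w = (tuple(p[i] - center[i] for i in range(3)) for p in (a, b, c))
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


# Invented stereocenter: carbon at the origin with three neighbors.
lines = [
    atom_line(0.0, 0.0, 0.0, "C"),
    atom_line(1.0, 0.0, 0.0, "N"),
    atom_line(0.0, 1.0, 0.0, "O"),
    atom_line(0.0, 0.0, 1.0, "F"),
]

as_3d = signed_volume(*(parse_xyz(line) for line in lines))
as_2d = signed_volume(*(parse_xyz(line, read_z=False) for line in lines))

print(as_3d > 0)   # handedness is defined by the geometry
print(as_2d == 0)  # flattened: the geometry no longer encodes handedness
```

Once the z column is dropped, the coordinates alone no longer distinguish the two enantiomers, so any stereo descriptor derived from them, and any hashcode that includes stereochemistry, will differ from the one computed with the full 3D data.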

The gist of this input data modification is that the authors contrive a point with a corrupt input file, which their software does not even read properly, and then they conveniently neglect to mention how they butchered the original file to get this to work.

We think this type of undocumented data manipulation is absolutely incompatible with good scientific practice. And as a general view, we think using corrupt input data to make a point about algorithmic performance is really an ill-advised methodology.

The ultimate problem

We contacted the corresponding author Dr. Mestres (jmestres@imim.es) immediately after the publication became accessible, and asked for these points to be corrected. However, after an initial short period of constructive communication, during which we learned, for example, what happened during data preparation for the third problem, Dr. Mestres has now ceased responding to our emails and considers, as per his last message, the issue closed. He has refused to retract the paper, or issue a correction himself. He, in his own words, "does not think this is important".

We absolutely disagree with this stance. Publishing demonstrably incorrect information about a competing approach is clearly bad style. This is especially true when it is trivial to show that the reported results are not reproducible or obviously interpreted wrongly. Free access to our software was given under the assumption of its use for a fair and well-researched comparison. In combination with the undocumented input data manipulation for the third problem, this position is, in our view, incompatible with ethical publishing.

A letter to the editors of the Journal of Computational Chemistry is in preparation. Given the limited number of pages available for the publication of such letters, even if it were accepted, this Web page was set up to present a more detailed analysis of the misrepresentations in the paper. We also hope that it will help to prevent the flawed publication from being cited, unchecked, as an authoritative source for documented failures of our algorithms. The Cactvs hashcode structure keys are an important commercial product, and irresponsible publications such as the one by Mestres have the potential to be highly damaging.

 

Wolf-D. Ihlenfeldt, August 2011