Signals Blog

Foundational Cheminformatics Software and Why Chemistry May Need Something Better

Chemistry is peculiar among scientific fields due to its use of the molecule as an organizing principle. This gives chemistry an expressive, mostly regular, often graphical language at the core of the discipline. Although apparently simple on the surface, the concept of the molecule is full of subtleties that can take years to appreciate. Chemists in different subdisciplines (e.g., analytical, organic, bioorganic) may find it hard to communicate at times because their common molecular language blinds them to critical nuances.

Computers know nothing about this language of molecules that chemists use. Developing even the most rudimentary chemistry application requires a foundational cheminformatics software layer that makes molecular concepts easy to use within a programming environment.

Software developers need better foundational tools to create applications that will empower chemists to make new discoveries. This article describes one approach.

Toolkits, Oh We Have Toolkits

Most of the time, foundational cheminformatics software is referred to as a 'cheminformatics toolkit'. The purpose of this software is to provide something akin to a domain-specific language that defines and operates on molecular structure representations. The toolkit makes it possible to develop a chemistry application in a given programming environment. No toolkit, no application.
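To make "defines and operates on molecular structure representations" a little more concrete, here is a minimal sketch in Java. It is purely illustrative: the class `MoleculeSketch` and all of its methods are hypothetical names invented for this example, not the API of any real toolkit. It stores a molecule as a graph of atoms and bonds and answers two simple questions about it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, minimal molecule model (an assumption for illustration only).
final class MoleculeSketch {
    // Each atom is stored as its element symbol; the list index is the atom's id.
    private final List<String> atoms = new ArrayList<>();
    // Each bond is stored as {atomIndexA, atomIndexB, bondOrder}.
    private final List<int[]> bonds = new ArrayList<>();

    int addAtom(String symbol) {
        atoms.add(symbol);
        return atoms.size() - 1;
    }

    void addBond(int a, int b, int order) {
        if (a == b || a < 0 || b < 0 || a >= atoms.size() || b >= atoms.size()) {
            throw new IllegalArgumentException("invalid atom indices: " + a + ", " + b);
        }
        bonds.add(new int[] {a, b, order});
    }

    // Number of explicit neighbors bonded to the given atom.
    int degree(int atom) {
        int count = 0;
        for (int[] bond : bonds) {
            if (bond[0] == atom || bond[1] == atom) {
                count++;
            }
        }
        return count;
    }

    // Counts explicit atoms per element, sorted alphabetically by symbol.
    Map<String, Integer> elementCounts() {
        Map<String, Integer> counts = new TreeMap<>();
        for (String symbol : atoms) {
            counts.merge(symbol, 1, Integer::sum);
        }
        return counts;
    }

    // Builds the heavy-atom skeleton of ethanol (C-C-O); hydrogens are left implicit.
    static MoleculeSketch ethanolSkeleton() {
        MoleculeSketch m = new MoleculeSketch();
        int c1 = m.addAtom("C");
        int c2 = m.addAtom("C");
        int o = m.addAtom("O");
        m.addBond(c1, c2, 1);
        m.addBond(c2, o, 1);
        return m;
    }

    public static void main(String[] args) {
        MoleculeSketch ethanol = ethanolSkeleton();
        System.out.println(ethanol.elementCounts()); // prints {C=2, O=1}
        System.out.println(ethanol.degree(1));       // prints 2
    }
}
```

Even this toy version hints at the subtleties: it ignores implicit hydrogens, charges, stereochemistry, and aromaticity, all of which a real toolkit has to model.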

Cheminformatics toolkits can be divided into roughly two categories: (1) those distributed by commercial organizations; and (2) those distributed by informal academic collaborations and nonprofits.

Lock-In

Some kinds of software are especially 'sticky'. Once you start using them, it's difficult to stop because they become so intertwined with the work you do or the products you distribute. There was a time when Microsoft Office was sticky because it was the only way to share documents. If you've ever developed a Web or desktop application, then you know how incredibly sticky Web frameworks and UI toolkits can be.

Cheminformatics toolkits are among the stickiest of software packages in chemistry. They alone stand between a programming environment and a lot of application code. As a result, a lot of code gets written against the toolkit's application programming interface (API). Coupled with the time and effort of development staff to learn a toolkit's quirks and nuances, it's not hard to see how an organization's investment in a cheminformatics toolkit can be enormous.

Lock-in is not necessarily a bad thing. But it can become a problem when the goals of the toolkit provider don't match your goals.

Commercial Vendor Lock-In

The advantage of using a toolkit provided by a traditional vendor is that the vendor takes care of the maintenance and upkeep of the toolkit. This lets your development staff focus on building applications.

The flip side of the vendor's tight control can be restrictions on how your application works or how it's distributed. For example, converting your successful desktop application into a Web application may require a major license renegotiation, new fees, and possibly unsavory restrictions on distribution. Similar considerations would apply to taking an application in a mobile direction, or rebuilding the toolkit to run in a hardware or software environment - or programming language - not supported by the vendor. Likewise, an API that wasn't sufficiently flexible would require waiting (and hoping) for the vendor to upgrade.

Limitations on redistribution, recompilation, and modification are some of the tradeoffs made by application developers who base their products on traditionally licensed commercial cheminformatics toolkits.

Open Source

Given that toolkit lock-in is inevitable when developing chemistry software, some have opted to use open source options.

However, unlike with traditionally licensed cheminformatics software, nobody is obligated to respond to bug reports, feature requests, or gaps in documentation. Furthermore, open source licenses vary significantly in the freedoms allowed to application developers. The GPL and LGPL in particular both attach conditions to redistribution (the GPL, for example, requires derivative works to be distributed under the same license) that may ultimately be bad for a commercial vendor's business. Compounding the problem is the fact that sometimes copyright to the source code resides not with a single entity with whom better terms can be negotiated, but with many individual contributors - some of whom may no longer even be traceable.

The problem is that the organizations or individuals behind open source cheminformatics toolkits are poorly equipped to deal with the demands of a commercial organization developing products in a competitive environment.

Why We Chose None Of The Above

At Metamolecular, we've developed our own cheminformatics toolkit rather than be subject to vendor or open source lock-in. Our two products, ChemWriter and ChemVector, both rely on proprietary cheminformatics toolkits we've developed and refined over the last few years.

In addition to these activities, we've worked with other companies to build cheminformatics systems and services. Each of these projects has helped refine our cheminformatics toolkit by forcing it to overcome real-world problems.

We've also built a number of concept products that have further improved our understanding of what a toolkit needs to do and how it needs to do it.

Maintaining and growing our own cheminformatics toolkit has enabled us to achieve goals that neither an open source nor traditional commercial vendor arrangement would have allowed.

Consolidation

Our products and services have been written in two main languages: Java and JavaScript. These codebases have remained separate due to the fundamentally different nature of each language and its associated runtime environment.

But in the last few months, it's become apparent that this can't continue. For example, some ChemWriter customers have requested SMILES input and output. The technical challenges of maintaining separate codebases in Java and JavaScript for such code made the request impossible to fill in a cost-effective way.
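To give a sense of what even basic SMILES input involves, here is a small sketch of a reader for a deliberately tiny subset of the notation: unbracketed atoms, the bond symbols `-`, `=`, and `#`, and parenthesized branches. It is not how ChemWriter handles SMILES; the class `TinySmilesReader` and everything in it are hypothetical names made up for this illustration. Rings, aromatic atoms, bracket atoms, charges, and stereochemistry are all left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical reader for a tiny SMILES subset (illustration only).
final class TinySmilesReader {
    final List<String> atoms = new ArrayList<>();
    final List<int[]> bonds = new ArrayList<>(); // each entry is {from, to, order}

    static TinySmilesReader parse(String smiles) {
        TinySmilesReader result = new TinySmilesReader();
        Deque<Integer> branchPoints = new ArrayDeque<>();
        int previous = -1; // index of the atom the next atom will bond to
        int order = 1;     // bond order to use for the next bond
        int i = 0;
        while (i < smiles.length()) {
            char c = smiles.charAt(i);
            if (c == '(') {
                if (previous < 0) {
                    throw new IllegalArgumentException("branch before first atom");
                }
                branchPoints.push(previous); // remember where the branch started
                i++;
            } else if (c == ')') {
                if (branchPoints.isEmpty()) {
                    throw new IllegalArgumentException("unmatched ')' at position " + i);
                }
                previous = branchPoints.pop(); // return to the branch point
                i++;
            } else if (c == '-') {
                order = 1;
                i++;
            } else if (c == '=') {
                order = 2;
                i++;
            } else if (c == '#') {
                order = 3;
                i++;
            } else {
                String symbol = readSymbol(smiles, i);
                int current = result.atoms.size();
                result.atoms.add(symbol);
                if (previous >= 0) {
                    result.bonds.add(new int[] {previous, current, order});
                }
                previous = current;
                order = 1; // bonds default to single unless a symbol says otherwise
                i += symbol.length();
            }
        }
        if (!branchPoints.isEmpty()) {
            throw new IllegalArgumentException("unclosed '(' in " + smiles);
        }
        return result;
    }

    // Reads one unbracketed element symbol starting at position i.
    private static String readSymbol(String s, int i) {
        if (s.startsWith("Cl", i) || s.startsWith("Br", i)) {
            return s.substring(i, i + 2);
        }
        char c = s.charAt(i);
        if ("BCNOPSFI".indexOf(c) >= 0) {
            return String.valueOf(c);
        }
        throw new IllegalArgumentException("unsupported character '" + c + "' at position " + i);
    }

    public static void main(String[] args) {
        TinySmilesReader aceticAcid = parse("CC(=O)O");
        System.out.println(aceticAcid.atoms);        // prints [C, C, O, O]
        System.out.println(aceticAcid.bonds.size()); // prints 3
    }
}
```

A production-quality reader and writer has to cover the parts omitted here, and keeping two such implementations in step across Java and JavaScript is where the maintenance cost comes from.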

Fortunately, we've found a solution that will enable us to maintain a single Java-based codebase for all of our toolkit functionality. The key to making this work is a new Java-to-JavaScript source translation library that was developed in-house and now works reliably.

We are now consolidating all code essential for a modern cheminformatics platform into a single Java codebase. JavaScript will be generated for products like ChemWriter and ChemVector through source code translation. Other uses will also be possible given some of the tools that have been recently developed to cross-compile Java to fast native binaries and other runtimes and platforms.

ChemCore™

Our cheminformatics platform is called ChemCore. Its foundation consists of a chemistry model designed with the future in mind. Around this has been added a layer of efficient and flexible graph operations:

Higher-level capabilities already include, or will include:

Commercial Open Source

Given the above survey of the cheminformatics software landscape and the trouble we had with the current options, is it possible we aren't alone?

Had a commercial software vendor offered a platform like ChemCore under an open source license and with first-rate technical support, we would have used that option. But this solution may not be attractive to many others.

Does the world need a better cheminformatics toolkit offering greater flexibility to application developers combined with dedicated technical support? Or are the existing options working well enough?

If you'd like to discuss these questions, please feel free to drop me a line.