This is Signals, a blog about chemistry and software by Metamolecular, LLC.

Substructure Search for Websites

May 22nd, 2013

Structure search is a frequent request among users of a website hosting chemical collections. If you run such a site and would like to make it much easier to find chemicals by exact structure and substructure, how would you do it?

This article presents a high-level discussion of the main problems and solutions for adding structure search capabilities to an existing website.

Content Management Systems

You may have used an off-the-shelf content management system (CMS) to build your site. These products are excellent for building general-purpose websites, but they were never designed for scientific applications like substructure search. Before you even begin, integration will be a major consideration.

CMSs differ greatly in their extensibility. For example, one popular ecommerce solution, Magento, exposes a plugin interface, with a large number of pre-built modules available. Other CMSs offer similar options. So one solution could be to buy or build a CMS plugin.

Another option might be to extend the CMS more directly. However, this approach is only viable if the CMS was released under an Open Source license and is under your direct control (not hosted by a third party).

Either option, a CMS plugin or direct extension, is likely to require deep integration at the database level. The easier a CMS makes it to alter its database and add new behaviors, the higher the chances of success.

To the extent that a complete site redesign is impractical, your initial choice of CMS will greatly constrain the options for implementing substructure search.

Substructure Search Systems

Substructure search systems are widely available. For collections of fewer than 100,000 structures, flexibility, not performance, will be the deciding factor. The importance of flexibility becomes most apparent when linking a substructure search system with a CMS on a server.

The software environments running popular CMSs are not generally the same ones used to develop substructure search systems. For example, many CMSs are written in PHP, a language that by itself is incompatible with the languages in which most substructure search software is written. A separate interface is required.

Substructure search systems have mainly been written in Java, C/C++, and Python. Although commercial products exist, Open Source alternatives will work well in many situations. They include:

One integration approach involves deploying the substructure search software into the database system via a “chemistry cartridge”. The CMS can then interact with the chemistry software through SQL, a well-supported interface. Alternatively, substructure search functionality can be exposed through a local network interface or other inter-process communication methods.
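The local network interface option can be sketched with nothing but the Python standard library. This is a minimal sketch, not a production design: the `/search?q=` route, the port, and the `search_fn` callable are all assumptions made for illustration, and `search_fn` stands in for whatever chemistry toolkit actually performs the substructure match.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


def build_response(path, search_fn):
    """Turn a request path such as /search?q=c1ccccc1 into (status, JSON body)."""
    parsed = urlparse(path)
    query = parse_qs(parsed.query).get("q", [""])[0]
    if parsed.path != "/search" or not query:
        return 400, json.dumps({"error": "expected /search?q=<structure>"})
    # search_fn is a hypothetical hook into the real search software.
    return 200, json.dumps({"query": query, "hits": search_fn(query)})


def make_handler(search_fn):
    """Build a request handler class bound to one search backend."""
    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = build_response(self.path, search_fn)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return SearchHandler


def serve(search_fn, port=8765):
    """Listen on localhost only, so that just the CMS on the same host can call in."""
    HTTPServer(("127.0.0.1", port), make_handler(search_fn)).serve_forever()
```

A PHP-based CMS could then request `http://127.0.0.1:8765/search?q=...` and decode the JSON list of catalog identifiers. Query structures would need to be URL-encoded by the caller, because line notations often contain characters such as `#` and `+`.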

A more detailed look at the implementation details of substructure search is available in a separate article.

Data Conversion

In addition to software requirements, implementing a substructure search system for your website will make new demands on your data.

Many chemical collections start out as lists of Chemical Abstracts Service (CAS) numbers and/or trivial/systematic names. These text identifiers by themselves are insufficient for substructure search. What’s needed is a way to convert these text identifiers into a machine-readable chemical structure representation.

The goal of your data migration will most likely be to generate an SD File. An SD File combines machine-readable chemical structures with data about those structures. For example, an SD File for a chemical supplier might associate a catalog number with quantity/pricing information for each structure.
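The layout of an SD File record is simple enough to sketch in a few lines. The example below assumes the molfile text has already been produced by a conversion step; the single-atom molfile and the field names `CATALOG_NUMBER` and `PRICE_USD` are placeholders invented for illustration.

```python
def sd_record(molfile, data):
    """Return one SD File record: the molfile, its data items, then the $$$$ delimiter."""
    lines = [molfile.rstrip("\n")]
    for name, value in data.items():
        lines.append(f"> <{name}>")  # data header line
        lines.append(str(value))     # data value
        lines.append("")             # a blank line ends each data item
    lines.append("$$$$")             # record delimiter
    return "\n".join(lines) + "\n"


# Placeholder molfile (a lone carbon atom), hand-written for this sketch.
PLACEHOLDER_MOLFILE = "\n".join([
    "methane",
    "",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

record = sd_record(PLACEHOLDER_MOLFILE, {"CATALOG_NUMBER": "A-1001", "PRICE_USD": "25.00"})
```

Concatenating one such record per catalog entry gives a complete SD File.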

The options for converting CAS numbers and names to chemical structures are much more limited than those for substructure search. The main problem in converting CAS numbers is that the only authoritative source for CAS-to-structure mappings is Chemical Abstracts Service itself. The expense of this service prevents many from using it. Provided the right steps are taken, however, large numbers of accurate CAS-to-structure conversions can be made automatically by using public datasets such as PubChem.

Systematic names can be converted to chemical structures by software. One free option that works very well in most cases is OPSIN. Commercial name to structure packages are also available.

Using name-to-structure software and possibly a dictionary derived from one or more public datasets, a spreadsheet of names and CAS numbers can be compiled into an SD File. Although the cleanliness of the spreadsheet data and the accuracy of the conversion process may be quite high, a manual check of the results is highly recommended.
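One inexpensive check that can run before any conversion is the CAS check digit, which catches many transcription errors in a spreadsheet. The sketch below only validates the format and checksum; it says nothing about whether a number is assigned to the compound named in the same row. The function name is an assumption made for illustration.

```python
import re


def is_valid_cas(cas):
    """Check the format and check digit of a CAS number such as '7732-18-5'."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left, then sum.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


rows = [("water", "7732-18-5"), ("water, mistyped", "7732-18-4")]
suspect = [name for name, cas in rows if not is_valid_cas(cas)]
```

Rows that fail the check are good candidates for the manual review recommended above.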

The result of this data conversion is an SD File that can be used to create a structure searchable database.

Putting It All Together

To get the most out of your substructure search system, it’s essential to develop a reasonably accurate picture of how people will use it.

A search typically begins with a chemist drawing a structure. How will your site make this possible? A chemical structure editor is the first interaction most chemists will have with your site. A number of browser-based editors are available. One of them, ChemWriter, is available from Metamolecular.

What happens after a chemist submits a search structure? Clearly, a list of chemicals will be returned, but in what format? Will you display structures and data together or just one of the two? Will you provide a link to a detail view for each result? If you’re hosting a chemical catalog, do you want to show customers a “similar products” sidebar?

Some applications will require more power than others. How important will advanced features be to your users? For example, similarity search might be very helpful in some situations, but less so in others. Likewise, query atom capability could mean the difference between a useful application and one that raises complaints.

These are just some examples of the things to consider when designing a structure search user interface.

Custom Site

Extending a general-purpose CMS with substructure search raises significant integration issues, and chemical structures are central to chemistry. Given both, another option might be to rebuild the site with integrated chemical structure handling capabilities.

Such an option makes the most sense when you know the site will be doing more than just substructure search. For example, you might “publish” new products directly to your catalog through a chemically-aware admin interface.

The biggest downside is, of course, the relatively high up-front cost. But being able to recapture these costs through increased revenue or cost reduction in other areas of the business might make a custom site with integrated chemical capabilities an attractive alternative.

Web Services

Web services offer yet another approach to integrating structure search. In this case, a remote web server would host your structure collection and the search software. Communication with your site would take place, not at the level of databases and CMS behavior, but at the level of the browser.

The main problems with this approach are the relatively poor quality integration provided by existing solutions and their inflexibility. As with a CMS extension, data conversion would still likely be a necessary step. But for projects on tight budgets, the web services route might be good enough.

Conclusions

Enabling substructure search on a website is possible, but not trivial. A key consideration in adapting an existing site is integration with third-party software. Even with all of the components in place, data conversion will likely be necessary. User interface considerations and workflow become most visible after the low-level work has been done, but should really be addressed up-front. A custom site may be the best option when all factors are considered.

Despite the costs and difficulties, adding substructure search can be a game-changer.

Metamolecular has worked with multiple companies seeking to build structure-searchable websites. If you’re interested in doing the same, please feel free to contact me for a free consultation.

Resources

How to Balance Any Chemical Equation

February 20th, 2013

The art of balancing chemical equations is taught very early in chemistry degree programs, and understandably so. Correctly balancing a chemical equation is the first step in a great number of chemistry problems including reaction setup, percentage yield determination, and equilibrium constant calculations, among others.

Although the most popular method, “balancing by inspection”, works in simple cases, a large number of exceptions and traps makes this method frustrating to learn and difficult to apply to even moderately complex equations. For practical purposes, many equations simply can’t be balanced by inspection.

What if there were a systematic method for balancing any chemical equation, regardless of complexity?

A paper published by Lawrence Thorne in 2010 describes such a method. This matrix-based approach balances a large number of equations that can’t be balanced by inspection, or even other matrix approaches. An introduction to using this method is given in the video presentation (slidedeck here).

Although operationally simple, Thorne’s method does require a lot of arithmetic, which can become tedious. Matrix algebra can be done on many scientific calculators or spreadsheets, but setup requires technical skill and data entry is lengthy at best.
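To give a feel for the arithmetic involved, here is a generic null-space calculation in Python using exact fractions. It is a sketch of the general matrix idea, not a reproduction of Thorne's specific procedure. It assumes one row per element, one column per species, reactant counts entered as positive numbers and product counts as negative numbers, and an equation with exactly one independent solution.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def balance(composition):
    """Return the smallest integer coefficients, one per species (column)."""
    rows = [[Fraction(x) for x in row] for row in composition]
    n_cols = len(rows[0])
    pivots = []
    r = 0
    for c in range(n_cols):
        # Find a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        pivot = rows[r][c]
        rows[r] = [x / pivot for x in rows[r]]
        # Clear column c in every other row (Gauss-Jordan elimination).
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    free = [c for c in range(n_cols) if c not in pivots]
    if len(free) != 1:
        raise ValueError("expected exactly one free coefficient")
    f = free[0]
    solution = [Fraction(0)] * n_cols
    solution[f] = Fraction(1)
    for row_index, c in enumerate(pivots):
        solution[c] = -rows[row_index][f]
    # Scale the fractions up to the smallest whole numbers.
    scale = reduce(lambda a, b: a * b // gcd(a, b), (x.denominator for x in solution), 1)
    ints = [int(x * scale) for x in solution]
    common = reduce(gcd, ints)
    return [v // common for v in ints]


# CH4 + O2 -> CO2 + H2O; columns are CH4, O2, CO2, H2O.
methane_combustion = [
    [1, 0, -1, 0],   # carbon
    [4, 0, 0, -2],   # hydrogen
    [0, 2, -2, -1],  # oxygen
]
print(balance(methane_combustion))  # [1, 2, 1, 2]
```

Even this small example takes several elimination steps, which illustrates why doing the same work by hand or in a spreadsheet becomes tedious for larger equations.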

For those so inclined, ReactionMate is an iOS app that offers a convenient user interface for Thorne’s method.

ChemWriter Keyboard Shortcuts for Faster Structure Drawing

February 6th, 2013

ChemWriter is a chemical structure editor that can be embedded into web pages. One of our main goals was to make chemical structure entry as fast as it can be. Toward this end, a number of keyboard shortcuts were built in.

Keyboard shortcuts are accessed by hovering the mouse cursor over an existing atom and pressing a key, either with or without the shift key. All ChemWriter keyboard shortcuts are listed below:

  • a: Benzene (aromatic) ring.
  • b: Boron atom.
  • c: Carbon atom.
  • f: Fluorine atom.
  • h: Hydrogen atom.
  • i: Iodine atom.
  • l: Chlorine atom.
  • n: Nitrogen atom.
  • o: Oxygen atom.
  • p: Phosphorus atom.
  • r: Bromine atom.
  • s: Sulfur atom.
  • t: Tin atom.
  • z: Silicon atom.
  • delete/backspace: Delete current atom or selection.

Product Preview: ChemWriter App for iPad

January 22nd, 2013

Within three short years, tablet computers have gone from clumsy gadgets few cared about or bought to workhorses rapidly replacing laptops and desktops. Key to this transformation has been a booming app economy. Although many software categories such as entertainment, business, and time management have been inundated with new apps, niche areas such as chemistry research and advanced education have experienced a more muted uptake.

Chemical structure editors are essential tools in modern organic chemistry and related fields. This article offers a preview of a new chemical structure editor app for iPad® devices now under development at Metamolecular.

About ChemWriter

ChemWriter® is a chemical structure editor originally built for use on Web pages. As such, it’s a tool for software developers. ChemWriter isn’t currently something that most chemists would have a need to buy for themselves. However, given numerous questions I’ve fielded on this very topic, the demand for alternatives in this space is quite clear.

Over the last five years, ChemWriter has evolved significantly, mostly driven by excellent customer feedback. A lot has been learned about what works and what doesn’t for browser-based chemical structure editors.

Now it’s time to apply that experience to building a great chemical structure editor app for iPad devices.

Fast and Fluid

Users of tablet computers expect a different experience than users of desktops, and it’s important to deliver on those expectations. Although tablet computing resources like memory and processor power are quite constrained compared to desktops, it’s critical that these limitations never show themselves. From the first moment an app is launched, every user interaction with it needs to be fast, fluid, and intuitive.

For this reason, the ChemWriter app is being written from scratch around the touch screen. Details of the development process may be the subject of future posts. Suffice it to say that the ChemWriter app is being written with a keen respect for the iPad’s strengths and limitations.

Features and Pricing

A two-tiered pricing model is currently envisioned in which basic drawing functionality would be available in a free app. An in-app purchase would then enable premium features.

Free Features
  • Draw molecules with an emphasis on fast, efficient structure creation.
  • Draw non-molecule shapes including straight/curved reaction arrows, boxes, geometrical shapes, and curves.
  • Search select databases by exact structure and substructure.
  • Obtain structure from name.
Premium Features
  • Save as Scalable Vector Graphics (SVG) files.
  • Save as high-resolution PNG images.
  • Save as ChemDraw (*.cdx) files.
  • Load ChemDraw files.
  • Calculations including molecular weight, InChI, and SMILES.
  • iCloud and DropBox synchronization.

Initial releases would focus on enabling the fast and efficient drawing of chemical structures through a touch interface. A major problem to be addressed is how to enable precision structure drawing using a non-precision implement (the finger).

Also available in the first release would be the ability to read and write files in a variety of formats. Chemistry formats including molfiles and ChemDraw® files would both be supported, as would image formats including SVG and PNG. The app would support seamlessly moving these file representations into and out of popular cloud-based storage utilities including iCloud® and DropBox®.

Subsequent releases would focus on calculations and derivation of other kinds of information from chemical structures.

Conclusions

Tools for drawing and using chemical structures are essential in organic chemistry and related fields. A new app based on ChemWriter is in development that aims to bring high-quality structure drawing and analysis to iPad devices.

Optimizing Organic Reactions with Design of Experiments and Principal Component Analysis

January 16th, 2013

Few topics in organic chemistry are more important than reaction optimization. The availability of an efficient reaction can add millions of dollars to the bottom line of a company. Likewise, access to a practical reaction often opens up entirely new areas of scientific study, as was the case with Suzuki coupling.

A recent paper by CatScI combines two powerful, although not widely known, techniques for reaction optimization: Design of Experiments (DoE) and Principal Component Analysis (PCA).

The Reaction Optimization Problem

A reaction can be thought of as a system accepting a number of inputs (parameters) and providing one or more outputs. Example inputs might include: temperature; solvent; pH; catalyst; and time. Example outputs might include: yield; selectivity; purity; and cost. The goal of reaction optimization is to select the best inputs to achieve a given output.

It’s all too easy to forget that even the simplest reactions can accept multiple inputs. Yet limitations of time and money make the exhaustive exploration of every input impractical.

Given very real practical constraints, what’s the best way to optimize multiple reaction inputs to give the desired outputs?

Why Single Variable Optimization Can Fail

A popular way to deal with the multi-variable nature of reaction optimization is the One Variable at a Time (OVAT) approach. Here, all experimental inputs except one, let’s say time, are kept constant. An output, let’s say yield, is then recorded at multiple time values. In this way, the “optimal” reaction time is revealed. This reaction time is then kept constant and another variable is chosen, for example temperature. The process continues until all inputs have been probed and a set of optimal inputs has been determined.

Many reactions have been optimized using the OVAT approach, so what’s the problem?

Understanding the problem with OVAT can be made easier by visualizing reaction space. For the highly simplified example above, reaction space can be represented as a plot in which time and temperature are axes. The color of a point represents yield at a given time and temperature, with red representing higher yield and black representing lower yield.

The study might begin at Point “S” with an initial discovery or literature procedure. Using a systematic approach, two temperatures on either side of “S” are probed. Likewise, yields at two times on either side of “S” are determined.

All modifications to the original procedure resulted in equal or lower yields. The conclusion would likely be that the original conditions were optimal - case closed.

And this is where the problem lies. Given infinite resources, we might perform a more comprehensive study and discover the following “response surface”:

The OVAT study identified a local maximum at Point “S”. It failed to identify a much better combination of time and temperature at point “M”.
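The same failure can be reproduced numerically. The sketch below uses a made-up response surface with a small peak near the starting point and a larger one elsewhere; the function, the grid, and the peak positions are all assumptions chosen for illustration, not data from any real reaction.

```python
import math

GRID = range(0, 9)  # candidate settings for each input, in arbitrary units


def simulated_yield(time, temperature):
    """Made-up surface: a small peak near (2, 2) and a larger one near (6, 6)."""
    local_peak = 50 * math.exp(-((time - 2) ** 2 + (temperature - 2) ** 2))
    global_peak = 90 * math.exp(-((time - 6) ** 2 + (temperature - 6) ** 2) / 2)
    return local_peak + global_peak


def ovat(start_time, start_temperature):
    """Optimize time with temperature fixed, then temperature with time fixed.

    start_time is included only to describe the starting point; the first
    scan already covers every time value on the grid.
    """
    best_time = max(GRID, key=lambda t: simulated_yield(t, start_temperature))
    best_temperature = max(GRID, key=lambda temp: simulated_yield(best_time, temp))
    return best_time, best_temperature


def exhaustive():
    """Evaluate every grid point, which is only affordable in a simulation."""
    points = ((t, temp) for t in GRID for temp in GRID)
    return max(points, key=lambda p: simulated_yield(*p))


stuck_at = ovat(2, 2)      # stays on the small peak
true_best = exhaustive()   # finds the larger peak
```

Starting from (2, 2), each single-variable scan sees only a slice through the small peak, so the larger peak at (6, 6) is never visited.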

Of course, resources are always limited. So other than blind luck, how can the risk of getting stuck on a local maximum be reduced?

Billions and Billions

The CatScI paper describes an interesting thought experiment. Given a generic Suzuki reaction, how many runs would be needed to fully map reaction space, and therefore identify an absolute yield maximum?

Before calculating the number of runs, it’s important to distinguish between two fundamentally different types of reaction parameters (the term used in the paper for inputs). These are “discrete” parameters and “continuous” parameters.

A continuous parameter can take an infinitely divisible range of values. Examples of continuous parameters in the Suzuki reaction might include: temperature; time; and palladium precatalyst loading.

In contrast, a discrete parameter can only be selected from a list of values. Discrete parameters in the Suzuki reaction might include: ligand identity; base identity; palladium source; and order of addition.

To calculate the number of runs needed to map the Suzuki reaction space, let’s only choose two values for continuous parameters - one high value and one low value. Note that even this simplification leaves vast regions unmapped.

For discrete parameters, each possible value will need to be present in the set of runs.

The combinatorial increase in the number of runs can be calculated easily:

Although after 51.2 million runs we’d have a good idea of the gross topology of the reaction space, this knowledge would come with an impractical price tag. Furthermore, we’d be missing a lot of information. For example, just increasing the number of runs for continuous parameters from two to four gives 6.6 billion runs. And this number doesn’t take into account duplicate runs needed to reduce error.

Referring back to the combinatorial diagram above, notice how the number of ligands and solvents, and the fact that ligand and solvent identities are both discrete parameters, greatly contributed to the increase of required runs.
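The two totals quoted above can be checked with a few lines of arithmetic. The per-parameter breakdown is not reproduced in this text, so the two constants below are inferred values, chosen only because they reproduce both the 51.2 million and the 6.6 billion figures.

```python
# Both constants are inferred from the quoted totals, not taken from the paper.
DISCRETE_COMBINATIONS = 400_000  # product of the value counts of all discrete parameters
CONTINUOUS_PARAMETERS = 7        # number of continuous parameters


def total_runs(levels_per_continuous):
    """Runs needed when each continuous parameter is sampled at the given number of levels."""
    return DISCRETE_COMBINATIONS * levels_per_continuous ** CONTINUOUS_PARAMETERS


two_level = total_runs(2)    # 51_200_000
four_level = total_runs(4)   # 6_553_600_000
```

Doubling the levels of each continuous parameter multiplies the total by 2^7 = 128, while every added ligand or solvent multiplies the discrete factor directly. Both effects explain why the totals grow so quickly.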

Reducing Discrete Parameters (and Reaction Runs) Through Principal Component Analysis

To arrive at a practical number of runs needed to map reaction space, the CatScI authors turned to Principal Component Analysis (PCA). At its simplest level, PCA is a mathematical technique that can be used to convert a discrete parameter into one or more continuous parameters.

In an early application of PCA to chemistry, Rolf Carlson’s group attempted to solve the problem of choosing a reaction solvent. Noting the inherent problems in working with loosely-defined concepts such as “polarity”, Carlson’s group wanted a better way to classify solvents.

Starting from experimentally-derived physical properties, an approach was developed that was capable of placing each of 63 common solvents into a two-dimensional “solvent space”. In principle, even more axes (dimensions) could be added to increase precision.

This work made it possible to refer to organic solvents, not by name or ill-defined concepts like polarity, but by their coordinates on a grid constructed mathematically from measured physical properties.

More recent application of this technique using computationally-derived properties led to the development of a “ligand space” for 348 known monodentate phosphine ligands by Fey and coworkers.

Although the underlying math is quite complicated, the pattern is simple enough. Given any set of discrete parameters, define a group of measurements for each item. The number of measurements can be as large as necessary, and can even be computationally-derived. Then, using PCA, derive a continuous, multidimensional space in which each item can be assigned a coordinate.
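The pattern can be sketched in plain Python. This is a minimal illustration using power iteration on the covariance matrix, not the procedure used by Carlson or Fey; the solvent labels and descriptor values are made up, and a real study would normally scale each descriptor before analysis.

```python
import math


def center(data):
    """Subtract each descriptor's mean so every column averages zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_eigenvector(matrix, iterations=500):
    """Power iteration: repeated multiplication converges on the main axis."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v


def pca(data, n_components=2):
    """Return (components, scores); scores are each item's new coordinates."""
    centered = center(data)
    cov = covariance(centered)
    components = []
    for _ in range(n_components):
        eigenvalue, v = dominant_eigenvector(cov)
        components.append(v)
        # Deflation: remove the axis just found before looking for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, scores


# Hypothetical descriptor table: three made-up measurements per solvent.
descriptors = {
    "solvent_A": [2.0, 0.1, 70.0],
    "solvent_B": [25.0, 1.7, 80.0],
    "solvent_C": [37.0, 3.9, 82.0],
    "solvent_D": [8.0, 1.8, 66.0],
}
_, coordinates = pca(list(descriptors.values()), n_components=2)
solvent_space = dict(zip(descriptors, coordinates))
```

Each solvent ends up with a pair of continuous coordinates, which is the form a DoE study can work with.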

PCA itself is not without limitations. For example, the effectiveness of the approach greatly depends on the set of descriptors assigned to each item in a discrete parameter set. Moreover, the most appropriate descriptor set can vary from application to application.

Bearing these and other PCA limitations in mind, the big win was being able to convert the discrete parameters of solvent and ligand into continuous parameters. This conversion made feasible a DoE study that would have otherwise been impractical.

DoE and PCA In Practice

To test the approach, the CatScI group optimized a Buchwald-Hartwig sulfamidation reaction with the goal of identifying a cost-effective ligand without intellectual property (IP) restrictions:

A variant of DoE was used in which an initial broad and shallow iteration revealed reaction space “hot spots” that were then probed with progressively narrower focus in subsequent iterations:

  1. With the goal of finding an alternative ligand, 35 runs were made in which nine ligands were tested with nine solvents and two palladium precatalysts.
  2. Twelve runs in the hot spot from Iteration 1 then identified an additional four ligands giving high or complete conversion.
  3. Twelve more runs identified four more ligands.
  4. Analysis of the ligands in Iteration 3 identified one with no intellectual property restrictions and an acceptable cost.
  5. Nineteen runs were made with the aim of optimizing various stoichiometries using the ligand found in Iteration 4.
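To put these run counts in perspective, the short calculation below compares Iteration 1 with a full factorial screen of the same ligands, solvents, and precatalysts, then totals the runs listed above. The item labels are placeholders; only the counts come from the list.

```python
from itertools import product

# Placeholder labels; the actual ligands, solvents, and precatalysts are not named here.
ligands = [f"ligand_{i}" for i in range(1, 10)]          # nine ligands
solvents = [f"solvent_{i}" for i in range(1, 10)]        # nine solvents
precatalysts = ["precatalyst_1", "precatalyst_2"]        # two precatalysts

full_factorial = list(product(ligands, solvents, precatalysts))

# Runs reported for each iteration; Iteration 4 was an analysis step with no runs listed.
runs_per_iteration = {1: 35, 2: 12, 3: 12, 5: 19}
total_runs = sum(runs_per_iteration.values())

fraction_screened = runs_per_iteration[1] / len(full_factorial)
```

Iteration 1 covers 35 of 162 possible combinations, roughly a fifth, and the entire study adds up to 78 runs.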

Helpful Resources

Although an excellent illustration of the utility of combining PCA and DoE, the CatScI paper does little to answer specific implementation questions. What software, if any, was used for the DoE? How exactly were ligands chosen using the Fey phosphine ligand space? What special statistical analysis, if any, was used to progress from iteration to iteration?

These and other questions are not answered. This is not to fault the authors, because the paper is clearly marked as a “Concept Article”. Still, even at this level of detail the study offers a compelling argument with plenty of jumping off points.

But the question remains: how can a chemist interested in trying either DoE, PCA, or both together in the lab get started?

A number of resources are available, some of which are summarized below:

Introductions to DoE with a Chemistry Emphasis
Introduction to Principal Component Analysis
Interactive Tutorial
  • Stat-Ease [Step-by-step interactive tutorial using DoE software]
Blog Posts on DoE Applied to Chemistry
Some Other Applications of DoE to Organic Synthesis
Software with Free Trials

Conclusions

Design of Experiments (DoE) offers compelling advantages over single-variable optimization for organic synthesis. Principal Component Analysis (PCA) can greatly reduce the number of runs required to map reaction space during a DoE study by converting discrete parameters such as solvent and ligand into continuous ones.

Interactive, Browser-Based 3D Molecule Visualizations with GLmol and WebGL

January 10th, 2013

Although many tools for 3D visualization of small molecules and biopolymers have been released as desktop applications, relatively few programs are available for use in Web applications. GLmol is one such tool that takes advantage of fast in-browser 3D graphics capabilities now available through WebGL. This article introduces GLmol by discussing its main features, and provides fully-functional examples of deployment and scripting.

Application Demo

The GLmol download package contains a sample page illustrating the most important features:

  • Load PDB files from a local directory or by URL
  • View proteins using a variety of standard representations, including: thick ribbons; thin ribbons; strands; and cylinders/plates.
  • Zoom, pan, translate, and slab
  • Show additional information, including crystal packing, unit cell, and sidechains.
  • A variety of coloring options.
  • Screenshot capture

Deployment Demo: Embedding a Protein Structure

GLmol can read PDB files loaded via asynchronous HTTP calls. This use is illustrated in the protein embedding demo:

Note that same origin policies may prevent direct loading of files from sources other than the original host in some situations.

Scripting Demo: Spinning Molecule

GLmol is written entirely in JavaScript. As such, the software lends itself to a variety of interesting scripting techniques by default. The spinning molecule demo shows how to combine GLmol with the requestAnimationFrame API to animate a scene:

WebGL Support

WebGL is now supported on all modern browsers except Internet Explorer. Microsoft has so far not publicly indicated whether WebGL would be supported in IE 11.

Other Software

Other pure JavaScript 3D molecule display components have been described. One of the authors of Jmol has developed a Java- and WebGL-free version of the software called JSmol. Jolicule is a JavaScript application for 3D molecular visualization that also requires no WebGL. iChemLabs offers a set of 3D Structure Canvases based on WebGL.

Those wanting a more detailed look at how to use WebGL in the context of a molecule display element may find a recent tutorial helpful.

Conclusions

GLmol offers many possibilities for both interactive and scripted 3D molecule visualizations. Avoiding the Java Plugin and its accompanying complications, GLmol offers an attractive alternative to the popular Jmol applet. GLmol’s liberal open source license (MIT or GPL3) makes it an appealing starting point for further development.
