Signals Blog

Substructure Search for Websites

Structure search is a frequent request among users of a website hosting chemical collections. If you run such a site and would like to make it much easier to find chemicals by exact structure and substructure, how would you do it?

This article presents a high-level discussion of the main problems and solutions for adding structure search capabilities to an existing website.

Content Management Systems

You may have used an off-the-shelf content management system (CMS) to build your site. These products are excellent for building general-purpose websites, but they were never designed for scientific applications like substructure search. Before you even begin, integration will be a major consideration.

CMSs differ greatly in their extensibility. For example, one popular ecommerce solution, Magento, exposes a plugin interface, with a large number of pre-built modules available. Other CMSs offer similar options. So one solution could be to buy or build a CMS plugin.

Another option might be to extend the CMS more directly. However, this approach is only viable if the CMS was released under an Open Source license and is under your direct control (not hosted by a third party).

Either option, a CMS plugin or direct extension, is likely to require deep integration into the database level. The easier a CMS makes it to alter its database and add new behaviors, the higher the chances of success.

To the extent that a complete site redesign is impractical, your initial choice of CMS will greatly constrain the options for implementing substructure search.

Substructure Search Systems

Substructure search systems are widely available. For collections of fewer than 100,000 structures, flexibility, not performance, will be the deciding factor. The importance of flexibility becomes most apparent when linking a substructure search system with a CMS on a server.

The software environments running popular CMSs are not generally the same ones used to develop substructure search systems. For example, many CMSs are written in PHP, a language that by itself is incompatible with the languages in which most substructure search software is written. A separate interface is required.

Substructure search systems have mainly been written in Java, C/C++, and Python. Although commercial products exist, Open Source alternatives will work well in many situations.

One integration approach involves deploying the substructure search software into the database system via a "chemistry cartridge". The CMS can then interact with the chemistry software through SQL, a well-supported interface. Alternatively, substructure search functionality can be exposed through a local network interface or other inter-process communication methods.
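As a rough illustration of the cartridge idea, the sketch below registers a matching function inside an in-memory SQLite database, so the calling code only has to issue SQL. Everything here is an assumption made for illustration: the `compounds` table, the `is_substructure` function name, and especially the matcher itself, which is a naive SMILES substring test standing in for a real chemistry toolkit. A real cartridge would compare molecular graphs rather than text.

```python
import sqlite3


def naive_match(query_smiles, target_smiles):
    """Placeholder matcher: plain substring test on SMILES text.

    This is NOT real substructure matching. A real cartridge would
    delegate to a chemistry toolkit that compares molecular graphs.
    """
    return 1 if query_smiles in target_smiles else 0


def build_demo_db():
    """Create an in-memory database with a hypothetical compounds table."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL under a hypothetical name.
    conn.create_function("is_substructure", 2, naive_match)
    conn.execute("CREATE TABLE compounds (catalog_no TEXT, smiles TEXT)")
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?)",
        [
            ("A-001", "CCO"),        # ethanol
            ("A-002", "c1ccccc1O"),  # phenol
            ("A-003", "c1ccccc1C"),  # toluene
        ],
    )
    return conn


def search(conn, query_smiles):
    """Return catalog numbers whose stored structure matches the query."""
    rows = conn.execute(
        "SELECT catalog_no FROM compounds "
        "WHERE is_substructure(?, smiles) ORDER BY catalog_no",
        (query_smiles,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(search(build_demo_db(), "c1ccccc1"))
```

The point of the design is that the CMS side never touches chemistry code directly. It sends a query string through the database connection it already has.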

A more detailed look at implementing substructure search is available in a separate article.

Data Conversion

In addition to software requirements, implementing a substructure search system for your website will make new demands on your data.

Many chemical collections start out as lists of Chemical Abstracts Service (CAS) numbers and/or trivial/systematic names. These text identifiers by themselves are insufficient for substructure search. What's needed is a way to convert these text identifiers into a machine-readable chemical structure representation.

The goal of your data migration will most likely be to generate an SD File. An SD File combines machine-readable chemical structures with data about those structures. For example, an SD File for a chemical supplier might associate a catalog number with quantity/pricing information for each structure.
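To make the format concrete, here is a minimal sketch of how one SD File record is laid out: a molfile block, then named data items, then a `$$$$` delimiter line. The function name, the field names, and the hand-written methane molfile are illustrative assumptions. In practice the molfile block would come from a chemistry toolkit or structure editor rather than being typed by hand.

```python
# A hand-written V2000 molfile for methane, used only as sample input.
METHANE_MOLFILE = "\n".join([
    "methane",                # header line 1: molecule name
    "",                       # header line 2: program/timestamp (left blank)
    "hand-written example",   # header line 3: comment
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])


def sd_record(molfile, fields):
    """Build one SD File record from a molfile block and a dict of data.

    Each data item is a header line naming the field, the value, and a
    blank line. The record ends with a line containing only '$$$$'.
    """
    lines = [molfile.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f">  <{name}>")
        lines.append(str(value))
        lines.append("")
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


def sd_file(records):
    """Concatenate (molfile, fields) pairs into the text of an SD File."""
    return "".join(sd_record(molfile, fields) for molfile, fields in records)


if __name__ == "__main__":
    print(sd_record(METHANE_MOLFILE, {"CATALOG_NO": "A-001", "PRICE_USD": 12.5}))
```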

The options for converting CAS numbers and names to chemical structures are much more limited than for substructure search. The main problem in converting CAS numbers is that the only authoritative source for CAS to structure mappings is Chemical Abstracts Service itself. The expense of this service prevents many from using it. With the right steps, however, large numbers of accurate CAS to structure conversions can be made automatically by using public datasets such as PubChem.
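One inexpensive step before any lookup is a sanity check on the identifiers themselves. CAS numbers end in a check digit, so many typing errors can be caught without consulting any dataset. The sketch below (the function name is illustrative) tests format and check digit only. It cannot tell whether a well-formed number is actually assigned to a compound.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Return True if the string is well-formed and its check digit agrees.

    The check digit is the weighted sum of the other digits modulo 10,
    with weights 1, 2, 3, ... counted from the rightmost of those digits.
    """
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ["7732-18-5", "64-17-5", "7732-18-4", "not-a-cas"]:
        print(candidate, is_valid_cas(candidate))
```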

Systematic names can be converted to chemical structures by software. One free option that works very well in most cases is OPSIN. Commercial name to structure packages are also available.

Using name to structure software and possibly a dictionary derived from one or more public datasets, a spreadsheet of names and CAS numbers can be compiled into an SD File. Although the cleanliness of the spreadsheet data and the accuracy of the conversion process may be quite high, a manual check of the results is highly recommended.
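The workflow just described might be sketched as follows. The two lookup dictionaries are hypothetical stand-ins: `cas_to_smiles` for a dictionary derived from a public dataset, and `name_to_smiles` for the output of name to structure software. The column names `name` and `cas` are also assumptions about the spreadsheet export. Rows that cannot be resolved are kept in a separate list so that they can be checked by hand.

```python
import csv
import io


def resolve_rows(csv_text, cas_to_smiles, name_to_smiles):
    """Split spreadsheet rows into resolved and unresolved lists.

    CAS numbers are tried first, then lower-cased names. Resolved rows
    gain a 'smiles' key; unresolved rows are returned unchanged.
    """
    resolved, unresolved = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cas = (row.get("cas") or "").strip()
        name = (row.get("name") or "").strip().lower()
        structure = cas_to_smiles.get(cas) or name_to_smiles.get(name)
        if structure:
            resolved.append({**row, "smiles": structure})
        else:
            unresolved.append(row)
    return resolved, unresolved


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sheet = "name,cas\nethanol,64-17-5\nWater,\nmystery compound,0-00-0\n"
    done, todo = resolve_rows(sheet, {"64-17-5": "CCO"}, {"water": "O"})
    print("resolved:", [r["name"] for r in done])
    print("needs manual review:", [r["name"] for r in todo])
```

The sketch stops at a line notation (SMILES) for brevity. Turning each resolved structure into a molfile block for the SD File would be a job for a chemistry toolkit.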

The result of this data conversion is an SD File that can be used to create a structure searchable database.

Putting It All Together

To get the most out of your substructure search system, it's essential to develop a reasonably accurate picture of how people will use it.

A search typically begins with a chemist drawing a structure. How will your site make this possible? A chemical structure editor is the first interaction most chemists will have with your site. A number of browser-based editors are available. One of them, ChemWriter, is available from Metamolecular.

What happens after a chemist submits a search structure? Clearly, a list of chemicals will be returned, but in what format? Will you display structures and data together or just one of the two? Will you provide a link to a detail view for each result? If you're hosting a chemical catalog, do you want to show customers a "similar products" sidebar?

Some applications will require more power than others. How important will advanced features be to your users? For example, similarity search might be very helpful in some situations, but less so in others. Likewise, query atom capability could mean the difference between a useful application and one that raises complaints.
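For a sense of what similarity search involves, one widely used measure is the Tanimoto coefficient: the number of fingerprint features two structures share, divided by the number of features present in either. The sketch below assumes fingerprints are already available as sets of integers. How they are generated is left to a chemistry toolkit, and the catalog contents, function names, and the 0.5 threshold are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of features."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treat as dissimilar
    return len(a & b) / len(union)


def rank_by_similarity(query_fp, catalog, threshold=0.5):
    """Return catalog keys scoring at or above threshold, best match first.

    `catalog` maps an identifier to its fingerprint. Ties are broken
    alphabetically so the ordering is deterministic.
    """
    scored = [(tanimoto(query_fp, fp), key) for key, fp in catalog.items()]
    hits = [(score, key) for score, key in scored if score >= threshold]
    return [key for score, key in sorted(hits, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    # Hypothetical fingerprints, for illustration only.
    catalog = {"A-001": {1, 2, 3}, "A-002": {2, 3, 4}, "A-003": {9}}
    print(rank_by_similarity({1, 2, 3}, catalog))
```

A ranked list like this could also feed a "similar products" sidebar on a catalog page.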

These are just some examples of the things to consider when designing a structure search user interface.

Custom Site

Extending a general-purpose CMS with substructure search raises significant integration issues, and chemical structures are central to chemistry. Given both, another option might be to rebuild the site with integrated chemical structure handling capabilities.

Such an option makes the most sense when you know the site will be doing more than just substructure search. For example, you might "publish" new products directly to your catalog through a chemically-aware admin interface.

The biggest downside is, of course, the relatively high up-front cost. But being able to recapture these costs through increased revenue or cost reduction in other areas of the business might make a custom site with integrated chemical capabilities an attractive alternative.

Web Services

Web services offer yet another approach to integrating structure search. In this case, a remote web server would host your structure collection and the search software. Communication with your site would take place, not at the level of databases and CMS behavior, but at the level of the browser.

The main problems with this approach are the inflexibility of existing solutions and the relatively poor integration they provide. As with a CMS extension, data conversion would still likely be a necessary step. But for projects on tight budgets, the web services route might be good enough.

Conclusions

Enabling substructure search on a website is possible, but not trivial. A key consideration in adapting an existing site is integration with third-party software. Even with all of the components in place, data conversion will likely be necessary. User interface considerations and workflow become most visible after the low-level work has been done, but should really be addressed up-front. A custom site may be the best option when all factors are considered.

Despite the costs and difficulties, adding substructure search can be a game-changer.

Metamolecular has worked with multiple companies seeking to build structure-searchable websites. If you're interested in doing the same, please feel free to contact me for a free consultation.

Resources