Enabling exact structure search and substructure search on a website can be a mysterious process. Fortunately, a number of tools and techniques are available to help. This article attempts to demystify, from the perspective of a non-expert, the process of enabling exact- and substructure search for databases hosted on the Internet or a private intranet.
You maintain a website that acts as a gateway to a chemical database. Site visitors, customers, and perhaps your boss, have been increasingly interested in being able to search the site by exact- and substructure.
It's pretty clear that text search is simple - your content management system already has a plugin for that. Your site visitors can already enter names, CAS numbers, and other identifiers into a text box and get their results. But what about structure search?
If the above description matches your current situation at all, then this is the article you've been looking for. Read on for how to solve the problem.
The Bad News
The bad news is that you may not be able to solve the problem yourself - outside help may be required, depending on your expertise. At the very minimum, you'll probably need a capable, persistent web developer who is willing to learn new things. You'll also need the following components:
- Web Application Framework. This is the software around which your current or future website will be built. Examples of popular web application frameworks include PHP, Ruby on Rails, ASP.NET, and Django. Each of these frameworks supports a number of content management systems. For example, both WordPress and Drupal are based on PHP.
- A Database. If you're already running a website of any size, chances are good you already have a database in place. Any database will do - popular examples include MySQL, Oracle, and Microsoft SQL Server.
- Machine-Readable Chemical Structure Representations. Over the years, a number of molecular file formats have been developed. One of the most widely used and best supported of these is the molfile.
- Structure Canonicalizer. A canonicalizer converts multiple representations of the same chemical structure into the same string of characters. If you store precomputed canonical strings in your database, exact structure search reduces to canonicalizing the query structure and performing an exact string match - a very efficient process. Although a number of canonicalization systems have been developed, I can highly recommend IUPAC's InChI, a free software package.
- Substructure Matcher. This software performs substructure matching through atom-by-atom search. As the term implies, the matcher compares every atom of a database molecule at least once with every atom in a query structure - typically many more times than that. This process is repeated for every structure in your database, so for very large databases it can be very slow. A number of commercial and free packages are available. Examples of free packages include our own MX, CDK, OpenBabel, RDKit, and Indigo.
- Fingerprint Generator. For structure collections of around 2,000 or more, a fingerprint generator can dramatically accelerate substructure search queries. This software converts chemical structures into fixed-length binary fingerprints. A database structure whose fingerprint matches that of the query structure (typically, it has every bit that is set in the query fingerprint) may be a match for the query. But when a database structure's fingerprint does not match the query fingerprint, no match is possible. So the number of structures requiring atom-by-atom search can be greatly reduced - leading to big performance gains. Many packages capable of performing substructure search are also capable of fingerprint generation.
- Chemical Structure Editor. Enables chemical structure input on a Web page. One ergonomic and fast-loading option is ChemWriter, a product sold by us.
- Chemical Structure Viewer. Displays chemical structures, typically as part of a hitset or record detail. Server-side solutions do exist, but ChemWriter also supports browser-side structure rendering directly from a molfile.
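To make the canonicalizer item above concrete, here is a minimal sketch of exact structure search as an indexed string match. It assumes the InChI strings have already been computed by a canonicalizer when the records were loaded, and that the query structure has been canonicalized the same way before the lookup. The table name, column names, and the exact_search function are hypothetical, and an in-memory SQLite database stands in for whatever database you already run.

```python
import sqlite3

# In-memory stand-in for your existing database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE structures ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, inchi TEXT NOT NULL)"
)
# The index is what turns exact search into a fast key lookup.
conn.execute("CREATE INDEX idx_structures_inchi ON structures (inchi)")

# Sample rows; the InChI strings are assumed to be precomputed at load time.
conn.executemany(
    "INSERT INTO structures (name, inchi) VALUES (?, ?)",
    [
        ("aspirin",
         "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"),
        ("benzene", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),
        ("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ],
)


def exact_search(conn, query_inchi):
    """Return (id, name) rows whose stored canonical string equals the query's."""
    cursor = conn.execute(
        "SELECT id, name FROM structures WHERE inchi = ? ORDER BY id",
        (query_inchi,),
    )
    return cursor.fetchall()


hits = exact_search(conn, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(hits)
```

Because the comparison is plain string equality, no chemistry software runs at query time beyond canonicalizing the one query structure.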
For a Web-based structure search system to really work, just having the components is not enough - the site itself needs a suitable layout.
- Individual Structure Pages. To the extent that you can think of your database as a collection of records or rows in a table, each record should have a uniquely-addressable page on your site. These are the pages a user will be sent to when they click the links in your structure search results.
- A Query Interface. Typically a single, dedicated page. Its sole purpose is to accept user input in a form, and then submit this form to your server. For an example of how structures can be submitted to a server, see the ChemWriter documentation page.
- A Results Page. This page displays the results of a query. It's typically a template that gets populated with specific search results.
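As a rough illustration of how the results page and the individual structure pages fit together, the sketch below fills a tiny results template with one link per hit. The /structures/<id> URL pattern and the function names are assumptions made for this example; your framework's own routing and templating would normally do this job.

```python
from html import escape


def record_url(record_id):
    """Hypothetical URL scheme: one uniquely-addressable page per record."""
    return f"/structures/{record_id}"


def render_results(hits):
    """Build an HTML list linking each (id, name) hit to its record page."""
    items = "\n".join(
        f'<li><a href="{record_url(record_id)}">{escape(name)}</a></li>'
        for record_id, name in hits
    )
    return f"<ul>\n{items}\n</ul>"


print(render_results([(1, "aspirin"), (2, "salicylic acid")]))
```

Escaping the record names keeps characters such as < and & in chemical names from breaking the page.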
Bringing It All Together
The resulting structure search system will consist of components that run on the browser and components that run on the server.
Let's say Sally, one of your site's visitors, wants to perform a substructure search for aspirin in your database of 5,000 structures. By following a link, she's directed to the structure search page. She draws the structure in the editor provided on the page, then clicks 'Submit'.
On receiving that signal, the browser prepares a substructure search query as form data, and POSTs it to your server. Your server generates a fingerprint for Sally's structure and scans your database looking for candidate matches. Twenty-two records are found.
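The first server-side step, reading the POSTed form, might look something like the sketch below. The field names molfile and search_type are assumptions, not something a particular editor or framework prescribes, and most web frameworks will parse the form body for you.

```python
from urllib.parse import parse_qs, urlencode


def parse_query(body):
    """Extract the drawn structure and search type from a URL-encoded form body."""
    fields = parse_qs(body, keep_blank_values=True)
    molfile = fields.get("molfile", [""])[0]                  # hypothetical field name
    search_type = fields.get("search_type", ["substructure"])[0]
    if not molfile.strip():
        raise ValueError("no structure was submitted")
    if search_type not in ("exact", "substructure"):
        raise ValueError(f"unknown search type: {search_type!r}")
    return molfile, search_type


# Simulate what a browser would send; the molfile text here is a placeholder.
body = urlencode({"molfile": "placeholder molfile text", "search_type": "substructure"})
print(parse_query(body))
```

Validating the input at this point keeps empty or malformed submissions away from the more expensive search code.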
At this point your server has a list of candidate structures, but not actual hits. To get hits, an atom-by-atom search is performed with the query structure against all twenty-two candidate structures. Thirteen are substructure matches.
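The two-stage search just described, a fingerprint screen followed by atom-by-atom verification, can be sketched in plain Python. This is a toy under loud assumptions: molecules are bare heavy-atom graphs with no hydrogens or bond orders, the 64-bit fingerprint and its features are invented for the example, and the matcher is a simple backtracking search. A real site would use one of the cheminformatics packages listed earlier for both steps.

```python
import zlib

FP_BITS = 64  # toy length; production fingerprints are usually much longer

# A molecule is (atoms, bonds): element symbols plus (i, j) atom-index pairs.


def neighbors(mol):
    atoms, bonds = mol
    adj = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def fingerprint(mol):
    """Set one bit per element present and per bonded element pair."""
    atoms, bonds = mol
    features = set(atoms)
    features.update("-".join(sorted((atoms[i], atoms[j]))) for i, j in bonds)
    fp = 0
    for feature in features:
        fp |= 1 << (zlib.crc32(feature.encode()) % FP_BITS)
    return fp


def passes_screen(query_fp, target_fp):
    """A target can only match if it has every bit set in the query."""
    return (query_fp & target_fp) == query_fp


def is_substructure(query, target):
    """Backtracking atom-by-atom match of the query graph onto the target."""
    q_atoms, t_atoms = query[0], target[0]
    q_adj, t_adj = neighbors(query), neighbors(target)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True
        q = len(mapping)  # next unmapped query atom
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Every already-mapped neighbor of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def substructure_search(query, database):
    """Return (candidates, hits): records passing the screen, then true matches."""
    query_fp = fingerprint(query)
    candidates = [rid for rid, (mol, fp) in database.items()
                  if passes_screen(query_fp, fp)]
    hits = [rid for rid in candidates if is_substructure(query, database[rid][0])]
    return candidates, hits


molecules = {
    1: (["C", "C", "C"], [(0, 1), (1, 2)]),               # propane
    2: (["C", "C", "O"], [(0, 1), (1, 2)]),               # ethanol
    3: (["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)]),  # 2-propanol
    4: (["C", "O", "C"], [(0, 1), (1, 2)]),               # dimethyl ether
}
# Fingerprints are precomputed once, as they would be when loading a database.
database = {rid: (mol, fingerprint(mol)) for rid, mol in molecules.items()}

query = (["C", "C", "C"], [(0, 1), (1, 2)])  # three-carbon chain
candidates, hits = substructure_search(query, database)
print("candidates:", candidates, "hits:", hits)
```

Ethanol passes the screen here because it contains a carbon atom and a carbon-carbon bond, the only features in the query, yet it fails the atom-by-atom check because it has just two carbons. That is the same narrowing from candidates to hits that happens in Sally's search.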
Next, your server prepares a webpage to display the thirteen matching records. A link to each structure record is provided in the view.
After submitting her query, Sally then sees a results page listing thirteen records, along with structure images. Clicking on the first takes her to the aspirin summary page.
Enabling exact- and substructure search on a web page requires a number of components working together. Although no step or component is overly complicated, getting all of the pieces to work together smoothly can take some effort and expertise, as can the identification of the best components and site layout.
If you'd like to learn more about enabling exact- or substructure search on your website, I'd like to hear from you.