Signals Blog

Parsing SMILES from Scratch in JavaScript

SMILES is a ubiquitous language for storing and transmitting chemical structures. In most situations, a detailed understanding of how SMILES parsers work is unnecessary because that job is routinely handled by a toolkit.

But some situations call for much deeper understanding, for example:

  1. A new platform emerges: Robust parsers are not yet available. Think JavaScript, iOS, and Go.
  2. Performance bogs down: A standard parser is too slow and does a lot of work that can be eliminated. Think LINGO.
  3. Science stagnates: SMILES has limits that may eventually hinder scientific progress, either on your project or more broadly. Think OpenSMILES, InChI, or languages yet to be developed.

Early work on a SMILES plugin for ChemWriter served as my motivation to better understand the language. Because ChemWriter's environment is a web browser (or browser component), the first choice for implementation language is JavaScript.

This article, the first in a series, explains how I came to the conclusion that the best person to write a SMILES parser in JavaScript may actually be a computer.

SMILES Parsing from 20,000 Feet

The process of converting a raw SMILES string into a representation capable of solving a chemical problem can be divided into two main steps:

  1. Tokenization: Divide a line of text into a series of chunks representing the various components of the SMILES language. These include atoms, bonds, chains, branches, and ring closures.
  2. Token Manipulation: Given an ordered list of tokens, generate a representation suitable for the task at hand. The complexity of these tasks could span a wide range. For example, calculating exact molecular mass from SMILES tokens would be relatively simple. On the other hand, creating a record in an indexed chemical database would be considerably more complicated.

Perhaps surprisingly, tokenizing SMILES strings with high fidelity (Step 1) can be a much more difficult and labor-intensive process than subsequent manipulation of the tokens.
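The two steps above can be sketched in a few lines of JavaScript. This is only an illustration under loud assumptions: the function names `tokenize` and `countAtoms` are hypothetical, and the tokenizer covers a small subset of SMILES (unbracketed organic atoms, bonds, branches, and single-digit ring closures). Bracket atoms, two-digit ring closures, and disconnected fragments are deliberately left out, which hints at how much work high-fidelity tokenization really involves.

```javascript
// Step 1 (sketch): split a SMILES string into tokens.
// Assumption: only a small subset of the language is recognized.
function tokenize(smiles) {
  const tokens = [];
  let i = 0;
  while (i < smiles.length) {
    const ch = smiles[i];
    const two = smiles.slice(i, i + 2);
    if (two === 'Cl' || two === 'Br') {
      // Two-letter symbols must be checked before single letters.
      tokens.push({ type: 'atom', text: two });
      i += 2;
    } else if ('BCNOPSFI'.includes(ch) || 'bcnops'.includes(ch)) {
      tokens.push({ type: 'atom', text: ch });
      i += 1;
    } else if ('-=#:/\\'.includes(ch)) {
      tokens.push({ type: 'bond', text: ch });
      i += 1;
    } else if (ch === '(' || ch === ')') {
      tokens.push({ type: 'branch', text: ch });
      i += 1;
    } else if (ch >= '0' && ch <= '9') {
      tokens.push({ type: 'ring', text: ch });
      i += 1;
    } else {
      throw new SyntaxError(`Unexpected '${ch}' at position ${i}`);
    }
  }
  return tokens;
}

// Step 2 (sketch): a simple token manipulation that counts
// atoms by element symbol. Aromatic (lowercase) symbols are
// folded into their uppercase element.
function countAtoms(tokens) {
  const counts = {};
  for (const t of tokens) {
    if (t.type !== 'atom') continue;
    const symbol = t.text[0].toUpperCase() + t.text.slice(1);
    counts[symbol] = (counts[symbol] || 0) + 1;
  }
  return counts;
}

// Acetyl chloride: two carbons, one oxygen, one chlorine.
console.log(countAtoms(tokenize('CC(=O)Cl'))); // { C: 2, O: 1, Cl: 1 }
```

Note how short Step 2 is compared with Step 1, even for this reduced subset of the language.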

The Standard Approach: Hand-Crafted Parsers

Andrew Dalke notes that most SMILES parsers in existence today arose from a highly manual process, citing the example of the Open Babel SMILES implementation that distinguishes "C", "Cl", "N" and "O":

Open Babel SMILES Parser
if (isupper(*_ptr)) {
  switch(*_ptr) {
    case 'C':
      _ptr++;
      if (*_ptr == 'l')
        {
          strcpy(symbol,"Cl");
          element = 17;
        }
      else
        {
          symbol[0] = 'C';
          element = 6;
          _ptr--;
        }
      break;

    case 'N':
      element = 7;
      symbol[0] = 'N';
      break;
    case 'O':
      element = 8;
      symbol[0] = 'O';
      break;

    // etc
  }
}

Let's be clear: working code rules, and Open Babel has been parsing SMILES successfully for years. Nevertheless, the hand-crafted parser approach leads to three important consequences:

  1. Language lock-in: Moving this code to any other programming language will be challenging at best. Deep understanding of syntax and semantics remains locked inside the codebase.

  2. Painful refactoring: Deep control logic makes it difficult to make cross-cutting changes. This matters less for mature parsers, but for rapidly evolving parsers it can spell disaster.

  3. Swimming in code: There's no place a new developer can turn to for a high-level overview of how this code works.

More recently Orion Jankowski raised similar concerns based on his own work toward a SMILES parser for Haskell.

The Alternative: Parser Generator

These problems are, of course, not unique to SMILES: any computer language or file format ever created has faced similar issues. Fortunately, a very powerful solution has been developed in the form of parser generators.

Given a high-level description of a language (also known as a "grammar"), a parser generator can produce a working parser automatically. The problem of writing a parser then boils down to formulating a grammar. Because grammars encode high-level concepts concisely, they serve as a much better human-to-human communication medium than raw code. Grammars can also be machine-transformed not just into working software but into other useful forms, such as graphical visualizations.
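To make the idea concrete, here is a toy sketch in JavaScript. It is not a real parser generator, and it is not how any particular tool works: the names `smilesGrammar`, `generateTokenizer`, and `describeGrammar` are hypothetical, and the "grammar" is just an ordered list of regular-expression rules for a small SMILES subset. It only illustrates the separation of concerns: the language description lives in one data structure, and both the working tokenizer and a human-readable summary are derived from it mechanically.

```javascript
// Assumption: a heavily simplified grammar for a SMILES subset.
// Each rule names a token type and the pattern that recognizes it.
// Rules are tried in order, so earlier rules take priority.
const smilesGrammar = [
  { type: 'bracket_atom', pattern: /\[[^\]]+\]/ },
  { type: 'atom',         pattern: /Cl|Br|[BCNOPSFI]|[bcnops]/ },
  { type: 'bond',         pattern: /[-=#:\/\\]/ },
  { type: 'branch_open',  pattern: /\(/ },
  { type: 'branch_close', pattern: /\)/ },
  { type: 'ring_closure', pattern: /%[0-9]{2}|[0-9]/ },
];

// Toy "generator": turns any such grammar into a tokenizer function.
function generateTokenizer(grammar) {
  // Recompile each pattern as a sticky regex ('y' flag) so that it
  // can only match exactly at the current position.
  const rules = grammar.map(({ type, pattern }) => ({
    type,
    regex: new RegExp(pattern.source, 'y'),
  }));
  return (input) => {
    const tokens = [];
    let pos = 0;
    while (pos < input.length) {
      let matched = false;
      for (const { type, regex } of rules) {
        regex.lastIndex = pos;
        const m = regex.exec(input);
        if (m) {
          tokens.push({ type, text: m[0] });
          pos += m[0].length;
          matched = true;
          break;
        }
      }
      if (!matched) {
        throw new SyntaxError(`No rule matches at position ${pos}`);
      }
    }
    return tokens;
  };
}

// The same grammar rendered in another form: one line of text per rule.
function describeGrammar(grammar) {
  return grammar.map((r) => `${r.type} ::= /${r.pattern.source}/`);
}

const tokenizeSmiles = generateTokenizer(smilesGrammar);
console.log(tokenizeSmiles('C[NH3+]').map((t) => t.type));
console.log(describeGrammar(smilesGrammar).join('\n'));
```

Adding support for a new token type in this sketch means adding one rule to `smilesGrammar`; the control logic in `generateTokenizer` does not change.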

Conclusions

Using a parser generator can have far-reaching and beneficial consequences in a software project. This article hinted at some of the advantages. Future articles in this series will illustrate them in more detail.