Parsing SMILES from Scratch in JavaScript
SMILES is a ubiquitous language for storing and transmitting chemical structures. In most situations, detailed understanding of how SMILES parsers work is not necessary because that job is routinely handled by a toolkit.
But some situations call for much deeper understanding, for example:
- A new platform emerges: Robust parsers are not yet available. Think JavaScript, iOS, and Go.
- Performance bogs down: A standard parser is too slow and does a lot of work that can be eliminated. Think LINGO.
- Science stagnates: SMILES has limits that may eventually hinder scientific progress, either on your project or more broadly. Think OpenSMILES, InChI, or languages yet to be developed.
Early work on a SMILES plugin for ChemWriter served as my motivation to better understand the language. Because ChemWriter's environment is a web browser (or browser component), the first choice for implementation language is JavaScript.
This article, the first in a series, explains how I came to the conclusion that the best person to write a SMILES parser in JavaScript may actually be a computer.
SMILES Parsing from 20,000 Feet
The process of converting a raw SMILES string into a representation capable of solving a chemical problem can be divided into two main steps:
- Tokenization: Divide a line of text into a series of chunks representing the various components of the SMILES language. These include atoms, bonds, chains, branches, and ring closures.
- Token Manipulation: Given an ordered list of tokens, generate a representation suitable for the task at hand. The complexity of these tasks could span a wide range. For example, calculating exact molecular mass from SMILES tokens would be relatively simple. On the other hand, creating a record in an indexed chemical database would be considerably more complicated.
Perhaps surprisingly, tokenizing SMILES strings with high fidelity (Step 1) can be a much more difficult and labor-intensive process than subsequent manipulation of the tokens.
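The two steps can be sketched in a few lines of JavaScript. This is only a minimal illustration under heavy assumptions: it recognizes a small subset of SMILES (no bracket atoms, charges, isotopes, or stereochemistry), and the names TOKEN_PATTERN, tokenize, classify, and countAtoms are hypothetical, not part of ChemWriter or any toolkit. The gap between this subset and the full language is a fair indication of why high-fidelity tokenization is the harder step.

```javascript
// Hypothetical sketch: covers only a small subset of SMILES.
// Two-letter symbols are listed first so "Cl" is not read as "C" then "l".
// The sticky flag (y) forces each match to start exactly at lastIndex.
const TOKEN_PATTERN = /Cl|Br|[BCNOPSFI]|[bcnops]|[-=#]|[()]|%\d{2}|\d/y;

// Step 1: tokenization. Split the string into typed chunks.
function tokenize(smiles) {
  const tokens = [];
  let position = 0;
  while (position < smiles.length) {
    TOKEN_PATTERN.lastIndex = position;
    const match = TOKEN_PATTERN.exec(smiles);
    if (match === null) {
      throw new Error(
        `Unexpected character "${smiles[position]}" at index ${position}`
      );
    }
    tokens.push({ type: classify(match[0]), text: match[0] });
    position = TOKEN_PATTERN.lastIndex;
  }
  return tokens;
}

// Assign a coarse category to the text of a single token.
function classify(text) {
  if (/^[A-Za-z]/.test(text)) return 'atom';
  if (/^[-=#]$/.test(text)) return 'bond';
  if (text === '(') return 'branchOpen';
  if (text === ')') return 'branchClose';
  return 'ringClosure';
}

// Step 2: token manipulation. Here, simply count heavy atoms per element.
function countAtoms(tokens) {
  const counts = {};
  for (const token of tokens) {
    if (token.type !== 'atom') continue;
    // Aromatic atoms are written in lowercase; normalize "c" to "C".
    const element = token.text[0].toUpperCase() + token.text.slice(1);
    counts[element] = (counts[element] || 0) + 1;
  }
  return counts;
}
```

For example, countAtoms(tokenize('ClCC(=O)c1ccccc1')) returns { Cl: 1, C: 8, O: 1 }. Implicit hydrogens are not counted, so an exact mass calculation would still need valence rules on top of this.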
The Standard Approach: Hand-Crafted Parsers
Andrew Dalke notes that most SMILES parsers in existence today arose from a highly manual process, citing the example of the Open Babel SMILES implementation that distinguishes "C", "Cl", "N" and "O":
    if (isupper(*_ptr)) {
      switch(*_ptr) {
        case 'C':
          _ptr++;
          if (*_ptr == 'l')
          {
            strcpy(symbol,"Cl");
            element = 17;
          }
          else
          {
            symbol[0] = 'C';
            element = 6;
            _ptr--;
          }
          break;
        case 'N':
          element = 7;
          symbol[0] = 'N';
          break;
        case 'O':
          element = 8;
          symbol[0] = 'O';
          break;
        // etc
      }
    }
Let's be clear: working code rules, and Open Babel has been parsing SMILES successfully for years. Nevertheless, the hand-crafted parser approach leads to three important consequences:
- Language lock-in: Moving this code to any other programming language will be challenging at best. Deep understanding of syntax and semantics remains locked inside the codebase.
- Painful refactoring: Deep control logic makes it difficult to make cross-cutting changes. This is less important with mature parsers, but for rapidly evolving parsers it can spell disaster.
- Swimming in code: There's no place a new developer can turn to for a high-level overview of how this code works.
More recently Orion Jankowski raised similar concerns based on his own work toward a SMILES parser for Haskell.
The Alternative: Parser Generator
These problems are, of course, not unique to SMILES: any computer language or file format ever created has faced similar issues. Fortunately, a very powerful solution has been developed in the form of parser generators.
Given a high-level description of any language (also known as a "grammar"), a parser generator can produce a working parser automatically. The problem of writing a parser then boils down to formulating a grammar. Because grammars encode high-level concepts concisely, they serve as a much better human-to-human communication medium than raw code. Grammars can also be machine-transformed, not just into working software but into other useful forms as well, such as graphical visualizations.
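To make the idea concrete, here is a toy sketch in JavaScript of the step from grammar to working code. It is not a real parser generator and does not stand in for any particular tool: the grammar notation (nested arrays with the operators lit, ref, seq, choice, opt, and many) and the functions compile and generateRecognizer are all hypothetical, invented for this illustration. The grammar covers only chains, branches, a few bonds, and five elements, and the generated function only answers yes or no rather than building a data structure.

```javascript
// Hypothetical toy grammar for a tiny SMILES subset (no rings, no brackets).
// The whole language description fits in four rules.
const grammar = {
  chain: ['seq', ['ref', 'atom'],
    ['many', ['choice',
      ['ref', 'branch'],
      ['seq', ['opt', ['ref', 'bond']], ['ref', 'atom']]]]],
  branch: ['seq', ['lit', '('], ['opt', ['ref', 'bond']],
    ['ref', 'chain'], ['lit', ')']],
  atom: ['choice', ['lit', 'Cl'], ['lit', 'Br'],
    ['lit', 'C'], ['lit', 'N'], ['lit', 'O']],
  bond: ['choice', ['lit', '-'], ['lit', '='], ['lit', '#']],
};

// Turn one grammar expression into a function (input, pos) -> new pos,
// where -1 signals that the expression does not match at pos.
function compile(expr, rules) {
  const [kind, ...args] = expr;
  switch (kind) {
    case 'lit': {
      const text = args[0];
      return (input, pos) =>
        input.startsWith(text, pos) ? pos + text.length : -1;
    }
    case 'ref': {
      const name = args[0];
      // Looked up lazily so that rules can refer to each other recursively.
      return (input, pos) => rules[name](input, pos);
    }
    case 'seq': {
      const parts = args.map((arg) => compile(arg, rules));
      return (input, pos) => {
        for (const part of parts) {
          pos = part(input, pos);
          if (pos < 0) return -1;
        }
        return pos;
      };
    }
    case 'choice': {
      // Ordered choice: the first alternative that matches wins.
      const options = args.map((arg) => compile(arg, rules));
      return (input, pos) => {
        for (const option of options) {
          const next = option(input, pos);
          if (next >= 0) return next;
        }
        return -1;
      };
    }
    case 'opt': {
      const inner = compile(args[0], rules);
      return (input, pos) => {
        const next = inner(input, pos);
        return next >= 0 ? next : pos;
      };
    }
    case 'many': {
      const inner = compile(args[0], rules);
      return (input, pos) => {
        for (;;) {
          const next = inner(input, pos);
          if (next <= pos) return pos; // stop on failure or no progress
          pos = next;
        }
      };
    }
    default:
      throw new Error(`Unknown grammar expression: ${kind}`);
  }
}

// "Generate" a recognizer: compile every rule, then require that the
// start rule consumes the entire input.
function generateRecognizer(grammarRules, startRule) {
  const rules = {};
  for (const [name, expr] of Object.entries(grammarRules)) {
    rules[name] = compile(expr, rules);
  }
  return (input) => rules[startRule](input, 0) === input.length;
}
```

With const isValid = generateRecognizer(grammar, 'chain'), the call isValid('CC(=O)Cl') returns true, while isValid('CC(') and isValid('C=') return false. Supporting another element or bond type means editing one line of the grammar, and the control logic in compile stays untouched.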
Conclusions
Using a parser generator can have far-reaching and beneficial consequences in a software project. This article hinted at some of the advantages. Future articles in this series will illustrate them in more detail.