Create a SMILES Grammar and Parser with PEG.js

Most SMILES parsers in use today were hand-crafted. In other words, a team of developers transcribed a written specification into detailed instructions in a general-purpose programming language. The task is tedious, error-prone, and time-consuming - exactly the kind of work that computers excel at.

Parser generators offer an automated alternative capable of transforming a high-level language specification into running code. This article, part of a continuing series, demonstrates the process of building a SMILES grammar and auto-generated parser with PEG.js.

Baby Talk

SMILES is a non-trivial language capable of representing a large swath of known chemistry. Rather than diving straight into a full grammar, let's start with a subset consisting of a few basic features.

Consider a dialect of SMILES that encodes only unbranched, saturated carbon chains:

Straight-chain saturated hydrocarbon SMILES subset
String   Substance
C        Methane
CC       Ethane
CCC      Propane

Using the online PEG.js tool, we can define a grammar for this language.

Straight-chain saturated hydrocarbon SMILES subset grammar
SMILES = atom+

atom = 'C'
 

This grammar can be entered into the left-hand side of the online PEG.js tool, with sample input entered on the right. Parsing the string 'CCC' returns the expected result.

JSON result of parsing 'CCC'
[
   "C",
   "C",
   "C"
]
 

Supporting More Atom Types

SMILES supports a range of atom types in the so-called "organic subset". Let's add them as well.

Straight-Chain Organic Subset Atoms
SMILES = atom+

atom = 'B' 'r'? / 'C' 'l'? / 'N' / 'O' / 'P' / 'S' / 'F' / 'I'
 

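In PEG.js, alternatives are separated by a forward slash (/) and are tried left to right; the first alternative that matches wins. Chlorine is therefore written as 'C' followed by an optional 'l' rather than as a separate 'Cl' alternative listed after 'C', which would never be reached. Boron and bromine are handled the same way.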
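
To see why the order of alternatives matters, here is a minimal plain-JavaScript sketch of ordered choice. It is not how PEG.js is implemented internally; firstMatch and tokenize are hypothetical helpers written only for this illustration.

```javascript
// Hypothetical helper: try each literal alternative in order at position `pos`
// and return the first one that matches, as an ordered choice does.
function firstMatch(alternatives, input, pos) {
  for (const alt of alternatives) {
    if (input.startsWith(alt, pos)) {
      return alt;
    }
  }
  return null; // no alternative matched
}

// Hypothetical helper: repeat firstMatch until the input is consumed (like atom+).
function tokenize(alternatives, input) {
  const atoms = [];
  let pos = 0;
  while (pos < input.length) {
    const atom = firstMatch(alternatives, input, pos);
    if (atom === null) {
      throw new Error('No alternative matches at position ' + pos);
    }
    atoms.push(atom);
    pos += atom.length;
  }
  return atoms;
}

// Longer symbol listed first: chlorine is recognized.
console.log(tokenize(['Cl', 'C'], 'CCl')); // [ 'C', 'Cl' ]

// Shorter symbol listed first: 'C' always wins and the stray 'l' is an error.
try {
  tokenize(['C', 'Cl'], 'CCl');
} catch (e) {
  console.log(e.message); // No alternative matches at position 2
}
```

With the shorter symbol first, the 'Cl' alternative is never reached, which is the situation the grammar above avoids.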

Running the resulting parser on the string 'BrCCCl' returns the expected result.

JSON result of parsing 'BrCCCl'
[
   [
      "B",
      "r"
   ],
   [
      "C",
      ""
   ],
   [
      "C",
      ""
   ],
   [
      "C",
      "l"
   ]
]
 

Our grammar states that each atom consists of one or two characters. PEG.js fulfilled this request by returning an array of arrays, each containing one or two matched characters.

A practical SMILES parser, however, should represent each element symbol as a single string. How can we get PEG.js to do this?

Mixing Code with Grammar

PEG.js supports the transformation of matched elements using inline JavaScript functions, called actions. For example, to get our parser to return one string for each element symbol, we label the matched expression and attach an action that converts the labeled match to a string.

PEG.js grammar containing inlined JavaScript function
SMILES = atom+

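atom = symbol:('B' 'r'? / 'C' 'l'? / 'N' / 'O' / 'P' / 'S' / 'F' / 'I') {
  // Two-part alternatives (B/Br, C/Cl) match as arrays; the rest match as plain strings.
  return Array.isArray(symbol) ? symbol.join('') : symbol;
}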
 

The new parser now returns an array of strings.

JSON result of parsing 'BrCCCl'
[
   "Br",
   "C",
   "C",
   "Cl"
]
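
The effect of the action can be reproduced outside the grammar. The following is a minimal sketch in plain JavaScript, assuming the raw array-of-arrays result shown earlier; toSymbol is a hypothetical stand-in for the inline action, not something PEG.js provides.

```javascript
// Hypothetical stand-in for the inline action: turn one raw atom match into a string.
function toSymbol(match) {
  // Two-part matches such as ['C', 'l'] arrive as arrays; join() renders a
  // missing optional part ('' or null) as an empty string.
  // Single-letter alternatives such as 'N' arrive as plain strings.
  return Array.isArray(match) ? match.join('') : match;
}

// Raw result of parsing 'BrCCCl' before any action is applied.
const raw = [['B', 'r'], ['C', ''], ['C', ''], ['C', 'l']];

console.log(raw.map(toSymbol)); // [ 'Br', 'C', 'C', 'Cl' ]
```

The array check matters because only the two-part alternatives produce arrays; calling join on a plain string such as 'N' would throw.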
 

Notice how this approach leads to both a grammar and a parser. Both remain synchronized throughout the development cycle. Not only can we parse the SMILES language, but we can easily communicate how both the language and parser work to non-experts. Fixing parsing bugs automatically results in fixing the grammar - and vice versa.

Smidge is a complete SMILES parser developed using the procedure described here.

Using the Parser

The parser generated by PEG.js is a standalone JavaScript module that accepts arbitrary SMILES input. To obtain the parser, click the "Download Parser" button in the lower-right of the online tool.

A parser can also be produced from a command-line build tool. Given an environment with both Node.js and the pegjs package, a short program prints the parser source code.

PEG.js from the command line
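// Load the pegjs package from the local Node.js environment.
var PEG = require('pegjs');
// The straight-chain carbon grammar from above, as a single string.
var grammar = 'SMILES = atom+\natom = \'C\'';
// Build a parser object; buildParser and toSource belong to older PEG.js
// releases, and newer ones changed this API.
var parser = PEG.buildParser(grammar);

// Print the generated parser's JavaScript source.
console.log(parser.toSource());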
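
Once a parser has been obtained, it is used by calling its parse method, which returns the parsed result or throws an exception on invalid input. The sketch below wraps that call. Both tryParse and stubParser are hypothetical names introduced here: stubParser stands in for a parser generated from the carbon-chain grammar so that the example runs without pegjs installed.

```javascript
// Hypothetical helper: call parser.parse and report success or failure
// instead of letting the exception propagate.
function tryParse(parser, input) {
  try {
    return { ok: true, atoms: parser.parse(input) };
  } catch (e) {
    return { ok: false, message: e.message };
  }
}

// Hypothetical stand-in for a generated parser of the carbon-chain grammar
// (SMILES = atom+, atom = 'C'); a real one would come from PEG.js.
const stubParser = {
  parse(input) {
    if (!/^C+$/.test(input)) {
      throw new Error('Expected "C" in ' + JSON.stringify(input));
    }
    return input.split('');
  }
};

console.log(tryParse(stubParser, 'CCC')); // { ok: true, atoms: [ 'C', 'C', 'C' ] }
console.log(tryParse(stubParser, 'CXC').ok); // false
```

Swapping stubParser for a generated parser should leave tryParse unchanged, since it relies only on parse and on the message property of the thrown error.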
 

Conclusions

Parsers for SMILES and many other languages can be developed with PEG.js via a two-step iterative procedure:

  1. Define a grammar component capable of matching a SMILES language feature.
  2. Write an inline JavaScript function to process the captured feature.

An important advantage over the more traditional manual approach is that using the parser generator results in both a working parser and a grammar. The grammar can in turn be used as high-level documentation and as a starting point for automated tools.