Signals Blog

Reading Chemical Structures from Images with OSRA 2.0

Version 2.0 of the Optical Structure Recognition Application (OSRA) has been released. This open source software does for chemical structure images what Optical Character Recognition programs do for printed documents. In other words, OSRA converts images of chemical structures into machine-readable chemical structures. Uses include automated mining/analysis of the chemical literature and patents.

OSRA can also be used through a web service. Unfortunately, that service currently appears to run the older 1.4.0 release.

Installing OSRA

OSRA is difficult to build from source because of its numerous dependencies. The build documentation is a good start, but your particular platform will likely require some changes. The availability of precompiled OSRA binaries (for a price) suggests how tricky the build can be.

Given the widespread popularity of Ubuntu, step-by-step documentation for building OSRA 2.0 on this platform seemed worthwhile. A detailed procedure for building and installing OSRA 1.3.8 on Ubuntu 11.10 already exists, but both that OSRA release and that Ubuntu release are now relatively old.

Using the available documentation as a starting point, the following procedure was developed to compile and install OSRA 2.0 on a clean, build-enabled Ubuntu 13.04 system.
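Before starting, it may help to confirm that the system really is "build-enabled". The sketch below checks for a handful of tools on the PATH. The tool list (git, cmake, make, g++) is an assumption based on the commands used in this procedure, and the helper name missing_tools is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is NOT on PATH,
# one name per line. Prints nothing when every tool is found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed tool list: git, cmake and make appear in the steps of this
# procedure; g++ is a guess at the compiler a "build-enabled" system has.
missing=$(missing_tools git cmake make g++)
if [ -n "$missing" ]; then
  echo "Missing build tools:" $missing
else
  echo "All assumed build tools found."
fi
```

On Ubuntu, anything reported as missing can usually be installed with apt-get; the build-essential package typically provides make and g++.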

Install Dependency Packages

$ sudo apt-get install libtclap-dev libpotrace0 libpotrace-dev libocrad-dev libgraphicsmagick++1-dev libgraphicsmagick++3 libgraphicsmagick1-dev libgraphicsmagick3 libnetpbm10-dev
 

Install Patched GOCR

$ git clone https://github.com/metamolecular/gocr-patched.git
$ cd gocr-patched
$ ./configure
$ make libs
$ sudo make all install
 

Install Open Babel 2.3.2

The Open Babel package installed by apt-get didn't work: compiling OSRA against it failed with an error complaining that UnknownWinding could not be found. Compiling Open Babel from source solved the problem.

$ git clone https://github.com/openbabel/openbabel.git
$ cd openbabel
$ git checkout openbabel-2-3-2
$ mkdir build
$ cd build
$ cmake ..
$ make -j2
$ sudo make install
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib' >> ~/.profile
$ . ~/.profile
 

The last two lines are necessary only if you receive this error on running OSRA:

osra: error while loading shared libraries: libopenbabel.so.4: cannot open shared object file: No such file or directory
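To confirm that a missing library is the cause before editing ~/.profile, one option is to ask the dynamic linker which libraries the osra binary cannot resolve. This is a minimal sketch that assumes the standard ldd tool is available; the helper name unresolved_libs is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on stdin and print the name of
# every shared library that the dynamic linker reports as "not found".
unresolved_libs() {
  awk '/=> not found/ { print $1 }'
}

# Only inspect osra if it is actually on PATH.
if command -v osra >/dev/null 2>&1; then
  ldd "$(command -v osra)" | unresolved_libs
fi
```

If libopenbabel.so.4 shows up in the output, the LD_LIBRARY_PATH change above is likely needed. No output suggests that every library was resolved.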

Install OSRA 2.0

$ git clone https://github.com/metamolecular/osra.git
$ cd osra
$ ./configure --with-openbabel-include=/usr/local/include/openbabel-2.0 --with-openbabel-lib=/usr/local/lib/openbabel
$ make all
$ sudo make install
 

Using OSRA

Running OSRA with default settings on a sample image will, after a few seconds, generate a list of SMILES representing the contained structures.

$ wget http://osra.sourceforge.net/patent.gif
$ osra patent.gif 
OC(=O)C(CCCCOCCCCC(C(=O)O)(C)C)(C)C
O=C(C(C)(C)C)OCOP(=O)(C(S(=O)(=O)O)CCCc1cccc(c1)Oc1ccccc1)OCOC(=O)C(C)(C)C
CCC(CC1(CCCCC1)C(=O)Nc1ccccc1SC(=O)C(C)C)CC
OC(=O)[C@@H](Nc1ccccc1C(=O)c1ccccc1)Cc1ccc(cc1)OCCc1nc(oc1C)c1ccccc1
COc1cccc(c1)N1CCN(CC1)CC(=O)NC1c2c(O)c(C)cc(c2CC1(C)C)C
O=C1ONC(=O)C1Cc1ccc(cc1)OCCc1nc(oc1C)c1ccccc1
OC(=O)C[C@]1(O)C[C@@H](OC1=O)CCCCCCc1ccc(cc1Cl)Cl
CCOP(=O)(Cc1ccc(cc1)C(=O)Nc1ccc(cc1C#N)Br)OCC
O=C(C(=O)O)Nc1ccccc1C(=O)O
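
For literature or patent mining, the same command can be run over a whole directory of images. The following is a rough sketch rather than a tested workflow: the function name run_osra_batch, the OSRA_CMD variable, and the .smi output naming are all assumptions made for illustration, and only the plain "osra image" invocation shown above is used.

```shell
#!/bin/sh
# Command used to process one image; defaults to the installed osra binary.
# OSRA_CMD is a hypothetical override, not an OSRA feature.
OSRA_CMD="${OSRA_CMD:-osra}"

# Hypothetical helper: run OSRA on every .gif and .png in in_dir and
# write the SMILES for each image to out_dir/<image name>.smi.
run_osra_batch() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for img in "$in_dir"/*.gif "$in_dir"/*.png; do
    # Skip patterns that matched no file.
    [ -e "$img" ] || continue
    base=$(basename "$img")
    "$OSRA_CMD" "$img" > "$out_dir/$base.smi" ||
      echo "warning: OSRA failed on $img" >&2
  done
}
```

Calling run_osra_batch images smiles would then leave one file of SMILES per image in the smiles directory, for example smiles/patent.gif.smi.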