Reading Chemical Structures from Images with OSRA 2.0
Version 2.0 of the Optical Structure Recognition Application (OSRA) has been released. This open source software does for chemical structure images what Optical Character Recognition programs do for printed documents. In other words, OSRA converts images of chemical structures into machine-readable chemical structures. Uses include automated mining/analysis of the chemical literature and patents.
OSRA can also be used via a web service. Unfortunately, that service currently appears to run the older 1.4.0 release.
Installing OSRA
OSRA is difficult to build from source because of its numerous dependencies. The build documentation is a good start, but your particular platform will likely require some changes. The availability of precompiled OSRA binaries (for a price) suggests how tricky the build can be.
Given the widespread popularity of Ubuntu, it seemed that step-by-step documentation for building OSRA 2.0 on this platform could be helpful. A detailed procedure for building and installing OSRA 1.3.8 on Ubuntu 11.10 was written, but both that OSRA release and that Ubuntu version are now relatively old.
Using the available documentation as a starting point, the following procedure was developed to compile and install OSRA 2.0 on a clean, build-enabled Ubuntu 13.04 system.
Install Dependency Packages
$ sudo apt-get install libtclap-dev libpotrace0 libpotrace-dev libocrad-dev libgraphicsmagick++1-dev libgraphicsmagick++3 libgraphicsmagick1-dev libgraphicsmagick3 libnetpbm10-dev
Install Patched GOCR
$ git clone https://github.com/metamolecular/gocr-patched.git
$ cd gocr-patched
$ ./configure
$ make libs
$ sudo make all install
Install Open Babel 2.3.2
The Open Babel package installed by apt-get didn't work, resulting in a compile-time error complaining about UnknownWinding not being found. Compiling Open Babel from source solved the problem.
$ git clone https://github.com/openbabel/openbabel.git
$ cd openbabel
$ git checkout openbabel-2-3-2
$ mkdir build
$ cd build
$ cmake ..
$ make -j2
$ sudo make install
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib' >> ~/.profile
$ . ~/.profile
The last two lines are necessary only if you receive this error on running OSRA:
osra: error while loading shared libraries: libopenbabel.so.4: cannot open shared object file: No such file or directory
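Because ~/.profile can be sourced more than once, appending to LD_LIBRARY_PATH unconditionally may leave duplicate entries. The following is a minimal sketch of a guard against that; the function name add_lib_dir is hypothetical and not part of OSRA or Open Babel.

```shell
# add_lib_dir (hypothetical helper): append a directory to LD_LIBRARY_PATH
# only if it is not already listed, then export the variable.
add_lib_dir() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*)
      # Directory already present; nothing to do.
      ;;
    *)
      # Add a separating colon only when the variable is non-empty.
      LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1"
      ;;
  esac
  export LD_LIBRARY_PATH
}
```

Placing this function in ~/.profile followed by the call add_lib_dir /usr/local/lib should have the same effect as the export line above, without growing the variable each time the file is sourced.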
Install OSRA 2.0
$ git clone https://github.com/metamolecular/osra.git
$ cd osra
$ ./configure --with-openbabel-include=/usr/local/include/openbabel-2.0 --with-openbabel-lib=/usr/local/lib/openbabel
$ make all
$ sudo make install
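After the install steps, a quick check that the expected executables are on the PATH can save some debugging later. The sketch below uses only the shell built-in command -v; the helper name check_tool is hypothetical, and the assumption that the builds above installed executables named osra, obabel, and gocr should be verified on your system.

```shell
# check_tool (hypothetical helper): report whether an executable is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found on PATH"
    return 1
  fi
}
```

For example, running check_tool osra, check_tool obabel, and check_tool gocr in turn would show which of the three are visible to the shell.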
Using OSRA
Running OSRA with default settings on a sample image will, after a few seconds, generate a list of SMILES representing the contained structures.
$ wget http://osra.sourceforge.net/patent.gif
$ osra patent.gif
OC(=O)C(CCCCOCCCCC(C(=O)O)(C)C)(C)C
O=C(C(C)(C)C)OCOP(=O)(C(S(=O)(=O)O)CCCc1cccc(c1)Oc1ccccc1)OCOC(=O)C(C)(C)C
CCC(CC1(CCCCC1)C(=O)Nc1ccccc1SC(=O)C(C)C)CC
OC(=O)[C@@H](Nc1ccccc1C(=O)c1ccccc1)Cc1ccc(cc1)OCCc1nc(oc1C)c1ccccc1
COc1cccc(c1)N1CCN(CC1)CC(=O)NC1c2c(O)c(C)cc(c2CC1(C)C)C
O=C1ONC(=O)C1Cc1ccc(cc1)OCCc1nc(oc1C)c1ccccc1
OC(=O)C[C@]1(O)C[C@@H](OC1=O)CCCCCCc1ccc(cc1Cl)Cl
CCOP(=O)(Cc1ccc(cc1)C(=O)Nc1ccc(cc1C#N)Br)OCC
O=C(C(=O)O)Nc1ccccc1C(=O)O
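For mining more than one image, it may help to record which file each SMILES came from. The following is a minimal sketch that relies only on the default behavior shown above (one SMILES per line on standard output); the function name osra_batch and the OSRA_BIN variable are hypothetical conveniences, not part of OSRA.

```shell
# OSRA_BIN (hypothetical): lets another binary or a stub be substituted for osra.
OSRA_BIN="${OSRA_BIN:-osra}"

# osra_batch (hypothetical helper): run OSRA on each image given as an
# argument and print "image<TAB>SMILES" for every non-empty output line.
osra_batch() {
  for img in "$@"; do
    "$OSRA_BIN" "$img" | while IFS= read -r smiles; do
      if [ -n "$smiles" ]; then
        printf '%s\t%s\n' "$img" "$smiles"
      fi
    done
  done
}
```

With OSRA installed, osra_batch patent.gif > structures.tsv would write a tab-separated file with one line per recognized structure.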