OSRA: Optical Structure Recognition
1. Description
2. Dependencies
3. Other acknowledgements
4. Compilation
5. Usage
6. License
7. Download
8. Web Interface
9. Author
Description:
OSRA is a utility designed to convert graphical representations of
chemical structures, as they appear in journal articles, patent documents,
textbooks, trade magazines etc., into SMILES (Simplified Molecular
Input Line Entry System - see
http://en.wikipedia.org/wiki/SMILES) or SD file format, both of which are
computer-readable molecular structure formats. OSRA can read a document
in any of the over 90 graphical formats parseable by ImageMagick - including
GIF, JPEG, PNG, TIFF, PDF, PS etc., and generate the SMILES or SDF representation of
the molecular structure images encountered within that document.
Note that any software designed for optical recognition is unlikely to
be perfect, and the output produced might, and probably will, contain
errors, so curation by a human knowledgeable in chemical structures
is highly recommended.
News:
- OSRA 1.3.3 - the tables (boxes) around the structures are detected and
removed prior to processing. Added -R (--rotate) command line switch to
rotate the image. Modified debug output (-d option) to show the output from
superatom dictionary.
- OSRA 1.3.2 is out. The speed is further improved (by a factor of 2-3x)
by replacing ImageMagick libraries with GraphicsMagick. Also fixed
spontaneous slowdowns on the Windows platform.
- OSRA 1.3.1 is out. Improved speed and various bug fixes. PDF
processing now honors "--resolution" and "-r" command line options.
You can get higher quality results (at the expense of slower speed) running
with the following command line: osra -r 300 -f sdf file.pdf
Also, you can see the page number for structures from PDF documents with
"-e" or "--page" option.
- OSRA 1.3.0 is now available. New features include:
- OS X install package - one click to install OSRA on a Mac,
- A plugin for ChemBioDraw from
ChemBioOffice 2010 (requires ChemScript 12.0) - converts images from the
clipboard into ChemBioDraw editor molecular objects,
- Better recognition of high resolution images (above 300 dpi),
- Improved Symyx Draw plugin.
- I have presented the new algorithm used by OSRA for text/graphics
separation at GREC 2009. It is the first paper in session 4; you can find it
in the "Proceedings".
- Version 1.2.2 is out. Superatom labels can now be edited by users:
superatom.txt contains the SMILES strings for each recognized label, and
spelling.txt contains spelling variants of every label for cases where the
OCR engine is not reliable. Please note that the dependencies have changed:
ocrad-0.18 is now required and RDKit support is temporarily suspended.
- Starting with version 1.2.1 there is a Windows installer which
automatically installs a plugin if Symyx Draw is present. It also detects
whether the Ghostscript and GflAx libraries are present and installs them
automatically if necessary.
- OSRA manuscript has been published:
"Optical Structure Recognition Software To Recover Chemical Information:
OSRA, An Open Source Solution"
J. Chem. Inf. Model., 2009, 49 (3), pp 740–743.
- Starting with version 1.2.0 plugins for BKChem, MolSketch, Symyx Draw,
and Scitegic PipelinePilot are now included with the Windows zip archive.
Plugins allow for integration of OSRA functionality with chemical structure
editors and other chemoinformatics software.
Dependencies:
OSRA needs the following Open Source libraries installed:
- GraphicsMagick, image manipulation library, version 1.3.7 or later;
if installing from RPM make sure you have the following packages:
GraphicsMagick
GraphicsMagick-devel
GraphicsMagick-c++-devel
GraphicsMagick-c++
http://www.graphicsmagick.org/
- POTRACE, vector tracing library, version 1.7 or later,
http://potrace.sourceforge.net/
- GOCR/JOCR, optical character recognition library, version 0.43 or later
- NOTE that version 0.46 is NOT RECOMMENDED! It lacks the necessary libPgm2asc
library,
http://jocr.sourceforge.net/
- OCRAD, optical character recognition program, version 0.18 is required,
http://www.gnu.org/software/ocrad/ocrad.html
- TCLAP, Templatized C++ Command Line Parser Library, version 1.1.0,
http://tclap.sourceforge.net/
- OpenBabel, open source chemistry toolbox, version 2.2.0 or later;
if installing from RPM make sure you have the following packages:
openbabel
openbabel-devel
http://openbabel.sourceforge.net/wiki/Main_Page
Other acknowledgements:
OSRA also makes use of the following software (you do not need to
install it separately, it's included in the distribution):
Compilation:
- Unpack the downloaded source code for the OCRAD package. Do not compile or
install it: OSRA will automatically patch it and compile the object files it
needs.
- Compile and/or install all the other necessary dependencies.
- Unpack the OSRA package.
- Edit the included Makefile to make sure you have the correct locations for
potrace, gocr, openbabel, and tclap. Also check the location of
GraphicsMagick++-config (a script that comes with the GraphicsMagick
installation).
- You might have to set LD_LIBRARY_PATH to /usr/local/lib or wherever you have
installed the OpenBabel libraries.
- Set the ARCH variable to one of the following: unix (for Linux, Unix and
OS X) or win32 (for building in the Windows MinGW environment).
- Run make, which should then generate the executable, osra.
More detailed instructions for compiling OSRA for OS X and Windows platforms are provided in the README file.
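Before running make, it can help to confirm that the helper programs the build relies on are reachable. The Python sketch below is not part of OSRA; the function name missing_tools and the default tool list are assumptions made for illustration, built only from names mentioned above (make and the GraphicsMagick++-config script).

```python
import shutil

# Tools named in the build instructions above. This list is an assumption
# for illustration; adjust it to match what your Makefile actually calls.
DEFAULT_TOOLS = ("make", "GraphicsMagick++-config")


def missing_tools(tools=DEFAULT_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed build tools were found.")
```

This only checks that the programs can be located; it does not verify library versions or the paths set in the Makefile.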
Usage:
OSRA can process the following types of images:
- Computer-generated 2D structures, such as found on the PubChem website,
http://pubchem.ncbi.nlm.nih.gov/,
black-and-white and color (use a resolution of 72 dpi),
- Black-and-white PDF and PostScript files, including multi-page ones. Please
note that you need Ghostscript installed for GraphicsMagick to be able to
parse these kinds of files. OSRA internally renders PS and PDF at a resolution
of 150 dpi; a higher rendering resolution can be achieved with the "-r" option,
- Scanned images - black-and-white, a resolution of 300 dpi is recommended,
though 150 dpi can also produce fair results. Please make sure the
scanned image is of reasonable quality - an input that's too noisy will
only generate garbage output.
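The resolutions listed above can be captured in a small lookup helper. This is a sketch only: the function name recommended_dpi, the scanned flag, and the idea of recognising PS/PDF input by file extension are assumptions for illustration and are not something OSRA itself does.

```python
import os

# Assumption: PostScript and PDF inputs are recognised by file extension.
PS_PDF_EXTENSIONS = {".pdf", ".ps"}


def recommended_dpi(filename, scanned=False):
    """Suggest a resolution in dpi, following the list above.

    - PS/PDF documents: 150 dpi (the resolution OSRA renders them at)
    - scanned images:   300 dpi (150 dpi may still give fair results)
    - computer-generated images, e.g. from PubChem: 72 dpi
    """
    extension = os.path.splitext(filename)[1].lower()
    if extension in PS_PDF_EXTENSIONS:
        return 150
    return 300 if scanned else 72
```

The caller has to say whether an image is a scan, since that cannot be told from the file name.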
Some common abbreviations, hetero atoms, fused and merged atomic
labels, hash and wedge bonds, and bridge bonds are currently
recognized. Formal charges, isotopes and some element
symbols, e.g. iodine ("I" looks too much like a straight line, i.e. a single
bond), are not.
Command-line options:
./osra --help
will give you a list of available options with short descriptions.
Most common use: ./osra [-r <resolution>] <filename>
The resolution is given in dpi; the default is 300 (unless it's a PS or PDF
file, as mentioned above). The filename is the name of your image file (or
PS/PDF document).
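For scripted use, the common invocation above can be wrapped in a few lines of Python. This is a minimal sketch, not an official interface: the function names osra_command and run_osra are hypothetical, the executable path ./osra is taken from the example above, and it is an assumption that OSRA writes its results to standard output.

```python
import subprocess


def osra_command(filename, resolution=None, executable="./osra"):
    """Build the argument list for: ./osra [-r <resolution>] <filename>."""
    command = [executable]
    if resolution is not None:
        command += ["-r", str(resolution)]
    command.append(filename)
    return command


def run_osra(filename, resolution=None, executable="./osra"):
    """Run OSRA and return the non-empty lines it prints.

    Assumption: results are written to standard output.
    """
    completed = subprocess.run(
        osra_command(filename, resolution, executable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return [line for line in completed.stdout.splitlines() if line.strip()]
```

For example, run_osra("patent.gif", resolution=300) would execute ./osra -r 300 patent.gif. Passing the arguments as a list avoids shell quoting problems with unusual file names.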
Other options:
-t, --threshold: Gray level threshold, default is 0.2
for black-and-white images,
-n, --negate: Inverts colors (for white on black images),
-o, --output: Sets a prefix for writing recognized images to files - e.g.
"-o tmp" will create files tmp0.png, tmp1.png... for
each of the structures,
-s, --size: Resize images on output - can be useful for running OSRA
as a backend for a webservice. Example: "-s 300x400".
-g, --guess: Prints out the resolution guess when you choose automatic
resolution estimation.
-p, --print: Prints out the value of confidence function estimate.
-f, --format: Output format (either smi for SMILES or sdf for SD file
format)
-d, --debug: Print out debug information on spelling corrections. First
column - output from the OCR engine, second - result of spelling
correction, last - SMILES from the superatom dictionary, if any.
-a configfile, --superatom configfile: Superatom label map to SMILES (superatom.txt by default)
-l configfile, --spelling configfile: Spelling correction dictionary (spelling.txt by default)
-e, --page: Show page number for structures from multi-page PDF and PostScript documents
-R, --rotate: Rotate image clockwise by the number of degrees, e.g. -R 90
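When OSRA is driven from another program, the options above can be assembled from keyword arguments. The sketch below uses only flags documented in this section; the function names osra_options and expected_image_names, and the keyword names, are hypothetical and chosen for illustration.

```python
def osra_options(resolution=None, fmt=None, page=False, rotate=None,
                 negate=False, output_prefix=None):
    """Translate keyword arguments into documented OSRA command-line flags."""
    options = []
    if resolution is not None:
        options += ["-r", str(resolution)]
    if fmt is not None:
        # Only the two documented output formats are accepted.
        if fmt not in ("smi", "sdf"):
            raise ValueError("format must be 'smi' or 'sdf'")
        options += ["-f", fmt]
    if page:
        options.append("-e")  # show page numbers for PDF/PS input
    if rotate is not None:
        options += ["-R", str(rotate)]  # clockwise rotation in degrees
    if negate:
        options.append("-n")  # white-on-black images
    if output_prefix is not None:
        options += ["-o", output_prefix]
    return options


def expected_image_names(prefix, count):
    """File names that "-o <prefix>" is described as creating for `count` structures."""
    return [f"{prefix}{index}.png" for index in range(count)]
```

For instance, ["./osra"] + osra_options(resolution=300, fmt="sdf") + ["file.pdf"] corresponds to the command line osra -r 300 -f sdf file.pdf mentioned in the News section.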
License:
This program is free software; the part of the software that was written
at the National Cancer Institute is in the public domain. This does not
preclude, however, that components such as specific libraries used in the
software may be covered by specific licenses, including but not limited
to the GNU General Public License as published by the Free Software Foundation;
either version 2 of the License, or (at your option) any later version;
which may impose specific terms for redistribution or modification.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307,
USA. See also http://www.gnu.org/.
See the file COPYING for details.
Download:
OSRA is Free and Open Source Software. You are welcome to download
and use it, provided that you understand the terms described above.
Participation in the development is highly encouraged!
Download OSRA
We also welcome your feedback - send us your comments, suggestions,
criticism, or praise to the contact email address below.
Web Interface:
To demonstrate the capabilities (and limitations) of OSRA we have created
the following web interface:
OSRA Web Interface
Try this sample image from the US Patent Office website first:
patent.gif. Use a resolution of 300 dpi.
Author:
Igor Filippov, igorf(AT)helix.nih.gov
2007-2008, SAIC-Frederick, NCI-Frederick, NIH, DHHS, Frederick, MD