
Article

Title: Towards Reverse Engineering of PDF Documents (English)
Author: Baker, Josef B.
Author: Sexton, Alan P.
Author: Sorge, Volker
Language: English
Journal: Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011
Volume:
Issue: 2011
Year:
Pages: 65-75
Category: math
Summary: We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then be a basis for further document processing. Our current tool uses specialised PDF extraction together with image analysis to produce near-perfect input for parsing mathematical formulae. Applying a linear grammar and specific drivers for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we will show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs. (English)
MSC: 68-06
MSC: 68U10
MSC: 68U15
MSC: 68U99
Date available: 2011-07-15T09:28:38Z
Last updated: 2012-08-27
Stable URL: http://hdl.handle.net/10338.dmlcz/702603
Reference: 1. Anderson, R.H.: Syntax-Directed Recognition of Hand-Printed Two-dimensional Mathematics. Ph.D. thesis, Harvard University, Cambridge, MA (1968). Zbl 0207.17806
Reference: 2. Baker, J., Sexton, A., Sorge, V.: A linear grammar approach to mathematical formula recognition from PDF. In: Proceedings of Intelligent Computer Mathematics (2009).
Reference: 3. Baker, J., Sexton, A.P., Sorge, V., Suzuki, M.: Comparing approaches to mathematical document analysis. In: 11th International Conference on Document Analysis and Recognition (to appear) (2011).
Reference: 4. Baker, J., Sexton, A., Sorge, V.: Faithful mathematical formula recognition from PDF documents. In: 9th IAPR International Workshop on Document Analysis Systems, Extended Abstracts. pp. 485–492. ACM Press, Boston, USA (2010).
Reference: 5. Garain, U.: Identification of mathematical expressions in document images. In: Document Analysis and Recognition, International Conference on. pp. 1340–1344. IEEE Computer Society, Los Alamitos, CA, USA (2009).
Reference: 6. Mittelbach, F., Goossens, M.: The LaTeX Companion. Pearson Education, 2e edn. (2005), TeX spacing table, page 525.
Reference: 7. Sternberg, S.: Theory of functions of a real variable. (2005), http://www.math.harvard.edu/~shlomo/docs/Real_Variables.pdf
Reference: 8. Suzuki, M., Uchida, S., Nomura, A.: A ground-truthed mathematical character and symbol image database. In: Proc. of ICDAR. pp. 675–679. IEEE Computer Society (2005).
Reference: 9. Suzuki, M.: Infty. (2011), http://www.inftyproject.org

Files

File: DML_004-2011-1_10.pdf
Size: 387.4Kb
Format: application/pdf