| Title:
             | 
Workflow of Metadata Extraction from Retro-Born Digital Documents (English) | 
| Author:
             | 
Tkaczyk, Dominika | 
| Author:
             | 
Bolikowski, Łukasz | 
| Language:
             | 
English | 
| Journal:
             | 
Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011 | 
| Volume:
             | 
 | 
| Issue:
             | 
2011 | 
| Year:
             | 
 | 
| Pages:
             | 
39-44 | 
| Category:
             | 
math | 
| Summary:
             | 
In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work. (English) | 
| Keyword:
             | 
metadata extraction | 
| Keyword:
             | 
page segmentation | 
| Keyword:
             | 
zone classification | 
| Keyword:
             | 
Hidden Markov Model | 
| MSC:
             | 
68-06 | 
| MSC:
             | 
68U10 | 
| MSC:
             | 
68U15 | 
| MSC:
             | 
68U99 | 
| Date available:
             | 
2011-07-15T09:26:55Z | 
| Last updated:
             | 
2012-08-27 | 
| Stable URL:
             | 
http://hdl.handle.net/10338.dmlcz/702601 | 
| Reference:
             | 
1. iText. http://itextpdf.com/. | 
| Reference:
             | 
2. MARG. http://marg.nlm.nih.gov/. Zbl 1143.68407 | 
| Reference:
             | 
3. PDFBox. http://pdfbox.apache.org/ | 
| Reference:
             | 
4. Automating the production of bibliographic records for MEDLINE. Tech. rep. (2001). | 
| Reference:
             | 
5. Cui, B., Chen, X.: An improved hidden Markov model for literature metadata extraction. Advanced Intelligent Computing Theories and Applications. pp. 205–212 (2010). | 
| Reference:
             | 
6. Hetzner, E.: A simple method for citation metadata extraction using Hidden Markov Models. In: JCDL ’08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries. pp. 280–284. ACM, New York, NY, USA (2008). | 
| Reference:
             | 
7. Marinai, S.: Metadata Extraction from PDF Papers for Digital Library Ingest. 10th International Conference on Document Analysis and Recognition. pp. 251–255 (2009). | 
| Reference:
             | 
8. Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992). | 
| Reference:
             | 
9. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993). | 
| Reference:
             | 
10. Sojka, P.: An Experience with Building Digital Open Access Repository DML-CZ. In: Proceedings of CASLIN 2009. pp. 74–78 (2009). | 
| Reference:
             | 
11. Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. (2006). | 