Competition to automate text recognition for printed Bangla books
The British Library’s Two Centuries of Indian Print project is running a competition at ICDAR 2019 for automated text recognition of rare and unique printed books written in Bangla, all of which have been digitised through the project.
This is the second time we are running this competition. Some of you may remember the Bangla printed books competition at ICDAR2017, which generated significant interest among academic institutions and technology providers both in India and across the world. The 2017 competition set the challenge of finding an optimal solution for automating recognition of Bangla printed text, and Google’s method performed best for both text detection and layout analysis.
Fast forward to 2019 and, thanks to Jadavpur University in Kolkata, we have added more ground truth transcriptions with which competition entrants can train their OCR systems. We hope that this second competition again attracts submissions using cutting-edge OCR methods, leading to a solution that can truly open up these historic books, dating between 1713 and 1914, for text mining. That would enable scholars of South Asian studies to explore hundreds of thousands of pages on a scale that has not been possible until now.
Image showing a transcribed page from one of the Bengali books featured in the ICDAR2019 competition.
We are collaborating with PRImA (Pattern Recognition & Image Analysis Research Lab), who will provide expert and objective evaluation of the OCR results produced through the competition. The final results will be revealed at the ICDAR2019 conference in Sydney in September 2019.
So if you missed out last time but are interested in testing your OCR systems on our books, or you want to try again, the competition is now open!
For instructions on how to apply and more about the competition, please visit https://www.primaresearch.org/REID2019/
This post is by Tom Derrick, Digital Curator for Two Centuries of Indian Print, British Library.
He is on Twitter as @TommyID83, and Two Centuries of Indian Print tweet from @BL_IndianPrint.