Holley, Rose: How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs
http://www.dlib.org/dlib/march09/holley/03holley.html

Basic OCR correction by public users was implemented and tested in the prototype search system released to State and Territory Libraries for testing in December 2007. User correction of text was positively received, though most Libraries asked if and how moderation would take place. It was then implemented in the Beta search system (without moderation), which had a soft release to the public without any publicity on 25 July 2008. In the first three months of use (July - October 2008) the public immediately began correcting OCR. We have found it quite hard to monitor what they are doing, how well they are doing it, and how it is affecting the overall quality of the data, since moderation is not yet in place and login to do it is not mandatory (it is optional) at this stage. We also have had difficulties measuring the accuracy of the OCR-corrected text. We have three methods of measuring text correction: number of lines corrected, number of correction "transactions" (i.e., pressing the "save corrections" button), and number of different articles corrected. However, it is questionable how useful any of the three methods are. We are assuming that all correction transactions are to improve text and make it right. No extra text can be added, only existing lines corrected. No text has been deliberately incorrectly changed as far as we are aware.
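The three measures named above (lines corrected, correction "transactions", and distinct articles corrected) can be illustrated with a small sketch. It is not the National Library of Australia's implementation: the record layout `CorrectionTransaction`, its fields, and the function `summarize` are hypothetical names chosen for illustration, and counting each corrected line only once per article is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CorrectionTransaction:
    """One press of the "save corrections" button (hypothetical record layout)."""
    user: str                      # login name, or "anonymous" if not logged in
    article_id: str                # identifier of the corrected article
    line_numbers: Tuple[int, ...]  # lines changed in this save


def summarize(transactions: Iterable[CorrectionTransaction]) -> Dict[str, int]:
    """Compute the three counts: lines, transactions, distinct articles."""
    corrected_lines = set()  # (article_id, line_number) pairs, counted once (assumption)
    articles = set()
    transaction_count = 0
    for t in transactions:
        transaction_count += 1
        articles.add(t.article_id)
        for n in t.line_numbers:
            corrected_lines.add((t.article_id, n))
    return {
        "lines_corrected": len(corrected_lines),
        "transactions": transaction_count,
        "articles_corrected": len(articles),
    }


# Invented sample data, for illustration only.
log = [
    CorrectionTransaction("alice", "article-1", (1, 2)),
    CorrectionTransaction("anonymous", "article-1", (2, 3)),
    CorrectionTransaction("alice", "article-2", (1,)),
]
print(summarize(log))
```

None of these counts says whether a correction actually improved the text, which is the limitation the passage points out.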

In the first 12 weeks after the soft launch (without publicity), 868 registered users and approximately 390 unregistered users corrected text (a total of about 1,200 text correctors). Together they corrected 700,000 lines of text within 50,000 articles. The top text corrector has corrected 50,000 lines of text within nearly 2,000 individual articles. Some articles have had corrections added by more than seven users (e.g., articles in the first Australian newspaper, the 1803 Sydney Gazette). This particular issue in its entirety has had several different users working on corrections, because it is difficult to read and is an important newspaper.

User feedback returned via surveys, e-mails, phone calls and the "contact us" form has been overwhelmingly positive and interesting. Users did not expect to be able to correct OCR text. Once they discovered they could, they quickly took to the concept and method, and several reported finding correcting the text both addictive and rewarding. Users were actively correcting much more than they or we had expected to correct. In addition, our own users have the potential to achieve a 100% accuracy rate with their knowledge of English, history and context, whereas our contractors are only achieving an accuracy of 99.5% in the title headings.


See also
Holley, Rose (2009) Many Hands Make Light Work: Public Collaborative Text Correction in Australian Historic Newspapers. ISBN 978-0-642-27694-0. Available at http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf

Excerpt:

The Australian Newspapers beta service has clearly demonstrated that users want to engage and be involved with full text newspaper data in new and exciting ways. The use of web 2.0 technologies can enable this. Without publicity, 'how-to' tutorials or even a familiar and refined interface or concept, the service still rapidly harnessed an active group of users who are enthusiastically enhancing and improving the data by use of the text correction, tagging and comments functions. Users have demonstrated a willingness to work towards the 'common good', to volunteer their time, energy, skill, knowledge and ideas and to be involved long term in a program of national historic significance. The collaborative activity from this new community is enhancing the quality of the data and therefore the accuracy of full-text searching in a way that the National Library of Australia could never have achieved using its own resources alone.
 
