The Digitization Process
The Nikkei Newspapers Digital Archive (NNDA) team began with the vision of making the newspapers “keyword searchable.” A common technology for converting physical documents into electronic media is Optical Character Recognition (OCR), which turns scanned page images into electronic text that can be searched.
Many newspapers, most notably The New York Times, The Times of London, and, locally, The Seattle Times, have been digitized by commercial vendors. With OCR, the news pages are scanned and specialized software “recognizes” the words or characters on them.
The team conducted initial testing using OCR software in both English and Japanese and found that the accuracy rate was too low, primarily because of the lack of clarity typical of poor-quality newsprint, but also because of the complexity of Japanese text.
In general, OCR is difficult to perform on Japanese text for the following reasons: 1) Japanese text is written in a combination of phonetic characters, known as Hiragana and Katakana, and ideographic characters known as Kanji, of which there are well over 3,000; 2) words in Japanese text are not separated by delimiters such as spaces, which makes computerized character recognition difficult; and 3) many Japanese characters have similar shapes, which adds to the complexity of character recognition.
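The first two points can be illustrated with a minimal Python sketch. It is not part of the NNDA workflow; the function names and the sample sentence are made up for illustration, and the code point ranges are the standard Unicode blocks for Hiragana, Katakana, and the main set of CJK ideographs (rarer Kanji fall outside that range and would be counted as "other" here).

```python
# Illustrative sketch only: label each character of a Japanese string by script,
# using Unicode code point ranges. Names and sample text are hypothetical.

def script_of(ch: str) -> str:
    """Return a rough script label for a single character."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (main block only)
        return "kanji"
    return "other"

def script_counts(text: str) -> dict:
    """Count how many characters of each script appear in the text."""
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0, "other": 0}
    for ch in text:
        counts[script_of(ch)] += 1
    return counts

# Made-up sample sentence mixing all three scripts, written without spaces.
sample = "北米時事はシアトルの新聞です"

print(script_counts(sample))
print("contains spaces:", " " in sample)
```

Even this short sentence switches between Kanji, Hiragana, and Katakana several times and contains no spaces, so software has no simple delimiter to tell it where one word ends and the next begins.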
Another challenge particular to this project is that the newspaper was published in ‘Old’ Kanji characters (Kyu Kanji or 旧漢字), which use many more subtle and intricate strokes and make recognition even harder for OCR software. NNDA translators struggled to identify the Old Kanji characters and decipher old grammatical syntax, which required considerable time and effort.
Instead of using OCR, the team has been researching and testing other approaches to making the newspaper content accessible to the broadest possible readership.
The current website features digitized pages of the two newspapers. While ‘keyword searches’ are not possible, the pages are accessible and can be read by Japanese-language readers.
To increase access for English readers, the team has manually created English translations of a sampling of front pages from early issues of the Hokubei Jiji/North American Times (see “Translated Pages”). These translations provide a concise overview of the most significant stories, displayed graphically in a way that mimics the column structure of the front page of the newspaper.
Das, S., & Banerjee, S. (2014, January 1). Survey of Pattern Recognition Approaches in Japanese Character Recognition. Retrieved from http://www.ijcsit.com/docs/Volume5/vol5issue01/ijcsit2014050120.pdf
Hokubei Hochi Foundation