Paper: Building Very Large Corpus Containing Useful Rich Materials for Language Learning from Closed Caption TV

  1. Hajime Mochizuki, Tokyo University of Foreign Studies, Japan
  2. Kohji Shibano, Tokyo University of Foreign Studies, Japan
Tuesday, October 28 2:00-2:30 PM Edgewood

Abstract: This paper describes in detail a very large spoken language corpus constructed from closed caption TV data. Between January 2013 and June 2014, we collected closed caption data from over 70,000 TV programs. The corpus now contains over 280 million morphemes. Once we obtain a larger corpus, we will be able to use it as a language resource in various types of research. Because TV is a major medium in daily life, we expect to be able to apply the corpus to language education, for example in an e-learning system. We also ...