TR94-07

Deterministic Part-of-Speech Tagging with Finite State Transducers


    •  Emmanuel Roche, Yves Schabes, "Deterministic Part-of-Speech Tagging with Finite State Transducers", Tech. Rep. TR94-07, Mitsubishi Electric Research Laboratories, Cambridge, MA, May 1994.
      @techreport{MERL_TR94-07,
        author = {Emmanuel Roche and Yves Schabes},
        title = {Deterministic Part-of-Speech Tagging with Finite State Transducers},
        institution = {MERL - Mitsubishi Electric Research Laboratories},
        address = {Cambridge, MA 02139},
        number = {TR94-07},
        month = may,
        year = 1994,
        url = {https://www.merl.com/publications/TR94-07/}
      }
Abstract:

Stochastic approaches to natural language processing have often been preferred to rule-based approaches because of their robustness and their automatic training capabilities. This was the case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can be achieved by inferring a rule-based part-of-speech tagger from a training corpus. However, current implementations of Brill's tagger run more slowly than previous approaches. In this paper, we present a finite-state tagger inspired by Brill's work which operates in optimal time, in the sense that the time to assign tags to a sentence corresponds to the time required to deterministically follow a single path in a deterministic finite state machine. This result is achieved by encoding the application of the rules found in Brill's tagger as a non-deterministic finite state transducer and then turning it into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger whose speed is dominated by the access time of mass storage devices.
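
As a rough illustration of what "following a single path in a deterministic finite state machine" can look like, the Python sketch below hand-codes a two-state deterministic transducer for one contextual rule: change the tag vbn to vbd when the previous tag is np. The rule, the tag names, the state names, and the function apply_transducer are assumptions made for this example and are not taken from the report. The report derives its machine by encoding the tagger's rules as a non-deterministic transducer and then determinizing it, which this sketch does not attempt.

```python
# Minimal sketch of a deterministic transducer applied to a tag sequence.
# Assumed rule (hypothetical, for illustration only):
#   change tag "vbn" to "vbd" when the previous tag is "np".

START, AFTER_NP = 0, 1  # hypothetical state names

# (state, input_tag) -> (output_tag, next_state).
# Any pair not listed copies the input tag and returns to START.
TRANSITIONS = {
    (START, "np"): ("np", AFTER_NP),
    (AFTER_NP, "np"): ("np", AFTER_NP),
    (AFTER_NP, "vbn"): ("vbd", START),
}


def apply_transducer(tags):
    """Rewrite a tag sequence in one left-to-right pass.

    Each input tag costs exactly one table lookup, so the running time
    is linear in the sentence length and does not depend on how many
    rules were compiled into the table.
    """
    state = START
    output = []
    for tag in tags:
        out_tag, state = TRANSITIONS.get((state, tag), (tag, START))
        output.append(out_tag)
    return output


if __name__ == "__main__":
    initial_tags = ["np", "vbn", "by", "np", "vbn"]
    print(apply_transducer(initial_tags))
```

The design point the sketch tries to convey is that the rule's context check is folded into the machine's state: no token is examined twice, unlike applying each rule in a separate pass over the sentence.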