TR94-07

Deterministic Part-of-Speech Tagging with Finite State Transducers

- Emmanuel Roche, Yves Schabes, "Deterministic Part-of-Speech Tagging with Finite State Transducers", Tech. Rep. TR94-07, Mitsubishi Electric Research Laboratories, Cambridge, MA, May 1994.
  BibTeX:
  @techreport{MERL_TR94-07,
    author = {Roche, Emmanuel and Schabes, Yves},
    title = {Deterministic Part-of-Speech Tagging with Finite State Transducers},
    institution = {MERL - Mitsubishi Electric Research Laboratories},
    address = {Cambridge, MA 02139},
    number = {TR94-07},
    month = may,
    year = 1994,
    url = {https://www.merl.com/publications/TR94-07/}
  }

Abstract:

Stochastic approaches to natural language processing have often been preferred to rule-based approaches because of their robustness and their automatic training capabilities. This was the case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can be achieved by inferring a rule-based part-of-speech tagger from a training corpus. However, current implementations of Brill's tagger run more slowly than previous approaches. In this paper, we present a finite-state tagger, inspired by Brill's work, which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to deterministically follow a single path in a deterministic finite state machine. This result is achieved by encoding the application of the rules found in Brill's tagger as a non-deterministic finite state transducer and then turning it into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger whose speed is dominated by the access time of mass storage devices.
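
To make the "single path in a deterministic machine" idea concrete, here is a minimal Python sketch. It is not the report's implementation: the rule (change tag `vbn` to `vbd` when the previous tag is `np`), the tag names, the `OTHER` placeholder, and the function `apply_transducer` are all assumptions chosen for illustration, and the transition table is written by hand rather than produced by determinizing a non-deterministic transducer.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical contextual rule, for illustration only:
#   change tag "vbn" to "vbd" when the previous tag is "np".
OTHER = "<other>"  # assumed placeholder for any tag without an explicit arc

# (state, input tag) -> (next state, output tag).
# An output of None means "copy the input tag unchanged".
# State 0: previous tag was not "np". State 1: previous tag was "np".
TRANSITIONS: Dict[Tuple[int, str], Tuple[int, Optional[str]]] = {
    (0, "np"): (1, None),
    (0, OTHER): (0, None),
    (1, "np"): (1, None),
    (1, "vbn"): (0, "vbd"),
    (1, OTHER): (0, None),
}


def apply_transducer(tags: List[str]) -> List[str]:
    """Follow exactly one transition per input tag, emitting one output tag."""
    state = 0
    output: List[str] = []
    for tag in tags:
        # Fall back to the OTHER arc when the tag has no explicit arc.
        key = (state, tag) if (state, tag) in TRANSITIONS else (state, OTHER)
        state, emitted = TRANSITIONS[key]
        output.append(tag if emitted is None else emitted)
    return output


if __name__ == "__main__":
    print(apply_transducer(["np", "vbn", "vbn"]))
```

Because each input tag triggers exactly one table lookup, the running time of this sketch grows linearly with sentence length and does not depend on how many rules were compiled into the table, which is the property the abstract describes.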