Empirical Methods in Natural Language Processing:
What's Happened Since the First SIGDAT Meeting?
Kenneth Ward Church
180 Park Ave
Florham Park, NJ 07932-0971
Kenneth Church is currently the head of a data mining department in AT&T Labs-Research. He received his BS, Master's and PhD from MIT in computer science in 1978, 1980 and 1983, and immediately joined AT&T Bell Labs, where he has been ever since. He has worked on many areas of computational linguistics including: acoustics, speech recognition, speech synthesis, OCR, phonetics, phonology, morphology, word-sense disambiguation, spelling correction, terminology, translation, lexicography, information retrieval, compression, language modeling and text analysis. He enjoys working with very large corpora such as the Associated Press newswire (1 million words per week). His data mining department is currently applying similar methods to much, much larger data sets such as telephone call detail (1-10 billion records per month).
The first workshop on Very Large Corpora was held just before the 1993 ACL meeting in Columbus, Ohio. The turnout was greater than anyone had predicted (or else we would have called the meeting a conference rather than a workshop). We knew that text analysis was a "hot area," but we didn't appreciate just how hot it would turn out to be.
The 1990s were witnessing a resurgence of interest in 1950s-style empirical and statistical methods of language analysis. Empiricism was at its peak in the 1950s, dominating a broad set of fields ranging from psychology (behaviorism) to electrical engineering (information theory). At that time, it was common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their co-occurrence with other words. Firth, a leading figure in British linguistics during the 1950s, summarized the approach with the memorable line: "You shall know a word by the company it keeps." Regrettably, interest in empiricism faded in the late 1950s and early 1960s with a number of significant events including Chomsky's criticism of n-grams in Syntactic Structures (Chomsky, 1957) and Minsky and Papert's criticism of neural networks in Perceptrons (Minsky and Papert, 1969).
Perhaps the most immediate reason for this empirical renaissance is the availability of massive quantities of data: text is available like never before. Just ten years earlier, the one-million word Brown Corpus (Francis and Kucera, 1982) was considered large, but these days, everyone has access to the web. Experiments are routinely carried out on many gigabytes of text. Some researchers are even working with terabytes.
The big difference since the first SIGDAT meeting in 1993 is that large corpora are now having a big impact on ordinary users. Web search engines and portals are an obvious example. Managing Gigabytes is not only the title of a popular book (Moffat, Bell and Witten, 1999), but it is something that ordinary users are beginning to take for granted. Recent progress in Information Retrieval and Digital Libraries was worth a fortune (when stock prices were at their peak). Speech Recognition and Machine Translation are also changing the world. If you walk into any software store these days, you will find a shelf full of speech recognition and machine translation products. And it is getting so you can't use the telephone these days without talking to a computer.