The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Held at the Portland Marriott Downtown Waterfront in
Portland, Oregon, USA, June 19-24, 2011


Web Search Queries as a Corpus

PRESENTER: Marius Pasca

ABSTRACT:

Web search queries are often little more than short, keyword-based approximations of underspecified information needs. But as noisy and unreliable as they may be, queries indirectly convey knowledge just as they request knowledge. Indeed, queries specify constraints, even if as brittle as the mere presence of an additional keyword or phrase, that loosely describe what knowledge is being requested. In the process of asking "how many calories are burned during skiing", one implies that skiing may be an activity during which the body consumes calories, whereas the more condensed "amg latest album" still suggests that amg may be a musician or band, even to someone unfamiliar with the respective topics. As such, search queries are cursory reflections of knowledge encoded deeply within unstructured and structured content available in documents on the Web and elsewhere.

The notion that inherently noisy Web search queries may collectively serve as a text corpus is an intriguing alternative to using document corpora. This tutorial gives an overview of the characteristics of, and types of knowledge available in, queries as a corpus. It reviews recently developed methods for extracting such knowledge. Considering the building blocks that would contribute towards the automatic construction of knowledge bases, queries lend themselves as a useful data source in the acquisition of classes of instances (e.g., palo alto, santa barbara, twentynine palms), where the classes are unlabeled or labeled (e.g., california cities), possibly organized as hierarchies of search intents; as well as relations, including class attributes (e.g., population density, mayor).
The tutorial covers characteristics of search queries, when considered as an input data source in open-domain information extraction, and their impact on extraction methods operating over queries as opposed to documents; types of knowledge for which queries lend themselves as a useful data source in information extraction; detailed methods for extracting classes, instances and relations from queries; and implications for semantic annotation of queries, understanding query intent, and information access and retrieval in general.

OUTLINE

* Introduction
  - Overview of knowledge acquisition from text
  - Goals of open-domain information extraction
  - Extraction from documents vs. queries
* Queries as a corpus
  - Intrinsic aspects: distribution, lexical structure
  - Extrinsic aspects: temporality, demographics
  - Beyond individual queries: sessions, clicks
* Methods for knowledge acquisition from queries
  - Extraction of instances and classes
  - Extraction of attributes and relations
* Discussion
  - Implications and limitations
  - Applications

PRESENTER BIO

Marius Pasca
Google Inc.
1600 Amphitheatre Parkway
Mountain View, California 94043
Email: mars(-at-)google.com

Marius Pasca is a research scientist at Google. He graduated with a Ph.D. degree in Computer Science from Southern Methodist University, Dallas, Texas and an M.Sc. degree in Computer Science from Joseph Fourier University, Grenoble, France. He is the author of the book "Open-domain question answering from large text collections". He served on the program committees of ACL, IJCAI, WWW, SIGIR, HLT, EMNLP, NAACL and AAAI, including area co-chair positions at HLT-06, CIKM-08 and EMNLP-09. Current research interests include factual information extraction from unstructured text within documents and queries, and its applications to Web search.



acl2011.conference@gmail.com   ♦   Oregon Health & Science University