The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Held at the Portland Marriott Downtown Waterfront in
Portland, Oregon, USA, June 19-24, 2011
Web Search Queries as a Corpus
PRESENTER: Marius Pasca
ABSTRACT:
Web search queries are often little more than short, keyword-based
approximations of underspecified information needs. But as noisy
and unreliable as they may be, queries indirectly convey knowledge
just as they request knowledge. Indeed, queries specify constraints,
even if as brittle as the mere presence of an additional keyword or
phrase, that loosely describe what knowledge is being requested. In the
process of asking "how many calories are burned during skiing", one
implies that skiing may be an activity during which the body consumes
calories, whereas the more condensed "amg latest album" still suggests
that amg may be a musician or band, even to someone unfamiliar with the
respective topics. As such, search queries are cursory reflections of
knowledge encoded deeply within unstructured and structured content
available in documents on the Web and elsewhere.
The notion that inherently noisy Web search queries may collectively
serve as a text corpus is an intriguing alternative to using document
corpora. This tutorial gives an overview of the characteristics of,
and types of knowledge available in, queries as a corpus. It reviews
recently developed methods for extracting such knowledge.
Considering the building blocks that would contribute towards the automatic
construction of knowledge bases, queries lend themselves as a useful
data source in the acquisition of classes of instances (e.g., palo
alto, santa barbara, twentynine palms), where the classes are unlabeled
or labeled (e.g., california cities), possibly organized as hierarchies
of search intents; as well as relations, including class attributes
(e.g., population density, mayor).
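
As a rough illustration of how class attributes might surface in queries,
the toy sketch below counts candidate attributes for one class of instances.
It is not the method presented in the tutorial: the "<attribute> of
<instance>" pattern, the sample query list, and the function name
candidate_attributes are all assumptions made for this example, and the
class instances are assumed to be known in advance.

```python
import re
from collections import Counter

# Hypothetical toy query log; real query logs are far larger and noisier.
QUERIES = [
    "population density of palo alto",
    "mayor of palo alto",
    "mayor of santa barbara",
    "population density of twentynine palms",
    "weather in santa barbara",
    "mayor of springfield",
]

# Instances of one class (california cities), assumed to be given.
CALIFORNIA_CITIES = {"palo alto", "santa barbara", "twentynine palms"}

# Assumed extraction pattern: "<attribute> of <instance>".
PATTERN = re.compile(r"^(?P<attribute>.+?) of (?P<instance>.+)$")


def candidate_attributes(queries, instances):
    """Count how often each candidate attribute is asked about
    for an instance of the given class."""
    counts = Counter()
    for query in queries:
        match = PATTERN.match(query.strip().lower())
        # Keep only queries whose instance belongs to the class.
        if match and match.group("instance") in instances:
            counts[match.group("attribute")] += 1
    return counts


if __name__ == "__main__":
    counts = candidate_attributes(QUERIES, CALIFORNIA_CITIES)
    for attribute, frequency in counts.most_common():
        print(attribute, frequency)
```

Queries that do not match the pattern, or that mention an instance outside
the class, are simply skipped. A realistic system would need many more
patterns and some form of noise filtering.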
The tutorial covers characteristics of search queries, when considered
as an input data source in open-domain information extraction, and
their impact on extraction methods operating over queries as opposed
to documents; types of knowledge for which queries lend themselves as
a useful data source in information extraction; detailed methods for
extracting classes, instances and relations from queries; and
implications in semantic annotation of queries, understanding query
intent, and information access and retrieval in general.
OUTLINE
* Introduction
- Overview of knowledge acquisition from text
- Goals of open-domain information extraction
- Extraction from documents vs. queries
* Queries as a corpus
- Intrinsic aspects: distribution, lexical structure
- Extrinsic aspects: temporality, demographics
- Beyond individual queries: sessions, clicks
* Methods for knowledge acquisition from queries
- Extraction of instances and classes
- Extraction of attributes and relations
* Discussion
- Implications and limitations
- Applications
PRESENTER BIO
Marius Pasca
Google Inc.
1600 Amphitheatre Parkway
Mountain View, California 94043
Email: mars(-at-)google.com
Marius Pasca is a research scientist at Google. He graduated with a
Ph.D. degree in Computer Science from Southern Methodist University,
Dallas, Texas and an M.Sc. degree in Computer Science from Joseph Fourier
University, Grenoble, France. He is the author of the book "Open-domain
question answering from large text collections". He served on the program
committees of ACL, IJCAI, WWW, SIGIR, HLT, EMNLP, NAACL and AAAI,
including area co-chair positions at HLT-06, CIKM-08 and EMNLP-09. Current
research interests include factual information extraction from unstructured
text within documents and queries, and its applications to Web search.