First Workshop on Computational Approaches to Code Switching
Introduction
Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the inter-sentential, intra-sentential (mixing of words from multiple languages in the same utterance), and even morphological (mixing of morphemes) levels. CS presents serious challenges for language technologies, including parsing, Machine Translation (MT), automatic speech recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when input is mixed with another. Even for problems considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed language present.
CS is pervasive in informal text communications such as newsgroups, tweets, blogs, and other social media of multilingual communities. Such genres are increasingly being studied as rich sources of social, commercial, and political information. Beyond the challenges that informal genres already pose in a monolingual setting, CS adds another significant layer of complexity to processing the data. Efficiently and robustly processing CS data presents a new frontier for NLP algorithms at all levels. This workshop aims to bring together researchers interested in the problem and to raise awareness in the community at large of viable approaches for reducing the complexity of the phenomenon.
The workshop invites contributions from researchers working on NLP approaches for the analysis and/or processing of mixed-language data, especially with a focus on intra-sentential code switching. Topics of relevance to the workshop include the following:
- Development of linguistic resources to support research on code switched data
- NLP approaches for language identification in code switched data
- NLP techniques for the syntactic analysis of code switched data
- Domain/dialect/genre adaptation techniques applied to code switched data processing
- Language modeling approaches to code switched data processing
- Crowdsourcing approaches for the annotation of code switched data
- Machine translation approaches for code switched data
- Position papers discussing the challenges of code switched data to NLP techniques
- Methods for improving ASR in code switched data
- Survey papers of NLP research for code switched data
- Sociolinguistic aspects of code switching
- Sociopragmatic aspects of code switching
Shared Task: Language Identification in Code-Switched (CS) Data
You thought language identification was a solved problem? Think again! Recent research has shown that fine-grained language identification is still a challenge, and is particularly error prone when the spans of text are small. Now imagine you have more than one language in those small text spans! We are organizing a shared task on language identification in CS data. The goal is to allow participants to explore unsupervised and supervised approaches to word-level language identification in code-switched data. We will release a small gold-standard dataset for tuning systems in four language pairs: Spanish-English, Modern Standard Arabic-Arabic dialects, Mandarin-English, and Nepali-English.
Task Definition
For each word in the source, identify whether it is Lang1, Lang2, Mixed, Other, Ambiguous, or NE (named entities, i.e., proper names referring to people, places, organizations, movie titles, song titles, etc.). For more details, please see the annotation guidelines for Spanish-English. The focus of the task is on microblog data, so we will use Twitter as the source of data, although each language combination will also have data from a "surprise genre" as additional test data.
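For illustration, a hypothetical Spanish-English tweet might be labeled as follows (this example is ours, not drawn from the released data):

    Me        Lang2
    encanta   Lang2
    la        Lang2
    new       Lang1
    song      Lang1
    de        Lang2
    Shakira   NE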
Participants in this shared task will be required to submit the output of their systems following the schedule proposed below in order to qualify for evaluation under the shared task. They will also be required to submit a paper describing their system.
Since we are using Twitter data, we follow the now-standard procedure for releasing labeled Twitter data that other researchers have used: we release character offsets together with the label information, and participants can use their own scripts or download our Python script to collect the tweets directly from Twitter.
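As a rough sketch of how such a release can be consumed, the Python snippet below assumes each line of an offsets file is tab-separated as tweet ID, user ID, start offset, end offset, and label; the exact column layout is an assumption, so check the released annotation guidelines:

    import csv
    from collections import defaultdict

    def read_offsets(path):
        # Yield (tweet_id, user_id, start, end, label) tuples, assuming
        # one tab-separated record per line (the column layout is an
        # assumption; check the released annotation guidelines).
        with open(path, encoding="utf-8") as f:
            for tweet_id, user_id, start, end, label in csv.reader(f, delimiter="\t"):
                yield tweet_id, user_id, int(start), int(end), label

    # Group the labeled spans by tweet so they can be aligned with the
    # raw text once the tweets have been downloaded.
    spans = defaultdict(list)
    for tweet_id, user_id, start, end, label in read_offsets("offsets.tsv"):
        spans[tweet_id].append((start, end, label))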
Please join our Google group to receive announcements and other relevant information for the workshop: codeswitching_workshop@googlegroups.com
To register your team please follow this link: Registration Form
Data Release
The script to crawl Twitter data is this one: twitter. You will need to have Beautiful Soup installed for this Python script to work.
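If you prefer to write your own crawler, the sketch below illustrates the HTML-scraping approach in Python; the URL pattern and the 'tweet-text' CSS class are assumptions about Twitter's page structure at the time and may need adjusting:

    import requests
    from bs4 import BeautifulSoup

    def fetch_tweet_text(user, tweet_id):
        # Fetch a tweet's public page and extract its text. The URL
        # pattern and the 'tweet-text' class are assumptions and may
        # need updating if Twitter changes its markup.
        url = "https://twitter.com/%s/status/%s" % (user, tweet_id)
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        node = soup.find("p", class_="tweet-text")
        return node.get_text() if node else None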
A second method to crawl Twitter data using the Twitter API is also available: Twitter via API. You will need to have the Launchy gem for Ruby installed, which can be done via 'gem install launchy' on the command line. You will also need a Twitter account to authenticate with the application.
For the Arabic and English-Spanish tweets, packages are available that retrieve, tokenize, and synchronize the tags for the training data: Arabic Tweets Token Assigner and English-Spanish Tweets Token Assigner. Instructions on how to use the packages are included.
The Spanish-English tweets were tokenized using the CMU ARK Twitter Part-of-Speech Tagger v0.3 (ignoring the parts of speech), with some later adjustments made using the TweetTokenizer Perl module. The ARK Twitter tokenizer expects an entire tweet on one line, so first run the onelineFile() subroutine on your file, then feed the output into the tokenizeFile() subroutine, which runs the tokenizer and applies the adjustments. You will need to change the tokenizer-location global variable in the module to point to your local installation.
- Nepali-English Trial data (20 tweets)
- Spanish-English Trial data (20 tweets)
- Mandarin-English Trial data (20 tweets)
- Modern Standard Arabic-Arabic dialects Trial data (20 tweets)
- Spanish-English Training data (11,400 tweets)
- Nepali-English Training data (9,993 tweets, updated 16th July, 2014)
- Modern Standard Arabic-Arabic dialects Training data (5,838 tweets)
- Mandarin-English Training data (1,000 tweets)
The task will be evaluated using the script and calculation library given here. The script is run on your system's offset file and the test offset file, and produces a variety of evaluation metrics at the tweet and token level. See the documentation inside the script for more details. Keep the directory structure within the Evaluation package the same for the evaluateOffsets.pl script to work properly.
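The official numbers come from evaluateOffsets.pl; purely as an illustration of what token-level scoring involves, the Python sketch below computes accuracy and per-label recall for two aligned label sequences (one label per token, in the same order):

    from collections import Counter

    def token_metrics(gold, pred):
        # gold and pred are aligned lists of labels, one per token.
        assert len(gold) == len(pred)
        correct = sum(g == p for g, p in zip(gold, pred))
        totals = Counter(gold)
        hits = Counter(g for g, p in zip(gold, pred) if g == p)
        recall = {label: hits[label] / totals[label] for label in totals}
        return correct / len(gold), recall

    acc, recall = token_metrics(
        ["Lang1", "Lang2", "NE", "Lang1"],
        ["Lang1", "Lang2", "Lang1", "Lang1"],
    )  # acc = 0.75, recall["NE"] = 0.0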
The training and test data have been run through two benchmark systems to give a better idea of performance goals. The systems are a simple lexical-ID approach using the training data and an off-the-shelf system, LangID, using large amounts of monolingual tweet data (Ben King and Steven Abney. Labeling the languages of words in mixed-language documents. In Proceedings of NAACL-HLT 2013, Atlanta.). The results for these benchmark systems (obtained using the evaluation script) are provided below.
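As a point of reference, a lexical-ID baseline of the kind described above can be approximated in a few lines: label each test token with its most frequent label in the training data, backing off to the overall majority label for unseen words. This is a sketch, not the organizers' exact implementation:

    from collections import Counter, defaultdict

    def train_lexical_id(pairs):
        # pairs: iterable of (token, label) from the training data.
        counts = defaultdict(Counter)
        overall = Counter()
        for token, label in pairs:
            counts[token.lower()][label] += 1
            overall[label] += 1
        table = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
        fallback = overall.most_common(1)[0][0]
        return table, fallback

    def predict(tokens, table, fallback):
        # Look each token up in the table; unseen tokens get the
        # majority label from the training data.
        return [table.get(t.lower(), fallback) for t in tokens]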
The shared task has now begun. The test data may be found below. Remember that the task window closes on July 27th.
- Spanish-English Test data (3,060 tweets)
- Nepali-English Test data (3,018 tweets)
- Modern Standard Arabic-Arabic dialects Test data (2,363 tweets)
- Mandarin-English Test data (316 tweets)
- Modern Standard Arabic-Arabic Dialects Second Test data (1,777 tweets)
For Spanish-English, Nepali-English, and Modern Standard Arabic-Arabic dialects, "surprise genre" datasets have been provided. These datasets consist of data from Facebook, blogs, and Arabic commentaries. Because the data comes from different social media sources, the ID format varies from file to file. Unlike with Twitter, you will not be given a way to crawl the data for the raw posts; instead, each file contains the token referenced by the offsets.
- Spanish-English "Suprise Genre" Test data (1,103 tokens)
- Nepali-English "Suprise Genre" Test data (1,087 tokens)
- Modern Standard Arabic-Arabic dialects "Suprise Genre" Test data (12,018 tokens)
Additional "surprise genre" data has been added for Spanish-English
and Nepali-English as of 8/10/14.
**UPDATED 8/10/14**
To submit your results, please add the label, separated by a tab, at the end of each row of the provided test data file and submit it to coral.at.uab@gmail.com. Please do not change the order of the rows and do not add extra newlines.
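A minimal sketch of producing a submission file under these rules (rows as released, with one system label per row, in order):

    def write_submission(test_path, labels, out_path):
        # Append a tab-separated label to each row of the provided test
        # file, preserving row order and adding no extra newlines.
        with open(test_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line, label in zip(fin, labels):
                fout.write(line.rstrip("\n") + "\t" + label + "\n")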
Important Dates
- Trial data release: March 12, 2014
- Training data release: April 30, 2014
- Task window: July 21-27, 2014
- Results posted: August 8, 2014
- Second Task window: August 13-17, 2014
- Second Task Results posted: August 18, 2014
- Workshop paper submission deadline: July 29, 2014
- Task paper submission deadline: September 1, 2014
- Notification for workshop papers: August 26, 2014
- Notification for task papers: September 5, 2014
- Camera-ready submission deadline (workshop and task papers): September 12, 2014
- Workshop Day: October 25, 2014
Submissions
Papers should be nine pages in length, with up to two additional pages for references. Please follow the ACL format: http://www.cs.jhu.edu/ACL2014/CallforPapers.htm. You can also download the style files below:
- LaTeX
- MS-Word
Please follow this link to make a new submission: https://www.softconf.com/emnlp2014/CodeSwitch
Results
To view the results please follow these links: Results of Twitter data, Results of surprise data.
Organizing Committee
- Mona Diab
- Associate Professor
- Department of Computer Science
- George Washington University
- mtdiab@email.gwu.edu
- Pascale Fung
- Professor
- Department of Electronic and Computer Engineering
- Hong Kong University of Science and Technology
- pascale@ece.ust.hk
- Julia Hirschberg
- Professor and Chair
- Department of Computer Science
- Columbia University
- julia@cs.columbia.edu
- Thamar Solorio
- Associate Professor
- Department of Computer Science
- University of Houston
- solorio@cs.uh.edu
Program Committee
- Steven Abney, University of Michigan
- Laura Alonso i Alemany, Universidad Nacional de Córdoba
- Rakesh Bhatt, University of Illinois at Urbana-Champaign
- Elabbas Benmamoun, University of Illinois at Urbana-Champaign
- Agnes Bolonyai, NC State University
- Barbara Bullock, University of Texas at Austin
- Suzanne Dikker, New York University
- Yang Liu, University of Texas at Dallas
- Aravind Joshi, University of Pennsylvania
- Ben King, University of Michigan
- Raymond Mooney, University of Texas at Austin
- Chilin Shih, University of Illinois at Urbana-Champaign
- Jacqueline Toribio, University of Texas at Austin
- Omar Zaidan, Johns Hopkins University
- Rabih Zbib, BBN Technologies
- Owen Rambow, Columbia University
- Constantine Lignos, University of Pennsylvania
- Cecilia Montes-Alcalá, Georgia Institute of Technology
- Nizar Habash, Columbia University
- Mitchell P. Marcus, University of Pennsylvania
- Yves Scherrer, Université de Genève
- Borja Navarro Colorado, Universidad de Alicante
- Björn Gambäck, Norwegian University of Science and Technology
- Amitava Das, University of North Texas
Contact
- Thamar Solorio
- Associate Professor
- Department of Computer Science
- University of Houston
- solorio@cs.uh.edu