First Workshop on Computational Approaches to Code Switching
Introduction
Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the inter-sentential, intra-sentential (mixing of words from multiple languages in the same utterance), and even morphological (mixing of morphemes) levels. CS presents serious challenges for language technologies, including parsing, Machine Translation (MT), automatic speech recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when input is mixed with another. Even for problems considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed language present.
CS is pervasive in informal text communications such as newsgroups, tweets, blogs, and other social media of multilingual communities. Such genres are increasingly being studied as rich sources of social, commercial, and political information. Beyond the challenges that informal genres already pose in a monolingual setting, CS adds another significant layer of complexity to processing the data. Efficiently and robustly processing CS data presents a new frontier for NLP algorithms at all levels. This workshop aims to bring together researchers interested in the problem and to raise awareness in the community at large of viable approaches for reducing the complexity of the phenomenon.
The workshop invites contributions from researchers working on NLP approaches for the analysis and/or processing of mixed-language data, especially with a focus on intra-sentential code switching. Topics of relevance to the workshop include the following:
- Development of linguistic resources to support research on code switched data
- NLP approaches for language identification in code switched data
- NLP techniques for the syntactic analysis of code switched data
- Domain/dialect/genre adaptation techniques applied to code switched data processing
- Language modeling approaches to code switched data processing
- Crowdsourcing approaches for the annotation of code switched data
- Machine translation approaches for code switched data
- Position papers discussing the challenges of code switched data to NLP techniques
- Methods for improving ASR in code switched data
- Survey papers of NLP research for code switched data
- Sociolinguistic aspects of code switching
- Sociopragmatic aspects of code switching
Shared Task: Language Identification in Code-Switched (CS) Data
You thought language identification was a solved problem? Think again! Recent research has shown that fine-grained language identification is still a challenge, and is particularly error prone when the spans of text are small. Now imagine you have more than one language in those small text spans! We are organizing a shared task on language identification in CS data. The goal is to allow participants to explore unsupervised and supervised approaches to word-level language identification in code-switched data. We will release a small gold-standard dataset for tuning systems in four language pairs: Spanish-English, Modern Standard Arabic-Arabic dialects, Mandarin-English, and Nepali-English.
Task Definition
For each word in the source, identify whether it is Lang1, Lang2, Mixed, Other, Ambiguous, or NE (named entities, i.e., proper names referring to people, places, organizations, movie titles, song titles, etc.). For more details, please see the annotation guidelines for Spanish-English. The focus of the task is on microblog data, so we will use Twitter as the source of data, although each language combination will also have data from a "surprise genre" as additional test data.
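For illustration, a hypothetical Spanish-English tweet might be labeled as follows (this example is ours, not drawn from the released data):

    Me        Lang2
    encanta   Lang2
    la        Lang2
    new       Lang1
    song      Lang1
    de        Lang2
    Shakira   NE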
Participants in this shared task will be required to submit the output of their systems following the schedule proposed below in order to qualify for evaluation under the shared task. They will also be required to submit a paper describing their system.
Since we are using Twitter data, we follow the now-standard procedure for releasing labeled Twitter data that other researchers have used: we release character offsets together with the label information, and participants can use their own scripts or download our Python script to collect the tweets directly from Twitter.
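As a rough sketch of how such a release can be consumed, the Python snippet below assumes each line of an offsets file is tab-separated as tweet ID, user ID, start offset, end offset, and label; the exact column layout is an assumption, so check the released annotation guidelines:

    import csv
    from collections import defaultdict

    def read_offsets(path):
        # Yield (tweet_id, user_id, start, end, label) tuples, assuming
        # one tab-separated record per line (the column layout is an
        # assumption; check the released annotation guidelines).
        with open(path, encoding="utf-8") as f:
            for tweet_id, user_id, start, end, label in csv.reader(f, delimiter="\t"):
                yield tweet_id, user_id, int(start), int(end), label

    # Group the labeled spans by tweet so they can be aligned with the
    # raw text once the tweets have been downloaded.
    spans = defaultdict(list)
    for tweet_id, user_id, start, end, label in read_offsets("offsets.tsv"):
        spans[tweet_id].append((start, end, label))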
Please join our Google group to receive announcements and other relevant information for the workshop: codeswitching_workshop@googlegroups.com
To register your team please follow this link: Registration Form
Data Release
The script to crawl Twitter data is this one: twitter. You will need to have Beautiful Soup installed for this Python script to work.
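If you prefer to write your own crawler, the sketch below illustrates the HTML-scraping approach in Python; the URL pattern and the 'tweet-text' CSS class are assumptions about Twitter's page structure at the time and may need adjusting:

    import requests
    from bs4 import BeautifulSoup

    def fetch_tweet_text(user, tweet_id):
        # Fetch a tweet's public page and extract its text. The URL
        # pattern and the 'tweet-text' class are assumptions and may
        # need updating if Twitter changes its markup.
        url = "https://twitter.com/%s/status/%s" % (user, tweet_id)
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        node = soup.find("p", class_="tweet-text")
        return node.get_text() if node else None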
A second method to crawl Twitter data using the Twitter API is also available: Twitter via API. You will need to have the Launchy gem for Ruby installed, which can be done via 'gem install launchy' on the command line. You will also need a Twitter account to authenticate with the application.
For the Arabic and English-Spanish tweets, packages are available that retrieve, tokenize, and synchronize the tags for the training data: Arabic Tweets Token Assigner and English-Spanish Tweets Token Assigner. Instructions on how to use the packages are included.
The Spanish-English tweets were tokenized using the CMU ARK Twitter Part-of-Speech Tagger v0.3 (ignoring the parts of speech), with some later adjustments made using the TweetTokenizer Perl module. The ARK Twitter tokenizer expects an entire tweet on one line, so first run the onelineFile() subroutine on your file, then feed the output into the tokenizeFile() subroutine, which runs the tokenizer and applies the adjustments. You will need to change the tokenizer-location global variable in the module to point to your local installation.
- Nepali-English Trial data (20 tweets)
- Spanish-English Trial data (20 tweets)
- Mandarin-English Trial data (20 tweets)
- Modern Standard Arabic-Arabic dialects Trial data (20 tweets)
- Spanish-English Training data (11,400 tweets)
- Nepali-English Training data (9,993 tweets, updated 16th July, 2014)
- Modern Standard Arabic-Arabic dialects Training data (5,838 tweets)
- Mandarin-English Training data (1,000 tweets)
The task will be evaluated using the script and calculation library given here. The script is run on your system's offset file and the test offset file, and produces a variety of evaluation metrics at the tweet and token level. See the documentation inside the script for more details. Keep the directory structure within the Evaluation package the same for the evaluateOffsets.pl script to work properly.
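The official numbers come from evaluateOffsets.pl; purely as an illustration of what token-level scoring involves, the Python sketch below computes accuracy and per-label recall for two aligned label sequences (one label per token, in the same order):

    from collections import Counter

    def token_metrics(gold, pred):
        # gold and pred are aligned lists of labels, one per token.
        assert len(gold) == len(pred)
        correct = sum(g == p for g, p in zip(gold, pred))
        totals = Counter(gold)
        hits = Counter(g for g, p in zip(gold, pred) if g == p)
        recall = {label: hits[label] / totals[label] for label in totals}
        return correct / len(gold), recall

    acc, recall = token_metrics(
        ["Lang1", "Lang2", "NE", "Lang1"],
        ["Lang1", "Lang2", "Lang1", "Lang1"],
    )  # acc = 0.75, recall["NE"] = 0.0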
The training and test data have been run through two benchmark systems to give a better idea of performance goals. The systems are a simple lexical-ID approach using the training data and an off-the-shelf system, LangID, using large amounts of monolingual tweet data (Ben King and Steven Abney. Labeling the languages of words in mixed-language documents. In Proceedings of NAACL-HLT 2013, Atlanta.). The results for these benchmark systems (obtained using the evaluation script) are provided below.
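As a point of reference, a lexical-ID baseline of the kind described above can be approximated in a few lines: label each test token with its most frequent label in the training data, backing off to the overall majority label for unseen words. This is a sketch, not the organizers' exact implementation:

    from collections import Counter, defaultdict

    def train_lexical_id(pairs):
        # pairs: iterable of (token, label) from the training data.
        counts = defaultdict(Counter)
        overall = Counter()
        for token, label in pairs:
            counts[token.lower()][label] += 1
            overall[label] += 1
        table = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
        fallback = overall.most_common(1)[0][0]
        return table, fallback

    def predict(tokens, table, fallback):
        # Look each token up in the table; unseen tokens get the
        # majority label from the training data.
        return [table.get(t.lower(), fallback) for t in tokens]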
The shared task has now begun. The test data may be found below. Remember that the task window closes on July 27th.
- Spanish-English Test data (3,060 tweets)
- Nepali-English Test data (3,018 tweets)
- Modern Standard Arabic-Arabic dialects Test data (2,363 tweets)
- Mandarin-English Test data (316 tweets)
- Modern Standard Arabic-Arabic Dialects Second Test data (1,777 tweets)
For Spanish-English, Nepali-English, and Modern Standard Arabic-Arabic dialects, "surprise genre" datasets have been provided. These datasets consist of data from Facebook, blogs, and Arabic commentaries. Because the data comes from different social media sources, the ID format varies from file to file. Unlike with Twitter, you will not be given a way to crawl the data for the raw posts; instead, each file contains the token referenced by the offsets.
- Spanish-English "Suprise Genre" Test data (1,103 tokens)
- Nepali-English "Suprise Genre" Test data (1,087 tokens)
- Modern Standard Arabic-Arabic dialects "Suprise Genre" Test data (12,018 tokens)
Additional "surprise genre" data has been added for Spanish-English
and Nepali-English as of 8/10/14.
**UPDATED 8/10/14**
To submit your results, please add the label, separated by a tab, at the end of each row of the provided test data file and submit it to coral.at.uab@gmail.com. Please do not change the order of the rows and do not add extra newlines.
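A minimal sketch of producing a submission file under these rules (rows as released, with one system label per row, in order):

    def write_submission(test_path, labels, out_path):
        # Append a tab-separated label to each row of the provided test
        # file, preserving row order and adding no extra newlines.
        with open(test_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line, label in zip(fin, labels):
                fout.write(line.rstrip("\n") + "\t" + label + "\n")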
Important Dates
- Trial data release: March 12, 2014
- Training data release: April 30, 2014
- Task window: July 21-27, 2014
- Results posted: August 8, 2014
- Second Task window: August 13-17, 2014
- Second Task Results posted: August 18, 2014
- Workshop paper submission deadline: July 29, 2014
- Task paper submission deadline: September 1, 2014
- Notification for workshop papers: August 26, 2014
- Notification for task papers: September 5, 2014
- Camera-ready submission deadline (workshop and task papers): September 12, 2014
- Workshop Day: October 25, 2014
Submissions
Papers should be nine pages in length, with up to two additional pages for references. Please follow the ACL format: http://www.cs.jhu.edu/ACL2014/CallforPapers.htm. You can also download the style files below:
- LaTeX
- MS-Word
Please follow this link to make a new submission: https://www.softconf.com/emnlp2014/CodeSwitch
Results
To view the results please follow these links: Results of Twitter data, Results of surprise data.
Organizing Committee
- Mona Diab
- Associate Professor
- Department of Computer Science
- George Washington University
- mtdiab@email.gwu.edu
- Pascale Fung
- Professor
- Department of Electronic and Computer Engineering
- Hong Kong University of Science and Technology
- pascale@ece.ust.hk
- Julia Hirschberg
- Professor and Chair
- Department of Computer Science
- Columbia University
- julia@cs.columbia.edu
- Thamar Solorio
- Associate Professor
- Department of Computer Science
- University of Houston
- solorio@cs.uh.edu
Program Committee
- Steven Abney, University of Michigan
- Laura Alonso i Alemany, Universidad Nacional de Córdoba
- Rakesh Bhatt, University of Illinois at Urbana-Champaign
- Elabbas Benmamoun, University of Illinois at Urbana-Champaign
- Agnes Bolonyai, NC State University
- Barbara Bullock, University of Texas at Austin
- Suzanne Dikker, New York University
- Yang Liu, University of Texas at Dallas
- Aravind Joshi, University of Pennsylvania
- Ben King, University of Michigan
- Raymond Mooney, University of Texas at Austin
- Chilin Shih, University of Illinois at Urbana-Champaign
- Jacqueline Toribio, University of Texas at Austin
- Omar Zaidan, Johns Hopkins University
- Rabih Zbib, BBN Technologies
- Owen Rambow, Columbia University
- Constantine Lignos, University of Pennsylvania
- Cecilia Montes-Alcalá, Georgia Institute of Technology
- Nizar Habash, Columbia University
- Mitchell P. Marcus, University of Pennsylvania
- Yves Scherrer, Université de Genève
- Borja Navarro Colorado, Universidad de Alicante
- Björn Gambäck, Norwegian University of Science and Technology
- Amitava Das, University of North Texas
Contact
- Thamar Solorio
- Associate Professor
- Department of Computer Science
- University of Houston
- solorio@cs.uh.edu