Code-switching Research Resources
This is a list of tutorials, workshops, papers, and resources on computational linguistic approaches to code-switching research. The list will be updated over time. You are welcome to send a pull request to update the list and become one of the contributors!
📌 I plan to collect theses and books on code-switching and list them here. If you have one, don't hesitate to contact me or send a pull request!
🚀 Highlights
- If you are new to code-switching or looking for a new research direction, we have written a comprehensive survey paper on code-switching: The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges [Paper]. Feel free to read it and let us know if you have any suggestions! Thanks to Alham Fikri Aji, Zheng-Xin Yong, and Thamar Solorio for making this possible 😊
- We are organizing the code-switching workshop at EMNLP 2023! [Website]
- We (I, Marina Zhukova, and Sudipta Kar) organized a birds-of-a-feather session at EMNLP 2022 in Abu Dhabi. Around 30 people joined (in person and online). Thanks for coming!
- 📔 There was a comprehensive tutorial about code-mixing by Microsoft Research (Monojit Choudhury, Kalika Bali, Anirudh Srinivasan, and Sandipan Dandapat) at EMNLP 2019. You can check the following link.
🏫 Workshops
This is the list of the code-switching workshop series:
- First Workshop on Computational Approaches to Code-switching, EMNLP 2014 [Website]
- Second Workshop on Computational Approaches to Code-switching, EMNLP 2016
- Third Workshop on Computational Approaches to Linguistic Code-switching, ACL 2018 [Website]
- Fourth Workshop on Computational Approaches to Linguistic Code-switching, LREC 2020 [Website]
- First Workshop on Speech Technologies for Code-switching in Multilingual Communities, Interspeech 2020 [Website]
- Fifth Workshop on Computational Approaches to Linguistic Code-switching, NAACL 2021 [Website]
- Sixth Workshop on Computational Approaches to Linguistic Code-switching, EMNLP 2023 [Website]
📑 Research Papers
Survey Paper
- Winata, et al. (2023) The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges. ACL Findings [Paper]
- Doğruöz, et al. (2021) A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies. ACL [Paper]
- Jose, et al. (2020) A Survey of Current Datasets for Code-Switching Research. International Conference on Advanced Computing and Communication Systems (ICACCS) [Paper]
- Sitaram, et al. (2019) A Survey of Code-switched Speech and Language Processing. Arxiv [Paper]
Large Language Models
- Yong, et al. (2023) Prompting Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages. Arxiv [Paper]
Language Identification and POS Tagging
- Ostapenko, et al. (2022) Speaker Information Can Guide Models to Better Inductive Biases: A Case Study On Predicting Code-Switching. ACL [Paper]
- Nguyen, et al. (2021) Automatic Language Identification in Code-Switched Hindi-English Social Media Text. Journal of Open Humanities Data [Paper]
- Tarunesh, et al. (2021) From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text. ACL [Paper]
- Gustavo Aguilar and Thamar Solorio. (2020) From English to Code-Switching: Transfer Learning with Strong Morphological Clues. ACL [Paper] [Code]
- Mager, et al. (2019) Subword-Level Language Identification for Intra-Word Code-Switching. NAACL [Paper]
- Zhang, et al. (2018) A Fast, Compact, Accurate Model for Language Identification of Codemixed Text. EMNLP [Paper]
- Kelsey Ball and Dan Garrette. (2018) Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification. EMNLP [Paper]
- Zeynep Yirmibesoglu and Gulsen Eryigit. (2018) Detecting Code-Switching between Turkish-English Language Pair. Workshop W-NUT, EMNLP [Paper]
- Mave, et al. (2018) Language Identification and Analysis of Code-Switched Social Media Text. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Victor Soto and Julia Hirschberg. (2018) Joint Part-of-Speech and Language ID Tagging for Code-Switched Data. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Bullock, et al. (2018) Predicting the presence of a Matrix Language in code-switching. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Soto, et al. (2018) The Role of Cognate Words, POS Tags, and Entrainment in Code-Switching. Interspeech [Paper]
- Barman, et al. (2016) Part-of-speech Tagging of Code-mixed Social Media Content: Pipeline, Stacking and Joint Modelling. 2nd Workshop on Computational Approaches to Code-Switching, EMNLP [Paper]
- Vyas, et al. (2014) POS Tagging of English-Hindi Code-Mixed Social Media Content. EMNLP [Paper]
- Heba Elfardy and Mona Diab. (2012) Token Level Identification of Linguistic Code Switching. COLING [Paper]
- Thamar Solorio and Yang Liu. (2008) Learning to Predict Code-Switching Points. EMNLP [Paper]
- Dau-Cheng Lyu and Ren-Yuan Lyu. (2008) Language Identification on Code-Switching Utterances Using Multiple Cues. Interspeech [Paper]
Corpus
- Whitehouse, et al. (2022) EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching. EMNLP [Paper] [Code]
- Lovenia, et al. (2022) ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation. LREC [Paper] [Dataset]
- Nguyen, et al. (2020) CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus. LREC [Paper]
- Umapathy, et al. (2020) Investigating Modelling Techniques for Natural Language Inference on Code-Switched Dialogues in Bollywood Movies. First Workshop on Speech Technologies for Code-switching in Multilingual Communities, Interspeech 2020 [Dataset]
- Xiang, et al. (2020) Sina Mandarin Alphabetical Words: A Web-driven Code-mixing Lexical Resource. AACL-IJCNLP [TBC]
- Chakravarthi, et al. (2020) Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text. SLTU (Spoken Language Technologies for Under-resourced Languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop, LREC [Paper]
- Khanuja, et al. (2020) A New Dataset for Natural Language Inference from Code-mixed Conversations. 4th Workshop of Computational Approaches to Linguistic Code-switching, LREC [Paper]
- Barik, et al. (2019) Normalization of Indonesian-English Code-Mixed Twitter Data. W-NUT, EMNLP [Paper] [Dataset]
- Singh, et al. (2018) A Twitter Corpus for Hindi-English Code Mixed POS Tagging. Sixth International Workshop on Natural Language Processing for Social Media, ACL [Paper]
- Li, et al. (2012) A Mandarin-English Code-Switching Corpus. LREC [Paper]
- Lyu, et al. (2010) SEAME: A Mandarin-English Code-Switching Speech Corpus in South-East Asia. Interspeech [Paper]
- Lyu, et al. (2010) An Analysis of a Mandarin-English Code-switching Speech Corpus: SEAME. Age [Paper]
Language Modeling and Speech Recognition
- Kumar, et al. (2020) Machine Learning based Language Modelling of Code Switched Data. International Conference on Electronics and Sustainable Communication Systems (ICESC) [Paper]
- Madhumani, et al. (2020) Learning not to Discriminate: Task Agnostic Learning for Improving Monolingual and Code-switched Speech Recognition. Arxiv [Paper]
- Shah, et al. (2020) Learning to Recognize Code-switched Speech Without Forgetting Monolingual Speech Recognition. Arxiv [Paper]
- Winata, et al. (2020) Meta-Transfer Learning for Code-Switched Speech Recognition. ACL [Paper] [Code]
- Chandu, et al. (2020) Style Variation as a Vantage Point for Code-Switching. Arxiv [Paper]
- Ganji Sreeram and Rohit Sinha (2020) Exploration of End-to-End Framework for Code-Switching Speech Recognition Task: Challenges and Enhancements. IEEE Access [Paper]
- Winata, et al. (2019) Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences. CoNLL [Paper]
- Hila Gonen and Yoav Goldberg (2019) Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training. EMNLP [Paper]
- Lee, et al. (2019) Linguistically Motivated Parallel Data Augmentation for Code-switch Language Modeling. Interspeech [Paper]
- Victor Soto and Julia Hirschberg (2019) Improving Code-Switched Language Modeling Performance Using Cognate Features. Interspeech [Paper]
- Chang, et al. (2019) Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation. Interspeech [Paper]
- Zeng, et al. (2019) On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition. Interspeech [Paper]
- Taneja, et al. (2019) Exploiting Monolingual Speech Corpora for Code-mixed Speech Recognition. Interspeech [Paper]
- Shan, et al. (2019) Investigating End-to-end Speech Recognition for Mandarin-english Code-switching. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [Paper]
- Grandee Lee and Haizhou Li. (2019) Word and Class Common Space Embedding for Code-switch Language Modelling. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [Paper]
- Hamed, et al. (2019) Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English. International Conference on Speech and Computer [Paper]
- Winata, et al. (2018) Learn to Code-Switch: Data Augmentation using Copy Mechanism on Language Modeling. Arxiv [Paper]
- Winata, et al. (2018) Towards End-to-end Automatic Code-Switching Speech Recognition. Arxiv [Paper]
- Nakayama, et al. (2018) Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS. IEEE Spoken Language Technology Workshop (SLT) [Paper]
- Jesse Emond, Bhuwana Ramabhadran, Brian Roark, Pedro Moreno, and Min Ma. (2018) Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance. IEEE Spoken Language Technology Workshop (SLT) [Paper]
- Ganji Sreeram and Rohit Sinha. (2018) Exploiting Parts-of-Speech for Improved Textual Modeling of Code-Switching Data. 2018 Twenty Fourth National Conference on Communications (NCC) [Paper]
- Garg, et al. (2018) Code-switched Language Models Using Dual RNNs and Same-Source Pretraining. EMNLP [Paper]
- Ewald van der Westhuizen and Thomas R. Niesler. (2018) Synthesised bigrams using word embeddings for code-switched ASR of four South African language pairs. Computer Speech and Language [Paper]
- Biswal, et al. (2018) Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech. Interspeech [Paper]
- Winata, et al. (2018) Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper] [Code]
- Chandu, et al. (2018) Language Informed Modeling of Code-Switched Text. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Pratapa, et al. (2018) Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data. ACL [Paper]
- Sivasankaran, et al. (2018) Phone Merging For Code-Switched Speech Recognition. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Garg, et al. (2018) Dual Language Models for Code Switched Speech Recognition. Interspeech [Paper]
- Baheti, et al. (2017) Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks. ICON [Paper]
- Adel, et al. (2015) Syntactic and Semantic Features For Code-Switching Factored Language Models. IEEE Transactions on Audio, Speech, and Language Processing [Paper]
- Ying Li and Pascale Fung. (2014) Code switch language modeling with Functional Head Constraint. ICASSP [Paper]
- Ying Li and Pascale Fung. (2014) Language Modeling with Functional Head Constraint for Code Switching Speech Recognition. EMNLP [Paper]
- Adel, et al. (2013) Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling. ACL [Paper]
- Adel, et al. (2013) Recurrent neural network language modeling for code switching conversational speech. ICASSP [Paper]
- Vu, et al. (2012) A First Speech Recognition System for Mandarin-English Code-Switch Conversational Speech. ICASSP [Paper]
- Ying Li and Pascale Fung. (2012) Code-switch Language Model with Inversion Constraints for Mixed Language Speech Recognition. COLING [Paper]
- Li, et al. (2011) Asymmetric acoustic modeling of mixed language speech. ICASSP [Paper]
Discourse
- Sravani, et al. (2021) Political Discourse Analysis: A Case Study of Code Mixing and Code Switching in Political Speeches. Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper]
Generation
- Gupta, et al. (2020) A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning. Findings of EMNLP [Paper]
- Bryan Gregorius and Takeshi Okadome (2022) Generating Code-Switched Text from Monolingual Text with Dependency Tree. The 20th Annual Workshop of the Australasian Language Technology Association [Paper] [Code]
Speech Synthesis
- Sai Krishna Rallabandi and Alan W Black (2019) Variational Attention using Articulatory Priors for generating Code Mixed Speech using Monolingual Corpora. Interspeech [Paper]
- Sai Krishna Rallabandi and Alan W Black (2017) On Building Mixed Lingual Speech Synthesis Systems. Interspeech [Paper]
- Chandu, et al. (2017) Speech Synthesis for Mixed-Language Navigation Instructions. Interspeech [Paper]
Metric
- Guzman, et al. (2017) Metrics for modeling code-switching across corpora. Interspeech [Paper]
Representation Learning
- Prasad, et al. (2021) The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding. Proceedings of the 1st Workshop on Multilingual Representation Learning, EMNLP [Paper]
- Winata, et al. (2021) Are Multilingual Models Effective in Code-Switching? Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper]
- Rizal, et al. (2020) Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data. Proceedings of the 4th Workshop on Computational Approaches to Code Switching (CALCS), LREC [Paper]
- Winata, et al. (2019) Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition. EMNLP [Paper] [Code]
- Pratapa, et al. (2018) Word Embeddings for Code-Mixed Language Processing. EMNLP [Paper]
Machine Translation
- Gaser, et al. (2023) Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text. EACL [Paper]
- Vivek Srivastava and Mayank Singh (2020) PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation. W-NUT, EMNLP [Paper] [Dataset]
- Thoudam Doren Singh and Thamar Solorio. (2017) Towards Translating Mixed-Code Comments from Social Media. CICLing [Paper]
NLU
- Krishnan, et al. (2021) Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent Prediction and Slot Filling. MRL, EMNLP [Paper]
Named Entity Recognition
- Priyadharshini, et al. (2020) Named Entity Recognition for Code-Mixed Indian Corpus using Meta Embedding. 6th International Conference on Advanced Computing and Communication Systems (ICACCS) [Paper]
- Winata, et al. (2019) Learning Multilingual Meta-Embeddings for Code-Switching Named Entity Recognition. RepL4NLP, ACL [Paper] [Code]
- Aguilar, et al. (2018) Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Wang, et al. (2018) Code-Switched Named Entity Recognition with Embedding Attention. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Winata, et al. (2018) Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
- Aguilar, et al. (2017) A Multi-task Approach for Named Entity Recognition in Social Media Data. 3rd Workshop on Noisy User-generated Text, EMNLP [Paper]
Linguistics
- Li Nguyen. (2018) Borrowing or Code-switching? Traces of community norms in Vietnamese-English speech. Australian Journal of Linguistics 38.4 (2018): 443-466. [Paper]
- Fairchild, Sarah, and Janet G. Van Hell. (2017) Determiner-noun code-switching in Spanish heritage speakers. Bilingualism: Language and Cognition 20.1 (2017): 150-161. [Paper]
- Bhatt, Rakesh M., and Agnes Bolonyai. (2011) Code-switching and the optimal grammar of bilingual language use. Bilingualism: Language and Cognition 14.4 (2011): 522-546. [Paper]
- Lipski (2005) Code-switching or Borrowing? No sé so no puedo decir, you know. Second Workshop on Spanish Sociolinguistics [Paper]
- Roberto R. Heredia and Jeanette Altarriba (2001) Bilingual Language Mixing: Why Do Bilinguals Code-Switch? SAGE Publications [Paper]
- Belazi, et al. (1994) Code switching and X-bar theory: The functional head constraint. Linguistic inquiry Vol 25 No.2 Spring [Paper]
- Shana Poplack (1980) Sometimes I’ll start a sentence in Spanish y termino en español: toward a typology of code-switching. Linguistics 18(7-8) [Paper]
- Pfaff, Carol W. (1979) Constraints on language mixing: intrasentential code-switching and borrowing in Spanish/English. Language: 291-318. [Paper]
- Shana Poplack (1978) Syntactic structure and social function of code-switching. Vol. 2. Centro de Estudios Puertorriqueños, City University of New York [Paper]
- Gumperz, J. J., & Hernandez, E. (1969) Cognitive aspects of bilingual communication. Institute of International Studies, University of California [Paper]
Affective Computing
- Chakravarthi, et al. (2021) DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text. Arxiv [Paper] [Code and Dataset]
- Siddharth Yadav (2020) Unsupervised Sentiment Analysis for Code-mixed Data. Arxiv [Paper] [Code]
- Wang, et al. (2017) Emotion Analysis in Code-Switching Text With Joint Factor Graph Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing [Paper]
- Wang, et al. (2016) A Bilingual Attention Network for Code-switched Emotion Prediction. COLING [Paper]
- Sophia Lee and Zhongqing Wang (2015) Emotion in Code-switching Texts: Corpus Construction and Analysis. Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing [Paper]
- Wang, et al. (2015) Emotion Detection in Code-switching Texts via Bilingual and Sentimental Information. ACL [Paper]
Dialog and Conversational System
- Gupta, et al. (2018) Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural based Question Answering. CoNLL [Paper]
Syntax
- Kodali, et al. (2022) SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing. Findings of ACL [Paper]
- Özlem Çetinoglu and Çagrı Çöltekin (2019) Challenges of Annotating a Code-Switching Treebank. SyntaxFest [Paper]
Adversarial Attack
- Samson Tan and Shafiq Joty (2021) Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots. NAACL [Paper]
Social Linguistics
- Bolock, et al. (2020) Who, When and Why: The 3 Ws of Code-Switching. International Conference on Practical Applications of Agents and Multi-Agent Systems [Paper]
- Yoder, et al. (2017) Code-Switching as a Social Act: The Case of Arabic Wikipedia Talk Pages. Proceedings of the Second Workshop on Natural Language Processing and Computational Social Science, ACL [Paper]
- Agarwal, et al. (2017) I may talk in English but gaali toh Hindi mein hi denge: A study of English-Hindi code-switching and swearing pattern on social networks. International Conference on Communication Systems and Networks (COMSNETS) [Paper]
Benchmark
- Khanuja, et al. (2020) GLUECoS : An Evaluation Benchmark for Code-Switched NLP. ACL [Paper]
- Aguilar, et al. (2020) LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation. LREC [Paper]
Social Media
- Bali, et al. (2014) “I am borrowing ya mixing ?” An Analysis of English-Hindi Code Mixing in Facebook. Proceedings of The First Workshop on Computational Approaches to Code Switching [Paper]
Text Normalization
- Dwija Parikh and Thamar Solorio (2021) Normalization and Back-Transliteration for Code-Switched Data. Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper]
Toolkit
Synthetic Data Generation Toolkit
- Jayanthi, et al. (2021) CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing. Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper] [Code]
- Rizvi, et al. (2021) GCM: A Toolkit for Generating Synthetic Code-mixed Text. EACL (System Demonstrations) [Paper] [Code]
Annotation Toolkit
- Shah, et al. (2019) CoSSAT: Code-Switched Speech Annotation Tool. Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP [Paper]
Summarization
Question Answering
- Gupta, et al. (2020) A Unified Framework for Multilingual and Code-Mixed Visual Question Answering. AACL-IJCNLP [TBA]
Dialog and Conversational System
- Bawa, et al. (2020) Do Multilingual Users Prefer Chat-bots that Code-mix? Let's Nudge and Find Out! ACM on Human-Computer Interaction [Paper]
- Banerjee, et al. (2018) A Dataset for Building Code-Mixed Goal Oriented Conversation Systems. COLING [Paper]
Position Paper
- Nguyen, et al. (2022) Building Educational Technologies for Code-Switching: Current Practices, Difficulties and Future Directions. Languages [Paper]
Books
- Torres Cacoullos and Travis (2018) Bilingualism in the Community. Cambridge University Press