Speech and Language @ TTIC: Resources

Below are publicly available resources that we have helped develop:

SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction (code, demo)
ML-SUPERB 2.0 multilingual and multi-dialect speech recognition challenge (data, leaderboard)
Layer-wise speech representation model analysis codebase (code)
The OpenASL Open-Domain Sign Language Translation Dataset (data)
SLUE: Spoken Language Understanding Evaluation benchmark (data, code, leaderboard)
NatCat: naturally annotated category-text pairs for training text classifiers (data)
TVRecap: A dataset for generating stories with character descriptions (data)
WikiTableT: A data-to-text dataset pairing Wikipedia article sections with diverse data sources (data)
SummScreen: An abstractive screenplay summarization dataset derived from TV series transcripts and human-written recaps (data)
The Chicago Fingerspelling in the Wild data sets (data)
Acoustic and acoustically grounded word embeddings (code, embeddings)
PARAGRAM, PARAGRAM-PHRASE, and CHARAGRAM embeddings of words and sentences (code, embeddings, data, pretrained models)
Deep CCA and related methods for multi-view representation learning (code)
Who did What: A Large-Scale Person-Centered Cloze Dataset (data, leaderboard)