Below are some publicly available resources that we have helped develop:
- Layer-wise speech representation model analysis codebase (code)
- The OpenASL Open-Domain Sign Language Translation Dataset (data)
- SLUE: Spoken Language Understanding Evaluation benchmark (data, code, leaderboard)
- NatCat: naturally annotated category-text pairs for training text classifiers (data)
- TVRecap: A dataset for generating stories with character descriptions (data)
- WikiTableT: A data-to-text dataset pairing Wikipedia article sections with diverse data sources (data)
- SummScreen: An abstractive screenplay summarization dataset derived from TV series transcripts and human-written recaps (data)
- The Chicago Fingerspelling in the Wild data sets (data)
- Acoustic and acoustically grounded word embeddings (code, embeddings)
- PARAGRAM, PARAGRAM-PHRASE, and CHARAGRAM embeddings of words and sentences (code, embeddings, data, pretrained models)
- Deep CCA and related methods for multi-view representation learning (code)
- Who did What: A Large-Scale Person-Centered Cloze Dataset (data, leaderboard)