Waleed Ammar

We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in …
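As a rough illustration of the input representation described above, each token's features (a multilingual word embedding, a token-level language indicator, and fine-grained POS features) could be assembled into one concatenated vector. The function name and dimensions below are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical sketch: build a token's input vector by concatenating a
# multilingual word embedding, a one-hot language indicator, and a
# fine-grained POS feature vector. Names and sizes are illustrative only.
def token_representation(word_vec, lang_id, num_langs, pos_vec):
    lang_onehot = np.zeros(num_langs)
    lang_onehot[lang_id] = 1.0
    return np.concatenate([word_vec, lang_onehot, pos_vec])

rep = token_representation(np.zeros(100), lang_id=2, num_langs=7,
                           pos_vec=np.zeros(20))
# rep has 100 + 7 + 20 = 127 dimensions
```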
We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input’s latent representation is predicted conditional on the observed data using a feature-rich conditional random field (CRF). Then a reconstruction of the input is (re)generated, conditional on the latent structure, using a generative model …
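A toy sketch of the objective such a framework implies (marginalize the reconstruction over latent labelings, weighted by the conditional model) might look like the following. The scoring functions are made-up stand-ins, not the paper's model, and the brute-force enumeration over labelings is feasible only for tiny examples.

```python
import itertools
import math
import numpy as np

# Toy sketch (assumed, not the paper's implementation): compute
# log sum_y p(y | x) * prod_i p(x_i | y_i), where p(y | x) comes from
# exponentiated feature scores and the reconstruction is a per-token
# generative model. Enumeration over y is for illustration only.
def crf_autoencoder_loglik(x, labels, crf_score, recon_logprob):
    all_y = list(itertools.product(labels, repeat=len(x)))
    log_unnorm = [crf_score(x, y) for y in all_y]
    log_z = np.logaddexp.reduce(log_unnorm)  # CRF partition function
    terms = [
        s - log_z + sum(recon_logprob(xi, yi) for xi, yi in zip(x, y))
        for y, s in zip(all_y, log_unnorm)
    ]
    return np.logaddexp.reduce(terms)

# Made-up scoring functions for a two-label toy problem.
def crf_score(x, y):
    return sum(1.0 if xi[0] == yi else 0.0 for xi, yi in zip(x, y))

def recon_logprob(xi, yi):
    return math.log(0.6 if xi[0] == yi else 0.4)

ll = crf_autoencoder_loglik(["Apple", "Banana"], ["A", "B"],
                            crf_score, recon_logprob)
# ll is a log marginal likelihood, so it is negative
```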
We introduce new methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space. Our estimation methods, multiCluster and multiCCA, use dictionaries and monolingual data; they do not require parallel data. Our new evaluation method, multiQVEC-CCA, is shown to correlate better than previous ones with …
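The canonical correlation analysis at the heart of a method like multiCCA can be sketched with a standard SVD-based CCA over dictionary-paired embeddings. The implementation below is a generic CCA sketch under that assumption, not the authors' code, and the variable names are my own.

```python
import numpy as np

# Generic SVD-based CCA sketch (assumed): given n dictionary-paired
# embeddings X (n x d1) and Y (n x d2), find projections into a shared
# k-dimensional space where paired words are maximally correlated.
def cca_projections(X, Y, k):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, Sx, Vtx = np.linalg.svd(X, full_matrices=False)  # whiten X
    Uy, Sy, Vty = np.linalg.svd(Y, full_matrices=False)  # whiten Y
    U, S, Vt = np.linalg.svd(Ux.T @ Uy)  # correlate whitened coordinates
    A = Vtx.T @ np.diag(1.0 / Sx) @ U[:, :k]     # d1 x k projection
    B = Vty.T @ np.diag(1.0 / Sy) @ Vt.T[:, :k]  # d2 x k projection
    return A, B

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
Y = rng.standard_normal((200, 40))
A, B = cca_projections(X, Y, k=10)
# X @ A and Y @ B now live in the same 10-dimensional shared space
```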
Online discussion forums, known as forums for short, are conversational social cyberspaces constituting rich repositories of content and an important source of collaborative knowledge. However, most of this knowledge is buried inside the forum infrastructure and its extraction is both complex and difficult. The ability to automatically rate postings in …
Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace …
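One concrete way to use embeddings in place of multinomial word emissions in a POS-induction model is a per-tag Gaussian density over the word's embedding. The diagonal-covariance sketch below is a minimal illustration of that idea, not the paper's exact parameterization.

```python
import numpy as np

# Sketch (illustrative): score a word's embedding under a per-tag
# diagonal Gaussian emission, an embedding-based replacement for a
# multinomial emission in a POS-induction model.
def gaussian_emission_logprob(embedding, mean, var):
    return -0.5 * np.sum(np.log(2.0 * np.pi * var)
                         + (embedding - mean) ** 2 / var)

# A word's most likely tag is then the argmax over the tag Gaussians.
tag_means = {"NOUN": np.zeros(2), "VERB": np.ones(2)}
emb = np.array([0.9, 1.1])
best = max(tag_means, key=lambda t: gaussian_emission_logprob(
    emb, tag_means[t], np.ones(2)))
# best is "VERB", since emb lies closer to the VERB mean
```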
We describe the CMU systems submitted to the 2013 WMT shared task in machine translation. We participated in three language pairs: French–English, Russian–English, and English–Russian. Our particular innovations include: a label-coarsening scheme for syntactic tree-to-tree translation and the use of specialized modules to create “synthetic translation …”
We train a language-universal dependency parser on a multilingual collection of treebanks. The parsing model uses multilingual word embeddings alongside learned and specified typological information, enabling generalization based on linguistic universals and based on typological similarities. We evaluate our parser’s performance on languages in the training …
Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross-language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character …
We describe the CMU submission for the 2014 shared task on language identification in code-switched data. We participated in all four language pairs: Spanish–English, Mandarin–English, Nepali–English, and Modern Standard Arabic–Arabic dialects. After describing our CRF-based baseline system, we discuss three extensions for learning from unlabeled data: …
Privacy policies are a nearly ubiquitous feature of websites and online services, and the contents of such policies are legally binding for users. However, the obtuse language and sheer length of most privacy policies tend to discourage users from reading them. We describe a pilot experiment to use automatic text categorization to answer simple categorical …