Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering

Di Jiang, Yuanfeng Song, Rongzhong Lian, Siqi Bao, Jinhua Peng, H. He, Hua Wu
In the last decade, a variety of topic models have been proposed for text engineering. However, except for Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), most existing topic models are seldom applied or considered in industrial scenarios. This is because very few convenient tools support these topic models so far. Intimidated by the demanding expertise and labor of designing and implementing parameter inference… 

TopicOcean: An Ever-Increasing Topic Model With Meta-learning

The novel TopicOcean framework is proposed, which aims to integrate well-trained topic models and transfer the knowledge of accumulated topics to new corpora in order to improve the quality of their topic models.

TopicNet: Making Additive Regularisation for Topic Modelling Accessible

The module features include powerful model visualization techniques, various training strategies, semi-automated model selection, support for user-defined goal metrics, and a modular approach to topic model training.

Expectation maximisation on unsupervised web mined data using probability latent semantic analysis (PLSA) algorithm

  • Chengeta Kennedy
  • Computer Science
    2019 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD)
  • 2019
The study concluded that the PLSA algorithm is more efficient on processed data than raw web data and processing time was reduced when preprocessing was used to eliminate redundant latent variables.

A comparative study of online communities and popularity of BBS in four Chinese universities

A hypothesis test is introduced to infer individual preferred boards, which yields a polarization of users, and a two-step model based on users’ preferences and interests is devised to reproduce the observed connectivity patterns.

Federated Latent Dirichlet Allocation: A Local Differential Privacy Based Framework

FedLDA, a local differential privacy (LDP) based framework for federated learning of LDA models, contains a novel LDP mechanism called Random Response with Priori (RRP), which provides theoretical guarantees on both data privacy and model accuracy.

Learning to Select Context in a Hierarchical and Global Perspective for Open-Domain Dialogue Generation

A novel model with a hierarchical self-attention mechanism and distant supervision is proposed, which not only detects relevant words and utterances at short and long distances but also discerns related information globally when decoding.

L2RS: A Learning-to-Rescore Mechanism for Hybrid Speech Recognition

Experimental results show that L2RS outperforms not only traditional rescoring methods but also its deep neural network counterparts by a substantial margin of 20.85% in terms of NDCG@10.

Topic-Aware Dialogue Speech Recognition with Transfer Learning

A novel transfer learning mechanism is proposed to conduct topic-aware recognition for dialogue speech, and the proposed language model adaptation techniques are shown to effectively improve the performance of a state-of-the-art Automatic Speech Recognition (ASR) system.

Analysis and Research Based on Instrument Drift Data

A set of application systems for instrument data import, export, storage, and analysis, which can store instrument calibration data in a Hadoop database in a prescribed format, provides a valuable reference for the calibration of instrument data.




Familia: An Open-Source Toolkit for Industrial Topic Modeling

Familia abstracts the utilities of topic modeling in industry as two paradigms: semantic representation and semantic matching, and provides off-the-shelf topic models trained on large-scale industrial corpora.

Software Framework for Topic Modelling with Large Corpora

This work describes a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion, and implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation in a way that makes them completely independent of the training corpus size.

Latent Dirichlet Allocation

Finding scientific topics

  • T. Griffiths, M. Steyvers
  • Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 2004
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model. The algorithm is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.

Reducing the sampling complexity of topic models

An algorithm is proposed which scales linearly with the number of actually instantiated topics k_d in the document. For large document collections and in structured hierarchical models, k_d ≪ k, which yields an order of magnitude speedup.

Efficient methods for topic model inference on streaming document collections

Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.

Integrating Social and Auxiliary Semantics for Multifaceted Topic Modeling in Twitter

A unified framework for Multifaceted Topic Modeling from Twitter streams is proposed to jointly model latent semantics among the social terms from Twitter, auxiliary terms from the linked Web documents and named entities, and the temporal characteristics of each topic.

Supervised Topic Models

The supervised latent Dirichlet allocation (sLDA) model, a statistical model of labelled documents, is introduced, along with a maximum-likelihood procedure for parameter estimation that relies on variational approximations to handle intractable posterior expectations.

Using Metafeatures to Increase the Effectiveness of Latent Semantic Models in Web Search

In web search, latent semantic models have been proposed to bridge the lexical gap between queries and documents that is due to the fact that searchers and content creators often use different

Learning deep structured semantic models for web search using clickthrough data

A series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them are developed.