GPTopic: Dynamic and Interactive Topic Representations
Published:
- Authors: Arik Reuter, Anton Thielmann, Christoph Weisser, Sebastian Fischer, Benjamin Säfken
- Link: https://arxiv.org/abs/2403.03628
- code:
This paper introduces GPTopic, a novel approach to dynamic and interactive topic representations that combines traditional topic modeling with modern language models for better interpretability and user interaction.
Notes
Background to the paper
- Unsupervised document classification of imbalanced data sets poses a major challenge, as supervised alternatives require carefully curated, manually labelled datasets.
- The authors propose an integration of web scraping, one-class Support Vector Machines (SVM), and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling.
- Topic modeling with methods like Latent Dirichlet Allocation (LDA), the most commonly used model, works satisfactorily on long texts, but is challenging for short texts such as tweets.
- The authors compared the performance of LDA, the Gibbs Sampler Dirichlet Multinomial Model (GSDMM), and the Gamma Poisson Mixture Model (GPM), the latter two being specifically designed for sparse data.
- They found that GSDMM and GPM perform better on sparse, short-text data.
Notes on the method
- EMBEDDING → DIMENSIONALITY REDUCTION → CLUSTERING → TOP WORDS (I-TF-IDF and COSINE SIMILARITY) → LLM (GPT-4 and GPT-3.5) for topic naming and description
- Answering questions using LLM (ChatGPT + RAG Implementation)
- Topic Modification
- splitting using keywords
- splitting using k-means
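The pipeline bullets above can be sketched end-to-end. This is a minimal stand-in using only the standard library, not the paper's implementation: a bag-of-words "embedding" replaces the neural embedding model, nearest-seed assignment stands in for the clustering step, and the LLM naming call is left as a comment. `docs`, `embed`, and `top_words` are illustrative names I've chosen, not the paper's API.

```python
# Toy sketch of a GPTopic-style pipeline:
# embed -> cluster -> top words per cluster -> (LLM naming stub).
import math
from collections import Counter

docs = [
    "the cat sat on the mat", "cats and kittens purr",
    "stock prices rose today", "the market rallied on earnings",
]

vocab = sorted({w for d in docs for w in d.split()})

def embed(doc):
    # Bag-of-words vector; a real system would use a sentence-embedding
    # model here (and then reduce its dimensionality).
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [embed(d) for d in docs]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Clustering: assign each doc to the nearer of two seed documents
# (a stand-in for running k-means/HDBSCAN on reduced embeddings).
seeds = [vectors[0], vectors[2]]
labels = [max(range(2), key=lambda k: cosine(v, seeds[k])) for v in vectors]

def top_words(cluster_id, n=3):
    # Cluster-level tf-idf: term frequency inside the cluster, weighted
    # by inverse frequency across clusters (the "I-TF-IDF" idea).
    in_cluster = Counter(w for d, l in zip(docs, labels)
                         if l == cluster_id for w in d.split())
    clusters = set(labels)
    def idf(w):
        df = sum(1 for c in clusters
                 if any(w in d.split()
                        for d, l in zip(docs, labels) if l == c))
        return math.log(len(clusters) / df) + 1.0
    ranked = sorted(in_cluster, key=lambda w: in_cluster[w] * idf(w),
                    reverse=True)
    return ranked[:n]

for c in sorted(set(labels)):
    # In GPTopic, these top words (plus sample documents) would go into
    # an LLM prompt asking for a topic name and description.
    print(c, top_words(c))
```

The final step (and the RAG-based question answering) is where the LLM enters: the top words and representative documents are placed into a prompt, so the quality of the extracted top words directly shapes the generated topic names.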
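For the topic-modification step, splitting a topic with k-means amounts to re-clustering the embeddings of just that topic's documents with k = 2. Below is a toy standard-library k-means on made-up 2-D points; real topic embeddings would be high-dimensional, and `kmeans` and `topic_embeddings` are illustrative names, not the paper's code.

```python
# Toy k-means for splitting one overly broad topic into two subtopics.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to its cluster's mean.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

# Made-up 2-D "embeddings" of documents inside one broad topic.
topic_embeddings = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
                    (1.0, 1.1), (1.1, 1.0), (0.95, 1.05)]
centers, subtopics = kmeans(topic_embeddings, k=2)
```

Splitting by keywords works analogously, except documents are partitioned by whether they match a user-supplied keyword rather than by distance to learned centroids.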