GPTopic: Dynamic and Interactive Topic Representations


Notes

Background to the paper

Thielmann 2020

  • Unsupervised document classification for imbalanced data sets poses a major challenge, as it requires a carefully curated dataset.
  • The authors propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modeling into a multi-step classification rule that circumvents manual labeling (see the sketch below).
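
A minimal sketch of that multi-step idea, not the authors' code: the toy documents, the TF-IDF features, and the SVM/LDA parameters here are all illustrative assumptions. It fits a one-class SVM on a small curated in-domain set, filters an uncurated scraped corpus with it, and runs LDA on whatever survives.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import OneClassSVM
from sklearn.decomposition import LatentDirichletAllocation

seed_docs = ["renewable energy policy report", "solar panel subsidy announcement"]    # curated in-domain texts (toy)
scraped_docs = ["wind turbine maintenance costs", "football transfer rumours today"]  # uncurated scraped texts (toy)

# Step 1: one-class SVM learns the region occupied by the in-domain documents
tfidf = TfidfVectorizer().fit(seed_docs + scraped_docs)
ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(tfidf.transform(seed_docs))

# Step 2: keep only the scraped documents the SVM accepts as in-domain (+1)
kept = [d for d in scraped_docs if ocsvm.predict(tfidf.transform([d]))[0] == 1]

# Step 3: LDA topic model on the filtered corpus (fall back to the seeds if nothing passes)
counts = CountVectorizer().fit_transform(kept or seed_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).round(2))   # document-topic distributions
```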

Weisser 2023

  • Topic modeling with methods like Latent Dirichlet Allocation (LDA), the most commonly used model, works well for long texts, but is challenging for short texts such as tweets.
  • The authors compared the performance of LDA against the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data.
  • They found that GSDMM and GPM perform better on sparse data (a toy illustration of this sparsity follows these bullets).
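
A rough illustration, not taken from the paper, of why tweet-length texts are hard for LDA: the made-up tweets below produce an extremely sparse document-term matrix, so each per-document topic mixture is estimated from only a handful of tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "new phone battery dies so fast",
    "love this little coffee shop",
    "train delayed again this morning",
    "what a great match last night",
]
X = CountVectorizer().fit_transform(tweets)
print(f"non-zero share of the document-term matrix: {X.nnz / (X.shape[0] * X.shape[1]):.0%}")

# LDA still runs, but each topic mixture rests on ~5 tokens; this is the regime
# where the one-topic-per-document assumption of GSDMM and GPM tends to help.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X).round(2))
```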

Notes on GPTopic

  • Pipeline: embedding → dimensionality reduction → clustering → top words (I-TF-IDF and cosine similarity) → LLM (GPT-4 and GPT-3.5) for topic naming and description (see the pipeline sketch below).
  • Answering questions using an LLM (ChatGPT plus a RAG implementation); see the retrieval sketch below.
  • Topic modification (see the splitting sketch below):
    • splitting a topic using keywords
    • splitting a topic using k-means
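
Pipeline sketch, with assumed stand-ins rather than the paper's exact code: the all-MiniLM-L6-v2 sentence-transformer for embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, a simple class-based TF-IDF score for top words, a 20-newsgroups subset as demo corpus, and the GPT call shown only as the prompt string.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import umap
import hdbscan

# Small demo corpus (stand-in for one's own documents)
docs = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.sport.hockey", "talk.politics.mideast"],
    remove=("headers", "footers", "quotes"),
).data[:600]

# 1. Embedding
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs, show_progress_bar=False)

# 2. Dimensionality reduction
red = umap.UMAP(n_components=5, random_state=0).fit_transform(emb)

# 3. Clustering (HDBSCAN marks outliers with label -1)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(red)

# 4. Top words per cluster via a simple class-based TF-IDF score
vec = CountVectorizer(stop_words="english", max_features=5000)
X = vec.fit_transform(docs)
vocab = np.array(vec.get_feature_names_out())
idf = np.log(1 + X.shape[0] / (1 + np.asarray((X > 0).sum(axis=0)).ravel()))

for topic in sorted(set(labels) - {-1}):
    counts = np.asarray(X[labels == topic].sum(axis=0)).ravel()
    score = (counts / max(counts.sum(), 1)) * idf
    top_words = vocab[np.argsort(score)[::-1][:10]]

    # 5. LLM step: the prompt that would be sent to GPT-4 / GPT-3.5
    prompt = ("Provide a short topic name and a one-sentence description "
              f"for a topic with these top words: {', '.join(top_words)}")
    print(topic, list(top_words))
    print(prompt, "\n")
```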
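Retrieval sketch for the RAG-style question answering step, again with assumed pieces (the same sentence-transformer, toy documents, top-2 retrieval); the actual ChatGPT API call is omitted and only the assembled prompt is printed.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
docs = [
    "The new subsidy scheme supports rooftop solar installations.",
    "The hockey playoffs start next week with four teams.",
    "NASA's rover collected new rock samples on Mars.",
]
doc_emb = model.encode(docs)

question = "What is said about space exploration?"
q_emb = model.encode([question])

# Retrieve the two most similar documents as context
top_k = np.argsort(cosine_similarity(q_emb, doc_emb)[0])[::-1][:2]
context = "\n".join(docs[i] for i in top_k)

# This prompt would then be passed to ChatGPT
prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```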
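Splitting sketch for the k-means variant of topic modification, a minimal interpretation rather than the authors' implementation: the documents of one topic are re-clustered on their embeddings and the old topic label is replaced by fresh sub-topic labels (the embeddings and labels here are random stand-ins).

```python
import numpy as np
from sklearn.cluster import KMeans

def split_topic(embeddings, labels, topic_id, n_subtopics=2, seed=0):
    """Replace `topic_id` with `n_subtopics` new labels obtained by running
    k-means on the embeddings of the documents currently assigned to it."""
    labels = np.asarray(labels).copy()
    idx = np.where(labels == topic_id)[0]
    sub = KMeans(n_clusters=n_subtopics, random_state=seed, n_init=10).fit_predict(embeddings[idx])
    new_ids = labels.max() + 1 + np.arange(n_subtopics)   # fresh topic ids
    labels[idx] = new_ids[sub]
    return labels

# Example with random stand-in embeddings and topic labels
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
labels = rng.integers(0, 3, size=100)
print(np.unique(split_topic(emb, labels, topic_id=1)))   # topic 1 replaced by two new ids
```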