Hi Experts,
We are working on a text classification problem. We have around 80K records across roughly 50 classes, and the data is highly imbalanced. The dataset has two columns: one with the description text and one with the class label.
So far we have tried the following models and techniques:
- Data preprocessing:
  - Lowercase conversion, removal of numeric text and punctuation
  - Removal of unimportant words and stop words
  - Lemmatization
  - TF-IDF transformation
- scikit-learn models:
  - Linear SVC
  - Linear Regression
  - Logistic Regression
  - Decision Trees
  - Random Forest
- Hugging Face Transformers:
  - Google BERT
  - DistilBERT
- SMOTE sampling
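
For reference, here is a minimal sketch of the kind of TF-IDF plus Linear SVC setup listed above. It is not our exact code: the stop-word list, the toy texts and labels, and the helper names `clean_text` and `build_model` are placeholders made up for illustration. Lemmatization is left out so the snippet only needs scikit-learn, and "removal of numeric text" is interpreted here as stripping runs of digits.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder stop-word list (assumption); a real list would be much larger.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "for", "in", "on", "my"}

# Translation table that maps every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lowercase, strip digit runs and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)   # remove numeric text
    text = text.translate(PUNCT_TABLE)  # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def build_model():
    """TF-IDF features followed by a linear SVM classifier."""
    return Pipeline([
        # Passing a callable as `preprocessor` replaces the vectorizer's
        # built-in lowercasing step with our own cleaning function.
        ("tfidf", TfidfVectorizer(preprocessor=clean_text)),
        ("clf", LinearSVC(random_state=0)),
    ])


# Toy stand-in for the real description/class columns (hypothetical data).
texts = [
    "Refund not received for order 4521",
    "Please refund my payment",
    "App crashes on login screen",
    "Login page shows error 500",
]
labels = ["billing", "billing", "technical", "technical"]

model = build_model()
model.fit(texts, labels)
print(model.predict(["refund for payment"]))
```

On the real data, `texts` and `labels` would come from the two columns of the dataset, and the model would be evaluated on a held-out split rather than fitted on everything.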
The best accuracy we have obtained so far is 70% (with Random Forest and Google BERT).
Is there any scope to improve the accuracy?
If yes, what other techniques or models can we use to improve it?