Semantic Text Classification and Explainable AI

Seeking Advice: Legal Document Classification with XAI

Hi everyone,

I’m working on a project that classifies legal documents, with an emphasis on explainable AI (XAI) to build trust and interpretability for end users, especially legal practitioners. The dataset consists of lengthy, unstructured legal texts (court rulings, arguments, etc.). I’d love to hear your insights and advice on some challenges I’m facing!


My Work So Far

Preprocessing

  1. Boilerplate Removal: Filters repetitive legal jargon and irrelevant text.
  2. Stopword Removal
  3. NER and Lemmatization: Extracts key entities and normalizes words.
  4. Hierarchical Chunking: Splits long documents into smaller chunks with overlaps to retain context.
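
To make step 4 concrete, the overlapping split can be sketched roughly as below. This is a minimal illustration rather than the exact pipeline code: the function name `chunk_tokens`, the default sizes, and the use of whitespace tokens (instead of LegalBERT's subword tokens) are all assumptions, and it covers only the sliding-window part, not any grouping by section or paragraph.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of up to `chunk_size` tokens.

    Consecutive windows share `overlap` tokens so that context at the
    boundary is not lost. Names and defaults are illustrative assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end of the document, so no
        # trailing window made up purely of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Whitespace split stands in for a real tokenizer (assumption).
    text = "the appellant contends that the lower court erred in law"
    for chunk in chunk_tokens(text.split(), chunk_size=4, overlap=1):
        print(" ".join(chunk))
```

With a 512-token model limit, `chunk_size` would also need to leave room for special tokens such as `[CLS]` and `[SEP]`, and each chunk inherits the label of its parent document unless chunk-level labels exist.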

Model Architecture

I’m using a LegalBERT-based classifier fine-tuned for legal text understanding:

  1. LegalBERT: Extracts contextual embeddings.
  2. Neural Layers:
  • BiLSTM + GRU: Captures sequential dependencies and contextual patterns.
  • GlobalMaxPooling
  • Dense + Dropout
  3. Output Layer: Softmax for classifying legal categories.

Despite my efforts, I’ve only reached ~50% accuracy on the test data. I suspect that better preprocessing or semantic integration could improve performance.


Challenges and Questions

  1. Long and Unstructured Documents:
  • My dataset consists of lengthy, unstructured texts. Are there efficient techniques for preprocessing or segmenting such data to better capture semantics and structure?
  2. Incorporating Semantics and Rhetorical Roles:
  • I’d like to integrate semantic understanding into the model and identify rhetorical roles (e.g., facts, issues, arguments) automatically. Are there any pre-trained models or frameworks you recommend for this?
  3. Explainability:
  • For clear and effective explainability, are attention mechanisms, SHAP, or LIME suitable for legal contexts? Are there other XAI approaches tailored for legal document classification?

Focus

My primary focus is on improving semantic integration in the classification process.

If you’ve faced similar challenges or have insights on tools, frameworks, or strategies for legal AI projects, I’d love to hear about your experiences!

Thank you in advance for your time and support!