This repository showcases an advanced implementation for detecting Bengali hate speech using BanglaBERT and XLM-R models. It incorporates cutting-edge deep learning techniques, preprocessing methods, and tools to classify Bengali text into hate or non-hate speech categories. The repository is structured to provide an end-to-end solution, from data processing to model training and deployment.
- Purpose: Serves as the primary documentation for the BanglaBERT implementation. It provides detailed explanations of the techniques used, the purpose of each component, and step-by-step guidance for users.
- Purpose: Implements Optical Character Recognition (OCR) for Bengali text. This notebook processes images containing Bengali text and converts them into readable text format, enabling further natural language processing tasks.
- Purpose: The main README file for the repository. It includes an overview of the repository, setup instructions, and details on how to run each module.
- Purpose: Contains the implementation of BanglaBERT for hate speech detection (a minimal loading sketch follows this file list). This notebook includes:
- Data loading and preprocessing.
- Model training using the BanglaBERT tokenizer and model.
- Evaluation metrics like accuracy, precision, recall, F1-score, and AUC.
- Purpose: Hugging Face code for running inference with the trained model.
- Purpose: A specialized module for extracting text from images. This notebook integrates with the Gemini platform to convert Bengali text images into text, supporting OCR functionalities.
- Purpose: Implements the XLM-R model for multilingual hate speech detection. This notebook leverages the cross-lingual capabilities of XLM-R for handling code-mixed text and evaluating the model's performance on Bengali hate speech datasets.
- Purpose: An updated version of the XLM-R implementation. It includes:
- Fine-tuning strategies for improved performance.
- Advanced evaluation metrics.
- Comparisons with the BanglaBERT model.
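As a rough sketch of the workflow in banglabert.ipynb (referenced above), the core loading and preprocessing steps look roughly like this; the hub id `csebuetnlp/banglabert` is the public BanglaBERT checkpoint, while the CSV path and column names are hypothetical:

```python
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the public BanglaBERT checkpoint for binary (hate / non-hate) classification
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=2
)

# Hypothetical CSV with "text" and "label" columns
df = pd.read_csv("data/hate_speech.csv")
encodings = tokenizer(
    df["text"].tolist(),
    padding=True,        # pad the batch to a common length
    truncation=True,     # cut inputs longer than max_length
    max_length=128,
    return_tensors="pt",
)
```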
The repository primarily uses Jupyter Notebooks and Python. Its key functionalities include:
- Preprocessing and normalization for Bengali text.
- Data augmentation techniques like synonym replacement, random insertion, and deletion (see the sketch after this list).
- Fine-tuning and training of pretrained models (BanglaBERT and XLM-R).
- Evaluation metrics for model performance.
- Model saving and deployment via Hugging Face's platform.
- OCR functionalities for Bengali text images.
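For the augmentation step listed above, here is a minimal illustrative sketch (not the repository's exact implementation; the `synonyms` lookup table is a hypothetical stand-in for a real Bengali synonym resource):

```python
import random

def random_deletion(tokens, p=0.1):
    """Randomly drop each token with probability p (never return an empty list)."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def synonym_replacement(tokens, synonyms, n=1):
    """Replace up to n tokens that have an entry in the synonym lookup."""
    tokens = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    random.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = random.choice(synonyms[tokens[i]])
    return tokens

# Hypothetical Bengali synonym dictionary for demonstration only
synonyms = {"খারাপ": ["মন্দ"]}
print(synonym_replacement("সে খুব খারাপ".split(), synonyms))
```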
- Clone the repository:
  git clone https://github.com/tajuar-akash-hub/Bengali_Hate_Speech_Detection_Using_BanglaBERT_-_XLM_R_
  cd Bengali_Hate_Speech_Detection_Using_BanglaBERT_-_XLM_R_
- Install dependencies:
  pip install normalizer transformers torch datasets pandas scikit-learn
- Explore the notebooks for specific functionalities:
  - For BanglaBERT implementation: banglabert.ipynb
  - For OCR: Bangla_OCR.ipynb
  - For enhanced hate speech detection: enhanced_banglabert_hate_speech.ipynb
Contributions to improve the repository and its functionalities are welcome. Fork the repository and create a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
The normalizer is a utility library specifically designed for text normalization in Bengali. It:
- Removes unnecessary characters or symbols.
- Fixes formatting issues like extra spaces.
- Ensures the text is consistent and ready for processing by models like BanglaBERT.
By using the normalizer, preprocessing becomes more efficient, improving tokenization and subsequent model performance.
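Typical usage of the `normalizer` package (the same one installed earlier) looks like this; the sample sentence is arbitrary:

```python
from normalizer import normalize  # pip install normalizer

raw_text = "এটা   একটি   উদাহরণ   বাক্য !!"
clean_text = normalize(raw_text)  # unicode normalization plus cleanup
print(clean_text)
```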
A wheel is a distribution format used for Python packages. It is a pre-built package that can be installed without the need to compile code, making installations faster and more reliable. When you see "Building wheel" during installation, it means the library is being built into this format.
Tokenization is the process of splitting text into smaller units, such as words, subwords, or characters. For BanglaBERT, the tokenizer:
- Converts raw text into numerical inputs that the model can process.
- Handles out-of-vocabulary words through subword tokenization.
- Ensures uniform input size via padding and truncation.
Tokenization aligns the input text with the model's pretrained vocabulary, significantly improving its ability to understand and process the text.
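A small sketch of these ideas with the BanglaBERT tokenizer (`csebuetnlp/banglabert` is the public checkpoint; the sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
enc = tokenizer(
    "বাংলা টেক্সটের একটি উদাহরণ",
    padding="max_length",  # pad shorter inputs up to max_length
    truncation=True,       # cut longer inputs down to max_length
    max_length=32,
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 32])
# Inspect the subword pieces the text was split into
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())[:6])
```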
Weight decay is a regularization technique used to prevent overfitting. It:
- Penalizes large weights by adding a term to the loss function.
- Encourages the model to prefer smaller weights, making it less complex and more generalizable.
- Is implemented in the `AdamW` optimizer, which combines weight decay with adaptive learning rates.
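In PyTorch this is a one-liner; a minimal sketch (0.01 and 2e-5 are common defaults, not necessarily the repository's exact settings):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=2
)
# weight_decay adds a decoupled L2-style penalty on the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```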
Utility functions are helper functions that perform specific tasks to simplify the main operations of a program. In BanglaBERT, utility functions include:
- Data loading and preprocessing.
- Tokenization of text.
- Computing evaluation metrics (e.g., accuracy, F1-score).
- Saving and loading models or checkpoints.
These functions make the codebase modular and easier to maintain.
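As an example, a metrics helper in the style used with Hugging Face's `Trainer` might look like this (a sketch, not the repository's exact code):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Trainer-style metric helper: logits and labels in, metric dict out."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```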
A checkpoint is a saved state of the model during training. It includes:
- Model weights and biases.
- Optimizer state.
- Learning rate scheduler state.
Purpose:
- Resume training from a specific point if interrupted.
- Restore the best-performing model for evaluation or deployment.
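A minimal PyTorch checkpointing sketch, assuming `model`, `optimizer`, `scheduler`, and `epoch` come from a surrounding training loop:

```python
import torch

# Save all states needed to resume training exactly where it stopped
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "scheduler_state_dict": scheduler.state_dict(),
    },
    "checkpoint.pt",
)

# Resume later: restore all three states before continuing training
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
scheduler.load_state_dict(ckpt["scheduler_state_dict"])
start_epoch = ckpt["epoch"] + 1
```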
Early stopping is a training strategy that halts training when the model's performance on the validation set stops improving for a predefined number of epochs. Benefits:
- Prevents overfitting.
- Saves computational resources by stopping unnecessary training.
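With Hugging Face's `Trainer`, early stopping is available out of the box via `EarlyStoppingCallback`; a sketch, assuming `model`, tokenized `train_ds`/`val_ds`, and the `compute_metrics` helper from earlier:

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    eval_strategy="epoch",        # called evaluation_strategy in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
    metric_for_best_model="f1",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
    # Stop if the F1 score fails to improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```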
`checkpoint-20` and `checkpoint-30` are checkpoints saved at the 20th and 30th epochs during training. These snapshots allow:
- Analysis of the model's performance at specific stages.
- Resumption of training from those points if needed.
The `tokenizer_config.json` file contains metadata for the tokenizer, including:
- Tokenizer type (e.g., WordPiece, Byte-Pair Encoding).
- Vocabulary size.
- Special tokens like `[PAD]`, `[CLS]`, and `[SEP]`.
The `config.json` file defines the model's architecture and settings, such as:
- Number of layers, hidden units, and attention heads.
- Training hyperparameters (e.g., learning rate, weight decay).
- Metadata about the model's purpose and source.
This file (`sentencepiece.bpe.model`, used by XLM-R) is part of the SentencePiece tokenizer and implements Byte-Pair Encoding (BPE). It contains:
- Vocabulary and subword units.
- Rules for splitting text into subwords.
The `tokenizer.json` file provides a full representation of the tokenizer, including:
- Token-to-ID mappings.
- Encoding and decoding rules.
- Additional metadata for tokenization.
The `vocab.txt` file stores the vocabulary of the tokenizer. It includes:
- Tokens, subword units, or words.
- Their corresponding indices for numerical representation.
- Special tokens like `[PAD]` and `[UNK]`.
The `model.safetensors` file stores the trained model's weights in a safer and more efficient format than the older `pytorch_model.bin`. Advantages:
- Faster loading times.
- Reduced risk of file corruption.
- Better compatibility for distributed systems.
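Saving in this format can be requested explicitly (recent `transformers` versions already default to safetensors):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("csebuetnlp/banglabert")
# safe_serialization=True writes model.safetensors instead of pytorch_model.bin
model.save_pretrained("./model", safe_serialization=True)
```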
Models can be saved using the .save_pretrained() method from the Hugging Face library. This saves:
- Model weights: Parameters learned during training.
- Configuration: Stored in `config.json`, which defines the model's architecture.
- Tokenizer: Files like `tokenizer.json`, `tokenizer_config.json`, and `vocab.txt`.
- Special files: Includes `model.safetensors` or `pytorch_model.bin` for weight storage.
Example:
model.save_pretrained("./model")
tokenizer.save_pretrained("./model")

When saving a model to the Hugging Face Hub, the following files are uploaded:
- `config.json`: Model architecture and metadata.
- `tokenizer.json`: Tokenizer rules and mappings.
- `vocab.txt`: Vocabulary of the tokenizer.
- `model.safetensors` or `pytorch_model.bin`: Trained weights of the model.
The model is then accessible via the Hugging Face hub, enabling easy sharing and reuse.
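A sketch of the upload-and-reuse round trip, continuing from the `model` and `tokenizer` in the example above (the repo id is a hypothetical placeholder; uploading requires authenticating with `huggingface-cli login` first):

```python
from transformers import pipeline

# Upload both model and tokenizer to the same (hypothetical) Hub repository
model.push_to_hub("your-username/banglabert-hate-speech")
tokenizer.push_to_hub("your-username/banglabert-hate-speech")

# Anyone can then load it back, e.g. through a text-classification pipeline
clf = pipeline("text-classification", model="your-username/banglabert-hate-speech")
print(clf("এখানে একটি বাংলা বাক্য"))
```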