This project is a machine learning-based application designed to detect and classify hate speech in text data. Utilizing Natural Language Processing (NLP) techniques and a Decision Tree classifier, the system can identify whether a given text contains hate speech, offensive language, or is neutral.
The project includes a complete machine learning pipeline from data preprocessing to model deployment using a user-friendly Streamlit web interface.
- Text Classification: Classifies text into three categories:
  - Hate Speech: Content that expresses hatred.
  - Offensive Language: Content that is offensive but not necessarily hate speech.
  - Neither: Neutral or non-offensive content.
- Interactive Web App: A Streamlit-based interface for real-time predictions.
- NLP Pipeline: Includes stemming, stop-word removal, and Count Vectorization.
- Visualizations: Provides confusion matrices and classification reports for model evaluation.
Hate_Speech_Detection/
├── data/ # Dataset files (e.g., tweets.csv)
├── ml_pipeline/ # Source code for the ML pipeline
│ ├── data_preprocessing.py # Data cleaning and preprocessing
│ ├── deploy.py # Streamlit application
│ ├── model_training.py # Model training script
│ ├── model_evaluation.py # Model evaluation script
│ ├── pipeline.py # Main pipeline runner
│ └── ...
├── model/ # Saved models (.pkl files)
├── Notebooks/ # Jupyter notebooks for exploration
├── requirements.txt # Python dependencies
├── setup.sh # Setup script
└── README.md # Project documentation
The project uses a dataset of tweets labeled for hate speech detection.
- Source: Kaggle
- Labels:
  - 0: Neither
  - 1: Hate Speech
  - 2: Offensive Language
- Language: Python
- Libraries:
  - scikit-learn: For model building and evaluation.
  - pandas & numpy: For data manipulation.
  - nltk: For Natural Language Processing tasks.
  - streamlit: For the web application.
  - matplotlib & seaborn: For data visualization.
- Clone the repository:
  git clone https://github.com/Susreel7/SocialMedia_Hate_Speech_Detection.git
  cd SocialMedia_Hate_Speech_Detection
- Create a virtual environment (optional but recommended):
  python -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
To launch the interactive Streamlit application:
streamlit run ml_pipeline/deploy.py
The app will open in your browser at http://localhost:8501.
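For orientation, a Streamlit prediction page of this kind might look roughly like the sketch below. It is not the contents of ml_pipeline/deploy.py. It assumes the saved artifact is a single pickled object whose predict method accepts a list of raw strings, and the file name model.pkl, the helper names, and the widget labels are all hypothetical. The label codes follow the dataset description in this README.

```python
import pickle
from pathlib import Path

# Label codes as listed in this README's dataset description.
LABEL_NAMES = {0: "Neither", 1: "Hate Speech", 2: "Offensive Language"}

# Hypothetical file name; the README only states that model/ holds .pkl files.
MODEL_PATH = Path("model") / "model.pkl"


def load_model(path: Path = MODEL_PATH):
    """Load a pickled model (assumed to accept raw text in predict)."""
    with path.open("rb") as fh:
        return pickle.load(fh)


def predict_label(model, text: str) -> str:
    """Return the human-readable class name for one piece of text."""
    code = int(model.predict([text])[0])
    return LABEL_NAMES[code]


def run_app() -> None:
    """Render a one-field prediction page."""
    # Imported here so the helpers above can be used without Streamlit installed.
    import streamlit as st

    st.title("Hate Speech Detection")
    text = st.text_area("Enter text to classify")
    if st.button("Predict") and text.strip():
        st.write(predict_label(load_model(), text))
```

In a real script, run_app() would be called at module level so that streamlit run executes it. It is left uncalled here so the helpers can be imported and tested on their own.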
If you want to retrain the model with new data:
- Place your dataset in the data/ directory.
- Run the pipeline script:
  python ml_pipeline/pipeline.py
This will preprocess the data, train the model, evaluate it, and save the new artifacts in the model/ directory.
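The train, evaluate, and save sequence described above can be illustrated with a small sketch. It assumes a scikit-learn Pipeline that chains CountVectorizer with DecisionTreeClassifier and is serialized with pickle. The toy sentences, the model.pkl file name, and the temporary output directory are placeholders, not the project's actual data or artifact layout.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy placeholder data; labels follow the README: 0 Neither, 1 Hate Speech, 2 Offensive.
texts = [
    "have a great day everyone",
    "the weather is lovely today",
    "i hate those people so much",
    "those people should all disappear",
    "you are such an idiot",
    "shut up you fool",
]
labels = [0, 0, 1, 1, 2, 2]

# Train: vectorize the text and fit the decision tree in one pipeline object.
model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, labels)

# Evaluate: scored on the training data here purely as a smoke test;
# a real run would score a held-out split instead.
train_accuracy = accuracy_score(labels, model.predict(texts))

# Save: written to a temporary directory so the sketch has no side effects.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "model.pkl"
with model_path.open("wb") as fh:
    pickle.dump(model, fh)

# Reload to confirm the saved artifact gives the same predictions.
with model_path.open("rb") as fh:
    restored = pickle.load(fh)

print(f"training accuracy: {train_accuracy:.2f}")
print(f"saved to: {model_path}")
```

Pickling the whole pipeline, rather than the classifier alone, keeps the fitted vocabulary together with the tree, so the loaded object can take raw text directly.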
The model is evaluated using standard metrics:
- Accuracy: ~88%
- Precision, Recall, F1-Score: Detailed reports are generated during training.
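To make the metrics concrete, the sketch below derives accuracy and per-class precision, recall, and F1 from a confusion matrix using only the standard library. The matrix values are invented for illustration and are not results from this project. The per_class_metrics helper is hypothetical, and scikit-learn's classification_report is the usual way to produce such a report in practice.

```python
def per_class_metrics(confusion, class_names):
    """Compute accuracy and per-class precision/recall/F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(class_names)))
    report = {}
    for i, name in enumerate(class_names):
        true_pos = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report


# Invented counts, for illustration only.
class_names = ["Neither", "Hate Speech", "Offensive Language"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

accuracy, report = per_class_metrics(confusion, class_names)
print(f"accuracy: {accuracy:.3f}")
for name, scores in report.items():
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

Per-class scores matter here because a single accuracy figure can hide weak recall on a minority class, which is common when one label is much rarer than the others.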
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature/YourFeature).
- Commit your changes (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/YourFeature).
- Open a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset provided by Kaggle.
- Inspiration from various NLP research papers on hate speech detection.