- input: a text in Persian
- output: classification of the text as ironic or non-ironic
Existing datasets:
- Persian manually labeled dataset: MirasIrony
- Persian automatically labeled dataset: Persian Irony Detection
Creating a new dataset (crawling Persian tweets from a channel on Telegram and automatically labeling them):
- crawling: crawl public channels' messages on Telegram using the Telegram server API in `crawling.py`; crawled messages are saved as JSON files in `./crawled_messages`
- gathering: concatenate the crawled files, keep the wanted attributes of each tweet in a Pandas DataFrame, and save it as a CSV file; `gathering.py` creates `messages.csv`
- cleaning: basic cleaning of the previously created dataset, saved to `messages_cleaned.csv`
- labeling: assign a label to each tweet based on its top-2 most common reactions and split the dataset into train and test sets; the resulting files are saved in `../dataset/`
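As an illustration of the cleaning step, here is a minimal sketch. The exact rules in `cleaning.py` are assumptions: stripping URLs and mentions, collapsing whitespace, and unifying Arabic/Persian character variants are common choices for Persian social-media text.

```python
# Minimal text-cleaning sketch (assumed rules; cleaning.py may differ).
import re

# Map Arabic character variants to their Persian forms (common normalization).
CHAR_MAP = str.maketrans({"ي": "ی", "ك": "ک"})

def clean_text(text: str) -> str:
    text = text.translate(CHAR_MAP)
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"@\w+", " ", text)           # strip @mentions
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text
```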
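The labeling and splitting steps above can be sketched as follows. The reaction-to-label mapping is an assumption for illustration (here, a laughing emoji among the top-2 reactions marks a message as ironic); `labeling.py` defines the actual rule.

```python
# Sketch of reaction-based labeling (assumed mapping) and a train/test split.
import random
from collections import Counter

IRONY_REACTIONS = {"😂", "🤣"}  # hypothetical irony-signalling reactions

def label_message(reaction_counts: Counter) -> int:
    """Return 1 (ironic) if one of the top-2 most common reactions is an
    irony-signalling emoji, else 0 (non-ironic)."""
    top2 = {emoji for emoji, _ in reaction_counts.most_common(2)}
    return int(bool(top2 & IRONY_REACTIONS))

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle rows deterministically and split into train and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]
```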
Run the pipeline (the previous dataset will be replaced):

```shell
cd creating_dataset/
pip install -r requirements.txt
python crawling.py
python gathering.py
python cleaning.py
python labeling.py
```

Finetuning an uncased language model on the Persian irony detection dataset
```shell
cd model/
pip install -r requirements.txt
```

Finetune a transformer-based language model on the irony detection dataset:
```shell
python train.py --datapath [path to dataset] --modelpath [path to transformer-based language model] --savemodel [path to save the finetuned model] --maxlen [maximum sequence length] --batch [batch size] --epoch [epochs] --lr [learning rate]

# example
python train.py --datapath ../dataset/ --modelpath xlm-roberta-base --batch 16 --epoch 5
```

Predict labels using the trained model:
```shell
python predict.py --datapath [path to dataset] --modelpath [path to transformer-based language model] --predspath [path for predictions on the test set] --maxlen [maximum sequence length] --batch [batch size]

# example
python predict.py --datapath ../dataset/ --modelpath xlm-roberta-base --predspath runs/preds
```

Comparison of different finetuned language models on the Persian dataset:
| Language Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| ParsBERT v3 | 81.3% | 81.4% | 81.3% | 81.3% |
| XLM-RoBERTa-Base | 82.6% | 82.8% | 82.6% | 82.5% |
| XLM-RoBERTa-Large | 84.7% | 84.7% | 84.6% | 84.6% |
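For reference, metrics like those in the table can be computed from gold labels and predictions in plain Python. This sketch assumes the reported recall, precision, and F1 are macro-averaged over the two classes, which may differ from how the scores above were produced.

```python
# Accuracy plus macro-averaged recall, precision, and F1 for binary labels.
def classification_metrics(gold, pred, labels=(0, 1)):
    n = len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / n
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(labels)
    # Return in the table's column order: Accuracy, Recall, Precision, F1.
    return accuracy, sum(recalls) / k, sum(precisions) / k, sum(f1s) / k
```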