- input: a text in Persian
- output: classification of the text as ironic or non-ironic
Existing datasets:
- Persian manually labeled dataset: MirasIrony
- Persian automatically labeled dataset: Persian Irony Detection
Creating a new dataset (crawling Persian tweets from a channel on Telegram and automatically labeling them):
- crawling: crawl public channels' messages on Telegram using the Telegram server API in `crawling.py`; crawled messages are saved as JSON files in `./crawled_messages`
- gathering: concatenate the crawled files, keep the wanted attributes of each tweet in a Pandas DataFrame, and save it as a CSV file; `gathering.py` creates `messages.csv`
- cleaning: basic cleaning of the previously created dataset, saved to `messages_cleaned.csv`
- labeling: assign a label to each tweet based on its top-2 most common reactions and split the dataset into train and test sets; the resulting files are saved in `../dataset/`
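As an illustration of the cleaning step, here is a minimal sketch. The exact rules in `cleaning.py` are assumptions: stripping URLs and mentions, collapsing whitespace, and unifying Arabic/Persian character variants are common choices for Persian social-media text.

```python
# Minimal text-cleaning sketch (assumed rules; cleaning.py may differ).
import re

# Map Arabic character variants to their Persian forms (common normalization).
CHAR_MAP = str.maketrans({"ي": "ی", "ك": "ک"})

def clean_text(text: str) -> str:
    text = text.translate(CHAR_MAP)
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"@\w+", " ", text)           # strip @mentions
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text
```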
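The labeling and splitting steps above can be sketched as follows. The reaction-to-label mapping is an assumption for illustration (here, a laughing emoji among the top-2 reactions marks a message as ironic); `labeling.py` defines the actual rule.

```python
# Sketch of reaction-based labeling (assumed mapping) and a train/test split.
import random
from collections import Counter

IRONY_REACTIONS = {"😂", "🤣"}  # hypothetical irony-signalling reactions

def label_message(reaction_counts: Counter) -> int:
    """Return 1 (ironic) if one of the top-2 most common reactions is an
    irony-signalling emoji, else 0 (non-ironic)."""
    top2 = {emoji for emoji, _ in reaction_counts.most_common(2)}
    return int(bool(top2 & IRONY_REACTIONS))

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle rows deterministically and split into train and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]
```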
Run the pipeline (the previous dataset will be replaced):

```shell
cd creating_dataset/
pip install -r requirements.txt
python crawling.py
python gathering.py
python cleaning.py
python labeling.py
```

Finetuning an uncased language model on the Persian irony detection dataset
```shell
cd model/
pip install -r requirements.txt
```

Finetune a transformer-based language model on the irony detection dataset:
```shell
python train.py --datapath [path to dataset] --modelpath [path to transformer-based language model] --savemodel [path to save the finetuned model] --maxlen [maximum sequence length] --batch [batch size] --epoch [epochs] --lr [learning rate]

# example
python train.py --datapath ../dataset/ --modelpath xlm-roberta-base --batch 16 --epoch 5
```

Predict labels using the trained model:
```shell
python predict.py --datapath [path to dataset] --modelpath [path to transformer-based language model] --predspath [path for predictions on the test set] --maxlen [maximum sequence length] --batch [batch size]

# example
python predict.py --datapath ../dataset/ --modelpath xlm-roberta-base --predspath runs/preds
```

Comparison of different finetuned language models on the Persian dataset:
| Language Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| ParsBERT v3 | 81.3% | 81.4% | 81.3% | 81.3% |
| XLM-RoBERTa-Base | 82.6% | 82.8% | 82.6% | 82.5% |
| XLM-RoBERTa-Large | 84.7% | 84.7% | 84.6% | 84.6% |
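For reference, metrics like those in the table can be computed from gold labels and predictions in plain Python. This sketch assumes the reported recall, precision, and F1 are macro-averaged over the two classes, which may differ from how the scores above were produced.

```python
# Accuracy plus macro-averaged recall, precision, and F1 for binary labels.
def classification_metrics(gold, pred, labels=(0, 1)):
    n = len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / n
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(labels)
    # Return in the table's column order: Accuracy, Recall, Precision, F1.
    return accuracy, sum(recalls) / k, sum(precisions) / k, sum(f1s) / k
```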