Keyphrase Extraction with TextRank (Indonesian)

An automatic keyphrase extraction system for Indonesian text using the TextRank algorithm. The project provides a REST API for extraction and a web interface for evaluating the algorithm's performance.

📖 Description

TextRank is a graph-based ranking algorithm inspired by Google's PageRank for extracting important keywords from documents. This system is optimized for Indonesian text with:

- POS tagging with a CRF (Conditional Random Fields) model for Indonesian
- Stopword removal using an Indonesian stopword list
- Multi-word keyphrase extraction, which combines adjacent ranked words into phrases

🏗️ Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Input Text    │ ──▶ │  Preprocessing   │ ──▶ │    TextRank     │
│  (Indonesian)   │     │  - Tokenization  │     │  - Build Graph  │
│                 │     │  - Stopwords     │     │  - PageRank     │
│                 │     │  - POS Tagging   │     │  - Ranking      │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
                                                 ┌─────────────────┐
                                                 │   Keyphrases    │
                                                 │  ["machine      │
                                                 │   learning",    │
                                                 │   "algorithm"]  │
                                                 └─────────────────┘

📁 Project Structure

keyphrase-extraction-with-textrank/
├── app.py                 # REST API endpoint
├── app-main.py           # Web interface for evaluation
├── textrank.py           # TextRank algorithm implementation
├── preprocessing.py      # Indonesian text preprocessing
├── data.py               # Evaluation dataset (Indonesian journals)
├── requirements.txt      # Dependencies
├── runtime.txt           # Python version (Heroku)
├── Procfile             # Heroku deployment config
├── POS_Tagger/
│   └── all_indo_man_tag_corpus_model.crf.tagger  # CRF Model
├── stopwords/
│   └── stopword.txt      # Indonesian stopword list
├── templates/
│   └── index.html        # Evaluation template
└── static/
    └── styles/
        └── styles.css    # Stylesheet

🚀 Installation

Prerequisites

Python 3.9+
pip (Python package manager)

Setup

Clone the repository

git clone https://github.com/username/keyphrase-extraction-with-textrank.git
cd keyphrase-extraction-with-textrank

Create a virtual environment (optional but recommended)

python -m venv venv
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

Install dependencies
```
pip install -r requirements.txt
```

Download NLTK data

python -c "import nltk; nltk.download('punkt')"

💻 Usage

REST API

Run the API server:

python app.py

The server listens at http://localhost:5005.

Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/modeltextrank` | API info |
| POST | `/modeltextrank` | Extract 10 keyphrases |
| POST | `/modeltextrank/<n>` | Extract n keyphrases |
| GET | `/health` | Health check |

Example Request

curl -X POST http://localhost:5005/modeltextrank \
  -H "Content-Type: application/json" \
  -d '{"abstract": "Machine learning adalah cabang dari kecerdasan buatan yang fokus pada pengembangan sistem yang dapat belajar dari data."}'

Example Response

{
  "keyphrases": [
    "machine learning",
    "kecerdasan buatan",
    "sistem",
    "data",
    "pengembangan"
  ],
  "count": 5
}
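
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server is running locally on port 5005, as described above. The helper names `build_request` and `extract_keyphrases` are illustrative and not part of this project.

```python
import json
from urllib import request

# Assumed base URL, taken from the "REST API" section above.
API_URL = "http://localhost:5005/modeltextrank"


def build_request(abstract, n=None, base_url=API_URL):
    """Build a POST request; if n is given, target /modeltextrank/<n>."""
    url = f"{base_url}/{n}" if n is not None else base_url
    payload = json.dumps({"abstract": abstract}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_keyphrases(abstract, n=None):
    """Send the request and return the "keyphrases" list from the JSON reply."""
    with request.urlopen(build_request(abstract, n)) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["keyphrases"]
```

With the API server running, `extract_keyphrases("Machine learning adalah cabang dari kecerdasan buatan.", n=5)` would return the list shown under `keyphrases` in the response format above.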

Web Interface (Evaluation)

Run the evaluation application:

python app-main.py

Open http://localhost:5005 in your browser to view the evaluation results, reported with these metrics:

- Precision: share of extracted keyphrases that match the ground-truth keywords
- Recall: share of ground-truth keywords that were extracted
- F1-Score: harmonic mean of precision and recall

📊 TextRank Algorithm

Steps

1. Preprocessing
   - Text tokenization
   - Stopword removal
   - POS tagging (filter: Noun, Adjective, Foreign Word)
2. Building the Graph
   - Nodes: candidate words
   - Edges: co-occurrence within a window (default: 4 words)
3. PageRank Iteration
   ```
   PR(Vi) = (1-d) + d × Σ PR(Vj) / |Out(Vj)|
   ```
   - The sum runs over every node Vj that has an edge to Vi; |Out(Vj)| is the number of edges leaving Vj
   - d = damping factor (0.85)
   - Iterate until convergence
4. Ranking & Extraction
   - Sort by PageRank score
   - Combine adjacent words for multi-word phrases
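
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `textrank.py`: it assumes an undirected co-occurrence graph, a window that slides over the already-filtered candidate words, and an initial score of 1.0 per node. The function names `build_graph`, `pagerank`, and `merge_adjacent` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(words, window_size=4):
    """Link every pair of distinct words that share a sliding window."""
    graph = defaultdict(set)
    for start in range(len(words)):
        window = words[start:start + window_size]
        for a, b in combinations(window, 2):
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def pagerank(graph, damping_factor=0.85, max_iterations=100,
             convergence_threshold=1e-5):
    """Iterate PR(Vi) = (1 - d) + d * sum(PR(Vj) / |Out(Vj)|)."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iterations):
        new_scores = {}
        for node in graph:
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping_factor) + damping_factor * incoming
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < convergence_threshold:
            break  # scores changed less than the threshold: converged
    return scores


def merge_adjacent(tokens, top_words):
    """Join runs of consecutive top-ranked words, in original token order."""
    phrases, current = [], []
    for token in list(tokens) + [None]:  # None is a sentinel that ends the last run
        if token in top_words:
            current.append(token)
            continue
        if current:
            phrase = " ".join(current)
            if phrase not in phrases:
                phrases.append(phrase)
        current = []
    return phrases
```

A typical flow would call `build_graph` on the filtered candidates, pass the result to `pagerank`, keep the highest-scoring words as `top_words`, and then call `merge_adjacent` on the original token sequence so that neighbouring words such as "machine" and "learning" become a single phrase.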

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `damping_factor` | 0.85 | PageRank damping factor |
| `window_size` | 4 | Co-occurrence window size |
| `max_iterations` | 100 | Maximum PageRank iterations |
| `convergence_threshold` | 1e-5 | Convergence threshold |

🔧 Configuration

Modifying TextRank Parameters

from textrank import TextRank

# Customize parameters
tr = TextRank(
    damping_factor=0.85,
    window_size=5,
    max_iterations=200
)

Adding Stopwords

Edit the stopwords/stopword.txt file and add new words (one word per line).
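
To illustrate the one-word-per-line format, here is a minimal sketch of how such a file could be loaded and applied. It is not taken from `preprocessing.py`; the function names and the case-insensitive matching are assumptions.

```python
from pathlib import Path


def load_stopwords(path="stopwords/stopword.txt"):
    """Read one stopword per line, skipping blank lines and lowercasing."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]
```

Because the file is read into a set, duplicate entries are harmless and lookups stay fast even for a long list.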

📈 Evaluation

The evaluation dataset contains 14 Indonesian journal abstracts with manual keywords as ground truth.

Running Evaluation

python app-main.py

Computed Metrics

- True Positive (TP): Keyphrases matching journal keywords
- False Positive (FP): Extracted keyphrases not in journal keywords
- False Negative (FN): Journal keywords not successfully extracted
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
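
As a worked illustration of these formulas, the sketch below computes the three metrics for one document. It assumes exact, case-insensitive string matching between extracted keyphrases and journal keywords, which may differ from the matching rule used in `app-main.py`; the function name `evaluate` is hypothetical.

```python
def evaluate(extracted, ground_truth):
    """Return (precision, recall, f1) for one document."""
    predicted = {p.strip().lower() for p in extracted}
    gold = {g.strip().lower() for g in ground_truth}

    tp = len(predicted & gold)   # extracted and in the journal keywords
    fp = len(predicted - gold)   # extracted but not in the journal keywords
    fn = len(gold - predicted)   # journal keywords that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with five extracted keyphrases of which two appear in a made-up list of three journal keywords, TP = 2, FP = 3, and FN = 1, giving a precision of 0.4, a recall of about 0.67, and an F1-score of 0.5.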

🚢 Deployment

Heroku

The project is already configured for Heroku deployment through its Procfile and runtime.txt:

heroku create your-app-name
git push heroku main

Docker (Optional)

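FROM python:3.9-slim

WORKDIR /app
COPY . .
# Assumes gunicorn is listed in requirements.txt
RUN pip install -r requirements.txt
# Mirror the setup step: fetch the NLTK tokenizer data at build time
RUN python -c "import nltk; nltk.download('punkt')"

EXPOSE 5005
CMD ["gunicorn", "-b", "0.0.0.0:5005", "app:app"]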

🛠️ Development

Running Tests

# TODO: Add unit tests
python -m pytest tests/

Code Style

This project follows the PEP 8 style guide. Use:

pip install black flake8
black .
flake8 .

📚 References

Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Text. EMNLP.
Indonesian POS Tagger
NLTK Documentation

📄 License

MIT License - Feel free to use and modify as needed.

👤 Author

Fadel Muhammad

GitHub: @fadelmuhammad

🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Create a Pull Request

Note: This project was developed for research and educational purposes in the field of Natural Language Processing (NLP) for the Indonesian language.
