An automatic keyphrase extraction system for Indonesian text using the TextRank algorithm. The project provides a REST API for keyphrase extraction and a web interface for evaluating the algorithm's performance.
TextRank is a graph-based ranking algorithm, inspired by Google's PageRank, that extracts important keywords from documents. This system is optimized for Indonesian text with:
- POS tagging with a CRF (Conditional Random Fields) model for Indonesian
- Stopword removal using an Indonesian stopword list
- Multi-word keyphrase extraction that combines adjacent words into phrases

    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │   Input Text    │ ──▶ │  Preprocessing   │ ──▶ │    TextRank     │
    │  (Indonesian)   │     │  - Tokenization  │     │  - Build Graph  │
    │                 │     │  - Stopwords     │     │  - PageRank     │
    │                 │     │  - POS Tagging   │     │  - Ranking      │
    └─────────────────┘     └──────────────────┘     └─────────────────┘
                                                              │
                                                              ▼
                                                     ┌─────────────────┐
                                                     │   Keyphrases    │
                                                     │   ["machine     │
                                                     │    learning",   │
                                                     │   "algorithm"]  │
                                                     └─────────────────┘

keyphrase-extraction-with-textrank/
├── app.py # REST API endpoint
├── app-main.py # Web interface for evaluation
├── textrank.py # TextRank algorithm implementation
├── preprocessing.py # Indonesian text preprocessing
├── data.py # Evaluation dataset (Indonesian journals)
├── requirements.txt # Dependencies
├── runtime.txt # Python version (Heroku)
├── Procfile # Heroku deployment config
├── POS_Tagger/
│ └── all_indo_man_tag_corpus_model.crf.tagger # CRF Model
├── stopwords/
│ └── stopword.txt # Indonesian stopword list
├── templates/
│ └── index.html # Evaluation template
└── static/
    └── styles/
        └── styles.css # Stylesheet
- Python 3.9+
- pip (Python package manager)

- Clone the repository

      git clone https://github.com/username/keyphrase-extraction-with-textrank.git
      cd keyphrase-extraction-with-textrank

- Create a virtual environment (optional but recommended)

      python -m venv venv
      source venv/bin/activate   # Linux/macOS
      # or
      venv\Scripts\activate      # Windows

- Install dependencies

      pip install -r requirements.txt

- Download NLTK data

      python -c "import nltk; nltk.download('punkt')"

Run the API server:

    python app.py

Server will run at http://localhost:5005
| Method | Endpoint | Description |
|---|---|---|
| GET | `/modeltextrank` | API info |
| POST | `/modeltextrank` | Extract 10 keyphrases |
| POST | `/modeltextrank/<n>` | Extract n keyphrases |
| GET | `/health` | Health check |
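
The POST endpoints in the table above can also be called from Python. The following is a minimal sketch using only the standard library. It assumes the API server is running locally on port 5005 and accepts a JSON body with an `abstract` field, as in the curl example; the helper names `build_request` and `extract_keyphrases` are hypothetical and not part of this project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5005"  # assumption: local server started with app.py


def build_request(abstract, n=None):
    """Build a POST request for /modeltextrank or /modeltextrank/<n> (hypothetical helper)."""
    path = "/modeltextrank" if n is None else f"/modeltextrank/{n}"
    body = json.dumps({"abstract": abstract}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_keyphrases(abstract, n=None, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(abstract, n)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server to be running):
# result = extract_keyphrases("Machine learning adalah cabang dari kecerdasan buatan.", n=5)
# print(result["keyphrases"])
```

Splitting request construction from sending keeps the first part testable without a running server.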

    curl -X POST http://localhost:5005/modeltextrank \
      -H "Content-Type: application/json" \
      -d '{"abstract": "Machine learning adalah cabang dari kecerdasan buatan yang fokus pada pengembangan sistem yang dapat belajar dari data."}'

Response:

    {
      "keyphrases": [
        "machine learning",
        "kecerdasan buatan",
        "sistem",
        "data",
        "pengembangan"
      ],
      "count": 5
    }

Run the evaluation application:

    python app-main.py

Open http://localhost:5005 in your browser to view evaluation results with metrics:
- Precision: Share of extracted keyphrases that match the journal keywords
- Recall: Share of journal keywords that were extracted
- F1-Score: Harmonic mean of precision and recall

- Preprocessing
  - Text tokenization
  - Stopword removal
  - POS tagging (filter: Noun, Adjective, Foreign Word)
- Building the Graph
  - Nodes: candidate words
  - Edges: co-occurrence within window (default: 4 words)
- PageRank Iteration

      PR(Vi) = (1 - d) + d × Σ PR(Vj) / |Out(Vj)|

  - d = damping factor (0.85)
  - Iterate until convergence
- Ranking & Extraction
  - Sort by PageRank score
  - Combine adjacent words for multi-word phrases

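
The graph, ranking and phrase-merging steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the code in `textrank.py`: the input is assumed to be already tokenized and filtered, the co-occurrence graph is treated as undirected (so |Out(Vj)| is the number of neighbours of Vj), and the function names `build_graph`, `pagerank` and `merge_adjacent` are hypothetical.

```python
from collections import defaultdict


def build_graph(tokens, window_size=4):
    """Link distinct words that appear within `window_size` consecutive tokens.

    Assumption: edges are undirected and unweighted.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window_size]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, d=0.85, max_iterations=100, threshold=1e-5):
    """Iterate PR(Vi) = (1 - d) + d * sum(PR(Vj) / |Out(Vj)|) over neighbours Vj."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iterations):
        new_scores = {
            node: (1 - d) + d * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
        # Stop once no score moved by more than the threshold.
        delta = max(abs(new_scores[node] - scores[node]) for node in graph)
        scores = new_scores
        if delta < threshold:
            break
    return scores


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keyword tokens into multi-word phrases."""
    phrases, current = [], []
    for token in list(tokens) + [None]:  # the None sentinel flushes the last run
        if token in keywords:
            current.append(token)
            continue
        if current:
            phrase = " ".join(current)
            if phrase not in phrases:
                phrases.append(phrase)
        current = []
    return phrases
```

A possible usage, with a hypothetical token list: compute `scores = pagerank(build_graph(tokens))`, keep the highest-scoring words with `sorted(scores, key=scores.get, reverse=True)[:k]`, and pass that set to `merge_adjacent(tokens, top_words)` to obtain phrases such as "machine learning".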
| Parameter | Default | Description |
|---|---|---|
| `damping_factor` | 0.85 | PageRank damping factor |
| `window_size` | 4 | Co-occurrence window size |
| `max_iterations` | 100 | Maximum PageRank iterations |
| `convergence_threshold` | 1e-5 | Convergence threshold |

    from textrank import TextRank

    # Customize parameters
    tr = TextRank(
        damping_factor=0.85,
        window_size=5,
        max_iterations=200
    )

Edit the `stopwords/stopword.txt` file and add new words (one word per line).
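
Because the stopword file holds one word per line, loading and applying it takes only a few lines. The sketch below is an assumption about how such a file could be consumed, not the code in `preprocessing.py`; `load_stopwords` and `remove_stopwords` are hypothetical names, and lowercasing entries is a choice made here for illustration.

```python
from pathlib import Path


def load_stopwords(path="stopwords/stopword.txt"):
    """Read one stopword per line, skipping blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Assumption: matching is case-insensitive, so entries are lowercased.
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]
```

Using a set keeps each membership check constant-time, which matters once the list grows to hundreds of entries.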
The evaluation dataset contains 14 Indonesian journal abstracts with manual keywords as ground truth.

    python app-main.py

- True Positive (TP): Keyphrases matching journal keywords
- False Positive (FP): Extracted keyphrases not in journal keywords
- False Negative (FN): Journal keywords not successfully extracted
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
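
The formulas above translate directly into code. The following sketch assumes exact, case-insensitive string matching between extracted keyphrases and journal keywords, which may differ from the matching rule used in `app-main.py`; `evaluate` is a hypothetical helper.

```python
def evaluate(extracted, gold):
    """Return (precision, recall, f1) for one document.

    Assumption: a keyphrase counts as a match only if it equals a journal
    keyword after trimming whitespace and lowercasing.
    """
    extracted_set = {phrase.strip().lower() for phrase in extracted}
    gold_set = {phrase.strip().lower() for phrase in gold}

    tp = len(extracted_set & gold_set)  # extracted and in journal keywords
    fp = len(extracted_set - gold_set)  # extracted but not in journal keywords
    fn = len(gold_set - extracted_set)  # journal keywords that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


precision, recall, f1 = evaluate(
    ["machine learning", "data", "sistem", "algoritma"],
    ["machine learning", "kecerdasan buatan", "data"],
)
# TP = 2, FP = 2, FN = 1, so precision = 1/2, recall = 2/3, F1 = 4/7
```

The zero checks avoid a division by zero when nothing is extracted or when no keyword matches.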
The project is already configured for Heroku deployment:

    heroku create your-app-name
    git push heroku main

Example Dockerfile:

    FROM python:3.9-slim
    WORKDIR /app
    COPY . .
    RUN pip install -r requirements.txt
    EXPOSE 5005
    CMD ["gunicorn", "-b", "0.0.0.0:5005", "app:app"]

Run the test suite:

    # TODO: Add unit tests
    python -m pytest tests/

This project follows the PEP 8 style guide. Use:

    pip install black flake8
    black .
    flake8 .

- Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Texts. EMNLP.
- Indonesian POS Tagger
- NLTK Documentation
MIT License - Feel free to use and modify as needed.
Fadel Muhammad
- GitHub: @fadelmuhammad
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Create a Pull Request
Note: This project was developed for research and educational purposes in the field of Natural Language Processing (NLP) for the Indonesian language.