fadelmvi/keyphrase-extraction-with-textrank

Keyphrase Extraction with TextRank (Indonesian)

An automatic keyphrase extraction system for Indonesian text based on the TextRank algorithm. The project provides a REST API for extraction and a web interface for evaluating the algorithm's performance.

📖 Description

TextRank is a graph-based ranking algorithm, inspired by Google's PageRank, that extracts important keywords from documents. This system adapts it to Indonesian text with:

  • POS tagging with a CRF (Conditional Random Fields) model trained for Indonesian
  • Stopword removal using an Indonesian stopword list
  • Multi-word keyphrase extraction, so that phrases are returned as well as single words
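
As a rough illustration of how the stopword and POS filters combine, the sketch below selects candidate words from tokens that have already been tagged. The function name `select_candidates` and the tag labels `NN`, `JJ`, and `FW` are assumptions made for this example; the labels actually emitted by the bundled CRF model may differ.

```python
# Hypothetical tag labels for noun, adjective and foreign word.
# The real labels depend on the CRF model's tagset.
KEEP_TAGS = {"NN", "JJ", "FW"}


def select_candidates(tagged_tokens, stopwords, keep_tags=KEEP_TAGS):
    """Keep lowercase words that carry an allowed POS tag and are not stopwords."""
    return [
        word.lower()
        for word, tag in tagged_tokens
        if tag in keep_tags and word.lower() not in stopwords
    ]


# Made-up tagged input for illustration only.
tagged = [("Machine", "FW"), ("learning", "FW"), ("adalah", "VB"),
          ("cabang", "NN"), ("dari", "IN")]
print(select_candidates(tagged, stopwords={"dari", "yang"}))
```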

🏗️ Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Input Text    │ ──▶ │  Preprocessing   │ ──▶ │    TextRank     │
│  (Indonesian)   │     │  - Tokenization  │     │  - Build Graph  │
│                 │     │  - Stopwords     │     │  - PageRank     │
│                 │     │  - POS Tagging   │     │  - Ranking      │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
                                                 ┌─────────────────┐
                                                 │   Keyphrases    │
                                                 │  ["machine      │
                                                 │   learning",    │
                                                 │   "algorithm"]  │
                                                 └─────────────────┘

📁 Project Structure

keyphrase-extraction-with-textrank/
├── app.py                # REST API endpoint
├── app-main.py           # Web interface for evaluation
├── textrank.py           # TextRank algorithm implementation
├── preprocessing.py      # Indonesian text preprocessing
├── data.py               # Evaluation dataset (Indonesian journals)
├── requirements.txt      # Dependencies
├── runtime.txt           # Python version (Heroku)
├── Procfile              # Heroku deployment config
├── POS_Tagger/
│   └── all_indo_man_tag_corpus_model.crf.tagger  # CRF Model
├── stopwords/
│   └── stopword.txt      # Indonesian stopword list
├── templates/
│   └── index.html        # Evaluation template
└── static/
    └── styles/
        └── styles.css    # Stylesheet

🚀 Installation

Prerequisites

  • Python 3.9+
  • pip (Python package manager)

Setup

  1. Clone the repository

    git clone https://github.com/fadelmvi/keyphrase-extraction-with-textrank.git
    cd keyphrase-extraction-with-textrank
  2. Create a virtual environment (optional but recommended)

    python -m venv venv
    source venv/bin/activate  # Linux/macOS
    # or
    venv\Scripts\activate     # Windows
  3. Install dependencies

    pip install -r requirements.txt
  4. Download NLTK data

    python -c "import nltk; nltk.download('punkt')"

💻 Usage

REST API

Run the API server:

python app.py

The server runs at http://localhost:5005.

Endpoints

Method  Endpoint            Description
GET     /modeltextrank      API info
POST    /modeltextrank      Extract 10 keyphrases
POST    /modeltextrank/<n>  Extract n keyphrases
GET     /health             Health check

Example Request

curl -X POST http://localhost:5005/modeltextrank \
  -H "Content-Type: application/json" \
  -d '{"abstract": "Machine learning adalah cabang dari kecerdasan buatan yang fokus pada pengembangan sistem yang dapat belajar dari data."}'

Example Response

{
  "keyphrases": [
    "machine learning",
    "kecerdasan buatan",
    "sistem",
    "data",
    "pengembangan"
  ],
  "count": 5
}
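
The same request can be issued from Python. The sketch below uses only the standard library. `build_request` and `extract_keyphrases` are illustrative helper names, not part of this repository, and the base URL assumes the server is running locally on port 5005.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5005"  # assumed local server address


def build_request(abstract, n=None, base_url=BASE_URL):
    """Build a POST request for /modeltextrank or /modeltextrank/<n>."""
    path = "/modeltextrank" if n is None else f"/modeltextrank/{n}"
    body = json.dumps({"abstract": abstract}).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_keyphrases(abstract, n=None, base_url=BASE_URL):
    """Send the request and return the "keyphrases" list from the JSON response."""
    with urllib.request.urlopen(build_request(abstract, n, base_url)) as resp:
        return json.load(resp)["keyphrases"]
```

With the API server running, `extract_keyphrases("...", n=5)` would call the `/modeltextrank/5` endpoint and return the list found under the `keyphrases` key.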

Web Interface (Evaluation)

Run the evaluation application:

python app-main.py

Open http://localhost:5005 in your browser to view the evaluation results, which report the following metrics:

  • Precision: Fraction of extracted keyphrases that match the ground-truth keywords
  • Recall: Fraction of ground-truth keywords that were extracted
  • F1-Score: Harmonic mean of precision and recall

📊 TextRank Algorithm

Steps

  1. Preprocessing

    • Text tokenization
    • Stopword removal
    • POS tagging (filter: Noun, Adjective, Foreign Word)
  2. Building the Graph

    • Nodes: candidate words
    • Edges: co-occurrence within window (default: 4 words)
  3. PageRank Iteration

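    PR(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} PR(Vj) / |Out(Vj)|
    
    • d = damping factor (0.85)
    • In(Vi) = nodes that link to Vi; Out(Vj) = nodes that Vj links to
    • Iterate until the scores converge or the maximum number of iterations is reached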
  4. Ranking & Extraction

    • Sort by PageRank score
    • Combine adjacent words for multi-word phrases
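
The steps above can be sketched in plain Python. This is a minimal illustration, not the code in textrank.py: it assumes the input is an already preprocessed list of candidate words, treats the graph as undirected and unweighted, and judges adjacency on the filtered word list for brevity. The function names `build_graph`, `pagerank`, and `merge_adjacent` are hypothetical.

```python
from collections import defaultdict


def build_graph(words, window_size=4):
    """Link distinct words that co-occur within a window of `window_size` words."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        # The window spans the current word plus the next window_size - 1 words.
        for other in words[i + 1 : i + window_size]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, d=0.85, max_iterations=100, threshold=1e-5):
    """Iterate PR(Vi) = (1 - d) + d * sum(PR(Vj) / |Out(Vj)|) until stable."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iterations):
        new_scores = {
            node: (1 - d) + d * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
        delta = max(abs(new_scores[node] - scores[node]) for node in graph)
        scores = new_scores
        if delta < threshold:  # largest change is small enough: converged
            break
    return scores


def merge_adjacent(words, top_words):
    """Join runs of consecutive top-ranked words into multi-word phrases."""
    phrases, current = [], []
    for word in words + [None]:  # the sentinel flushes the final run
        if word in top_words:
            current.append(word)
        elif current:
            phrase = " ".join(current)
            if phrase not in phrases:
                phrases.append(phrase)
            current = []
    return phrases


# Made-up candidate list for illustration only.
words = ["machine", "learning", "cabang", "kecerdasan", "buatan", "sistem", "data"]
scores = pagerank(build_graph(words))
top = set(sorted(scores, key=scores.get, reverse=True)[:3])
print(merge_adjacent(words, top))
```

In an undirected graph every neighbour is both an incoming and an outgoing link, which is why the sketch divides each neighbour's score by its degree.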

Parameters

Parameter              Default  Description
damping_factor         0.85     PageRank damping factor
window_size            4        Co-occurrence window size
max_iterations         100      Maximum PageRank iterations
convergence_threshold  1e-5     Convergence threshold

🔧 Configuration

Modifying TextRank Parameters

from textrank import TextRank

# Customize parameters
tr = TextRank(
    damping_factor=0.85,
    window_size=5,
    max_iterations=200
)

Adding Stopwords

Edit the stopwords/stopword.txt file and add new words (one word per line).
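
As a rough sketch of how a one-word-per-line list could be consumed, the helpers below parse the file content into a set and filter a token list. The function names are hypothetical, and the actual loading code in preprocessing.py may differ.

```python
def parse_stopwords(text):
    """Return the set of lowercase stopwords, ignoring blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def load_stopwords(path="stopwords/stopword.txt"):
    """Read the stopword file (one word per line) into a set."""
    with open(path, encoding="utf-8") as handle:
        return parse_stopwords(handle.read())


def remove_stopwords(tokens, stopwords):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]
```

Storing the words in a set keeps each membership check constant-time, which matters when the list is long and the filter runs on every token.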

📈 Evaluation

The evaluation dataset contains 14 Indonesian journal abstracts, each with manually assigned keywords that serve as ground truth.

Running Evaluation

python app-main.py

Computed Metrics

  • True Positive (TP): Keyphrases matching journal keywords
  • False Positive (FP): Extracted keyphrases not in journal keywords
  • False Negative (FN): Journal keywords not successfully extracted
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
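
A worked sketch of these formulas follows. It assumes exact, case-insensitive string matching between extracted keyphrases and journal keywords; that matching rule and the function name `evaluate` are assumptions, and the evaluation in app-main.py may match phrases differently.

```python
def evaluate(extracted, ground_truth):
    """Compute TP, FP, FN, precision, recall and F1 for one document."""
    predicted = {phrase.strip().lower() for phrase in extracted}
    gold = {phrase.strip().lower() for phrase in ground_truth}
    tp = len(predicted & gold)   # extracted and in the journal keywords
    fp = len(predicted - gold)   # extracted but not in the journal keywords
    fn = len(gold - predicted)   # journal keywords that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up example, not taken from the evaluation dataset.
result = evaluate(
    ["machine learning", "sistem", "data", "pengembangan"],
    ["Machine Learning", "kecerdasan buatan"],
)
print(result)
```

In this made-up example one of four extracted phrases matches, so precision is 1/4; one of two journal keywords is found, so recall is 1/2; and F1 is 2 × (0.25 × 0.5) / 0.75 = 1/3.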

🚢 Deployment

Heroku

The project is already configured for Heroku deployment:

heroku create your-app-name
git push heroku main

Docker (Optional)

FROM python:3.9-slim

WORKDIR /app
COPY . .
RUN pip install -r requirements.txt

EXPOSE 5005
CMD ["gunicorn", "-b", "0.0.0.0:5005", "app:app"]

🛠️ Development

Running Tests

# TODO: unit tests are not included yet; once a tests/ directory exists, run:
python -m pytest tests/

Code Style

This project follows the PEP 8 style guide. Use:

pip install black flake8
black .
flake8 .

📄 License

MIT License - Feel free to use and modify as needed.

👤 Author

Fadel Muhammad

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Create a Pull Request

Note: This project was developed for research and educational purposes in the field of Natural Language Processing (NLP) for the Indonesian language.
