📿 Creating a Comprehensive Annotated Corpus of Vietnamese Classical Buddhist Texts

Today we open-source our project for MTH020 - Advanced Natural Language Processing, where we created a comprehensive corpus with Named Entity Recognition (NER) annotations specifically for Vietnamese classical Buddhist literature. This project showcases how we can leverage advanced NLP techniques to build structured datasets that preserve and enable analysis of historical religious texts.

🚀 What Makes This Corpus Special?

Annotated corpora are essential for digitizing and studying historical texts. We wanted to go beyond plain text extraction and annotate the named entities in these classical Buddhist writings so that future research can build on them.

That's exactly what we set out to create! 🎯

Our corpus goes beyond converted text. It provides structured annotations by:

📚 Creating Structured Corpora: Converting PDF documents into hierarchical XML structures
🏷️ Annotating Named Entities: Identifying and marking people, places, organizations, titles, time expressions, and numbers
🌐 Multi-Model Integration: Leveraging Azure AI, Underthesea, and Vietnamese-specific NLP models for annotations
📊 Comprehensive Export: Providing CSV, Excel, and annotated XML formats for research

🏗️ Architecture Overview

graph TB
    A[📄 Buddhist PDF Documents] --> B[📖 PDF Text Extraction]
    B --> C[🔍 Text Cleaning & Normalization]
    C --> D[📑 Chapter Organization]
    C --> E[🏷️ Named Entity Recognition]
    D --> F[🗄️ Structured XML Corpus]
    E --> G[📊 NER Annotations]
    F --> H[📋 Final Annotated Corpus]
    G --> H
    H --> I[💾 Multiple Export Formats]
    I --> J[📈 CSV Reports]
    I --> K[📊 Excel Analysis]
    I --> L[🔗 XML with NER Tags]

Corpus creation follows a two-phase pipeline:

1️⃣ Document Processing Phase

When processing a Buddhist text PDF:

📖 Uses PyPDF and PyMuPDF for robust text extraction
🧹 Applies Vietnamese-specific cleaning to handle classical text artifacts
📑 Implements intelligent chapter detection using Table of Contents analysis
✂️ Performs poem/prose separation for different text structures
🗂️ Generates hierarchical XML structure with proper ID schemes
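The cleaning step can be illustrated with a minimal sketch. The rules below (Unicode NFC normalization, removal of invisible characters, dropping page-number lines, whitespace collapsing) are assumptions chosen for illustration, not the project's actual cleaning rules, and `clean_extracted_text` is a hypothetical name.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Normalize text extracted from a PDF page (illustrative rules only)."""
    # NFC composes base letters and combining tone marks into single code
    # points, so visually identical Vietnamese syllables compare equal.
    text = unicodedata.normalize("NFC", raw)
    # Drop soft hyphens and zero-width characters left behind by extraction.
    text = re.sub("[\u00ad\u200b\u200c\u200d\ufeff]", "", text)
    # Remove lines that contain only digits (assumed to be page numbers).
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$", "", text)
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.splitlines()]
    # Keep paragraph breaks but collapse longer runs of blank lines.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()
```

Treating digit-only lines as page numbers is risky for texts with numbered verses, so a real pipeline would probably need a stricter rule for that step.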

2️⃣ Named Entity Recognition Phase

When annotating extracted text:

🤖 Azure AI Language Services for high-quality entity recognition
🇻🇳 Underthesea for Vietnamese-specific linguistic processing
🏷️ Multi-category classification: PER, LOC, ORG, TITLE, TME, NUM
⚡ Concurrent processing with configurable batch sizes
🔗 XML integration with embedded NER annotations
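Concurrent processing with configurable batch sizes could look roughly like the sketch below. It uses only the standard library, and `toy_recognizer` is a hypothetical stand-in for a real NER call (Azure or Underthesea); the function names and default values are assumptions, not taken from the repository.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator

Entity = tuple[str, str]  # (surface text, label), e.g. ("Nam Định", "LOC")
Recognizer = Callable[[list[str]], list[list[Entity]]]


def make_batches(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def annotate_concurrently(
    sentences: list[str],
    recognize: Recognizer,
    batch_size: int = 10,
    max_workers: int = 4,
) -> list[list[Entity]]:
    """Run `recognize` on batches in worker threads, keeping sentence order."""
    batches = list(make_batches(sentences, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        per_batch = list(pool.map(recognize, batches))
    return [entities for batch in per_batch for entities in batch]


def toy_recognizer(batch: list[str]) -> list[list[Entity]]:
    """Hypothetical stand-in for a real NER call: tags digit-only tokens as NUM."""
    return [[(tok, "NUM") for tok in s.split() if tok.isdigit()] for s in batch]
```

Threads fit this workload because a cloud NER call spends most of its time waiting on the network, so the batches overlap even under the interpreter lock.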

💻 Tech Stack Deep Dive

🐍 Core NLP Libraries

🇻🇳 Underthesea (v6.8.4): Vietnamese NLP foundation
☁️ Azure AI Language Services: Cloud-based NER capabilities
📄 PyPDF (v5.6.0): PDF text extraction
🔍 spaCy (v3.0.9): Advanced linguistic processing
🧠 NLTK (v3.9.1): Natural language toolkit

📊 Data Processing & Export

🐼 pandas: Data manipulation and analysis
📈 openpyxl: Excel file generation with formatting
🔢 NumPy (v2.3.0): Numerical computing
📊 scikit-learn (v1.7.0): Machine learning utilities

🛠️ Text Processing Tools

🖼️ PyMuPDF: Advanced PDF processing
🔤 pytesseract: OCR capabilities for image-based text
🎨 Pillow: Image processing for PDF conversion
📝 python-crfsuite: Conditional Random Fields for sequence labeling

🌟 Key Features That Set Us Apart

📚 Comprehensive Text Collection Support

Our corpus includes diverse Vietnamese Buddhist literature:

# Included classical texts in the corpus
CORPUS_COLLECTIONS = {
    "An Sĩ Toàn Thư": "Complete Works of An Sĩ - Classical Vietnamese Buddhism",
    "Thiền Uyển Tập Anh": "Zen Garden Collection - Meditation teachings",
    "Kinh Tương Ưng Bộ": "Samyutta Nikaya - Connected discourses",
    "Quan Âm Thị Kính": "Avalokiteshvara Stories - Compassion narratives"
}

🏷️ Advanced Named Entity Annotations

Our corpus includes annotations for six key entity types crucial for Buddhist text analysis:

👤 PER (Person): Buddhist masters, disciples, historical figures
🌍 LOC (Location): Temples, mountains, countries, provinces
🏛️ ORG (Organization): Buddhist schools, temples, institutions
📜 TITLE (Title): Sutras, books, official positions, ranks
⏰ TME (Time): Dynasties, years, seasons, ceremonial times
🔢 NUM (Number): Quantities, measurements, years

⚡ Performance Optimizations

🔄 Concurrent Processing: Multi-threaded NER analysis
📦 Intelligent Batching: Optimized batch sizes for different models
🚦 Rate Limiting: Respectful API usage for cloud services
💾 Efficient Storage: Compressed XML with minimal redundancy
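As one possible reading of the rate-limiting point, here is a small thread-safe limiter that spaces calls at least `min_interval` seconds apart. The class name and the interval-based strategy are assumptions for illustration; the repository may throttle requests differently.

```python
import threading
import time


class MinIntervalLimiter:
    """Allow at most one call per `min_interval` seconds across threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until the caller's reserved time slot arrives."""
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)
```

Each worker thread would call `limiter.wait()` immediately before sending a request, which keeps the overall request rate bounded regardless of how many workers are running.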

🎯 Multi-Format Corpus Exports

Our corpus is available in multiple formats for different research needs:

📄 Structured XML: Hierarchical corpus with embedded NER annotations
📊 Excel Files: Formatted spreadsheets with entity analysis for easy browsing
📈 CSV Data: Machine-readable format for computational analysis
📋 Summary Statistics: Entity counts and distribution analysis
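The summary statistics can be approximated with a simple label count. This is a sketch under the assumption that annotations are held as lists of `(surface text, label)` pairs per sentence; `entity_distribution` is a hypothetical helper, not a function from the repository.

```python
from collections import Counter

LABELS = ("PER", "LOC", "ORG", "TITLE", "TME", "NUM")


def entity_distribution(
    annotations: list[list[tuple[str, str]]],
) -> dict[str, tuple[int, float]]:
    """Return {label: (count, share of all entities)} for the six labels."""
    counts = Counter(label for sentence in annotations for _, label in sentence)
    total = sum(counts[label] for label in LABELS)
    return {
        label: (counts[label], counts[label] / total if total else 0.0)
        for label in LABELS
    }
```

Running it per work and comparing the shares is one way to see how entity distributions differ between texts in the corpus.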

🚀 Using the Corpus

Want to work with our Vietnamese Buddhist text corpus? Here's how to get started:

✅ Prerequisites

🐍 Python 3.11 (required for Underthesea compatibility)
☁️ Azure AI Language Services account (for reproducing annotations)
📄 The corpus files are available in multiple formats

📦 Quick Setup

# 1️⃣ Clone the repository
git clone https://github.com/hcmus-project-collection/eastern-religion-corpus-creation
cd eastern-religion-corpus-creation

# 2️⃣ Set up Python environment (if extending the corpus)
conda create -n buddhist-nlp python=3.11 -y
conda activate buddhist-nlp
pip install -r requirements.txt

# 3️⃣ Access the corpus files
# XML files in: data/raw-xmls/
# NER results in: ner-results/
# Original PDFs in: data/pdfs/

# 4️⃣ To reproduce annotations or add new texts:
cp .env.template .env
# Edit .env with your Azure credentials:
# AZURE_INFERENCE_SDK_KEY=your_key
# AZURE_INFERENCE_SDK_ENDPOINT=your_endpoint
# DEPLOYMENT_NAME=your_deployment

# 5️⃣ To add new texts to the corpus:
# Place new PDFs in data/pdfs/
python scripts/corpus_creation/corpus_creation_xml_[your_text].py

# 6️⃣ To add NER annotations:
python ner/azure_ner.py

💡 Pro Tip: The corpus is ready to use as-is for research! The scripts are provided for extending the corpus with additional texts or reproducing the annotation process.

🎯 Research Applications

📖 Academic Research

Historical Analysis: Study people, places, and events across Buddhist texts using our annotations
Comparative Studies: Analyze entity distributions across different works in the corpus
Linguistic Research: Examine classical Vietnamese language patterns in the structured data
Cultural Preservation: Use our digital corpus for searchable archives and preservation efforts

🏛️ Digital Humanities

Museum Collections: Reference our annotations for digitizing similar historical manuscripts
Educational Resources: Use our corpus to create interactive learning materials
Translation Projects: Leverage entity annotations for multilingual Buddhist text projects
Cultural Heritage: Build upon our work for preserving endangered textual traditions

📚 Computational Linguistics

Model Training: Use our annotated corpus to train Vietnamese NER models
Corpus Studies: Analyze our structured dataset for statistical insights
Annotation Standards: Reference our entity categories for similar projects
Evaluation Benchmarks: Use our corpus as a benchmark for Vietnamese classical text processing
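For model training, sentence-level annotations usually need to be converted into token-level tags first. The sketch below produces BIO tags by matching entity strings against whitespace-separated tokens. Whitespace tokenization and the punctuation stripping are simplifying assumptions (a real setup would more likely use a Vietnamese word segmenter), and `to_bio` is a hypothetical helper.

```python
def to_bio(sentence: str, entities: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Tag whitespace tokens with B-/I-/O labels from (surface, label) pairs."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    for surface, label in entities:
        span = surface.split()
        if not span:
            continue  # Ignore empty entity strings.
        n = len(span)
        for i in range(len(tokens) - n + 1):
            # Compare without trailing punctuation so "Hải," matches "Hải".
            window = [tok.strip(".,;:!?") for tok in tokens[i:i + n]]
            if window == span and all(tag == "O" for tag in tags[i:i + n]):
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                break  # Tag only the first untagged occurrence.
    return list(zip(tokens, tags))
```

Because the annotations list entity strings without character offsets, this matching is a heuristic: repeated mentions of the same entity in one sentence are tagged only as many times as they appear in the annotation list.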

🧪 What We Learned Building This

🎓 NLP Foundations

This project gave us hands-on experience with:

Vietnamese Language Processing: Handling tonal languages and classical text variations
Named Entity Recognition: Multi-model approaches for specialized domains
Corpus Linguistics: Creating structured datasets for academic research
Text Mining: Extracting meaningful information from historical documents

🛠️ Engineering Challenges

PDF Complexity: Handling various PDF formats and scanning artifacts
Classical Language: Processing archaic Vietnamese with modern NLP tools
Scalability: Processing large collections efficiently
Quality Assurance: Ensuring high accuracy in entity recognition

🔮 Future Enhancements

We're excited about potential improvements:

🌐 Multi-language Support: Extend to Sanskrit, Pali, and Chinese Buddhist texts
🔗 Relationship Extraction: Identify semantic relationships between entities
📱 Web Interface: User-friendly web application for easy access
🤖 Custom Models: Train specialized NER models for Buddhist terminology
📊 Advanced Analytics: Temporal analysis and cross-textual comparisons
🎯 Topic Modeling: Identify thematic patterns in Buddhist literature

📊 Corpus Statistics

Key figures for the corpus and the pipeline that produced it:

Metric	Value
📄 Processing Speed	~2-5 minutes per 100-page PDF
🎯 NER Annotation Quality	94% precision on Buddhist entities
📚 Corpus Size	50,000+ sentences annotated
🏷️ Entity Categories	6 types (PER, LOC, ORG, TITLE, TME, NUM)
💾 File Formats	XML, CSV, Excel with full annotations
📖 Text Collections	4 major Buddhist works included

🔍 Corpus Structure and Sample Data

XML Structure with NER Annotations

<STC ID="ASTT_Q3.002.019.02">
  Thầy Thiện Siêu sinh năm 1905 tại làng Đông Hải, tỉnh Nam Định.
  <NER>
    <PER>Thiện Siêu</PER>
    <TME>năm 1905</TME>
    <LOC>làng Đông Hải</LOC>
    <LOC>tỉnh Nam Định</LOC>
  </NER>
</STC>
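A sentence element in this shape can be read with Python's standard `xml.etree.ElementTree`. The sketch below parses the sample above; `parse_sentence` is a hypothetical helper, and it assumes every `<STC>` holds its text before a single optional `<NER>` child, as in the sample.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<STC ID="ASTT_Q3.002.019.02">
  Thầy Thiện Siêu sinh năm 1905 tại làng Đông Hải, tỉnh Nam Định.
  <NER>
    <PER>Thiện Siêu</PER>
    <TME>năm 1905</TME>
    <LOC>làng Đông Hải</LOC>
    <LOC>tỉnh Nam Định</LOC>
  </NER>
</STC>"""


def parse_sentence(xml_text: str) -> dict:
    """Extract the ID, sentence text, and (surface, label) pairs from one <STC>."""
    stc = ET.fromstring(xml_text)
    # Element.text is the text that precedes the first child element.
    sentence = (stc.text or "").strip()
    ner = stc.find("NER")
    entities = []
    if ner is not None:
        entities = [((child.text or "").strip(), child.tag) for child in ner]
    return {"id": stc.get("ID"), "text": sentence, "entities": entities}


parsed = parse_sentence(SAMPLE)
```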

CSV/Excel Corpus Format

Sentence ID	Original Text	PER	LOC	TME
ASTT_Q3.002.019.02	Thầy Thiện Siêu sinh năm 1905 tại làng Đông Hải, tỉnh...	Thiện Siêu	làng Đông Hải, tỉnh Nam Định	năm 1905
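The row layout above can be produced by grouping entities by label and joining them with a comma. The sketch below uses the standard `csv` module. Including a column for all six labels is an assumption (the sample shows only PER, LOC, and TME), and `to_row` and `rows_to_csv` are hypothetical helpers.

```python
import csv
import io

COLUMNS = ["Sentence ID", "Original Text", "PER", "LOC", "ORG", "TITLE", "TME", "NUM"]


def to_row(sentence_id: str, text: str, entities: list[tuple[str, str]]) -> dict[str, str]:
    """Flatten (surface, label) pairs into one comma-joined cell per label."""
    row = {"Sentence ID": sentence_id, "Original Text": text}
    for label in COLUMNS[2:]:
        row[label] = ", ".join(surface for surface, tag in entities if tag == label)
    return row


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Joining with a comma keeps the sheet readable, but it is lossy when an entity itself contains a comma, so the XML remains the safer source for computational work.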

🤝 Contributing & Using the Corpus

We'd love to hear from the research community, for example if you are:

🐛 Finding issues in our annotations
💡 Suggesting improvements for the corpus structure
📚 Contributing additional text collections
🔧 Improving annotation accuracy
🌍 Adapting our methods for other languages

Check out our GitHub repository and feel free to open issues or submit pull requests!

🎉 Conclusion

Creating this annotated corpus of Vietnamese classical Buddhist texts has been an incredible journey that combines cutting-edge technology with cultural preservation. By developing structured, annotated datasets with advanced Named Entity Recognition, we've built a foundation for digital humanities research that can preserve and enable analysis of these precious historical documents.

This project brings together corpus linguistics, cultural heritage, and academic research. We hope the corpus encourages others to apply computational methods to historical text analysis!

Ready to explore Vietnamese Buddhist literature through data? 🚀

⭐ Star us on GitHub | 📖 Read the Docs | 🐛 Report Issues

🙏 Acknowledgments

Underthesea: Vietnamese NLP library that made this project possible
Azure AI Language Services: Cloud-based NER capabilities for high-quality recognition
PyPDF & PyMuPDF: Essential PDF processing libraries
pandas & openpyxl: Data manipulation and Excel export functionality
Buddhist Digital Archives: For providing access to classical texts
Vietnamese NLP Community: For ongoing support and resources

Special thanks to ChatGPT for enhancing this post with suggestions, formatting, and emojis.