Medical Named Entity Recognition (NER)
Applied NLP using spaCy & Active Learning
Project Overview
This project focuses on identifying complex biomedical entities—specifically Chemicals and Diseases—from the BioCreative-V CDR Corpus. Using spaCy, I developed a pipeline that combines rule-based matching with statistical model training to automate the extraction of life-science insights from unstructured text.
Methodology
The project was executed in four critical phases:
1. Data Preparation
Parsed the corpus files in PubTator format and merged the training and development sets into a single, larger training pool for the model.
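As a rough illustration of this step, the sketch below parses PubTator-style text using only the standard library. It assumes the usual layout: a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated annotation rows (`PMID, start, end, mention, type, id`), with offsets indexing into the title, one separator character, and the abstract. The `Document` class, the `parse_pubtator` helper, and the toy record (including its placeholder IDs) are hypothetical and not taken from the project notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    entities: list = field(default_factory=list)  # (start, end, label) tuples

    @property
    def text(self):
        # Assumption: offsets index into title + one separator char + abstract.
        return f"{self.title} {self.abstract}"


def parse_pubtator(raw):
    """Parse PubTator-formatted text into a list of Document objects."""
    docs = {}
    for line in raw.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            doc = docs.setdefault(head[0], Document(pmid=head[0]))
            if head[1] == "t":
                doc.title = head[2]
            else:
                doc.abstract = head[2]
            continue
        cols = line.split("\t")
        if len(cols) >= 5 and cols[1].isdigit() and cols[2].isdigit():
            doc = docs.setdefault(cols[0], Document(pmid=cols[0]))
            doc.entities.append((int(cols[1]), int(cols[2]), cols[4]))
        # Any other row (for example a relation row) is skipped here.
    return list(docs.values())


# Toy record for illustration only; the IDs are placeholders.
SAMPLE = (
    "1|t|Aspirin and headache\n"
    "1|a|Aspirin relieved the headache.\n"
    "1\t0\t7\tAspirin\tChemical\tC1\n"
    "1\t12\t20\theadache\tDisease\tD1\n"
    "1\t21\t28\tAspirin\tChemical\tC1\n"
    "1\tCID\tC1\tD1\n"
)

docs = parse_pubtator(SAMPLE)
for start, end, label in docs[0].entities:
    print(label, repr(docs[0].text[start:end]))
```

With two such files, merging the splits is a plain list concatenation, for example `parse_pubtator(train_raw) + parse_pubtator(dev_raw)`, where `train_raw` and `dev_raw` are assumed to hold the file contents.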
2. Rule-Based Matching (Weak Labeling)
Before training, I implemented a collection of patterns to identify entities through exact matches and regex. This served as a “weak supervision” layer to bootstrap the learning process.
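The idea of combining exact matches with regex can be sketched without any dependencies, as below. This is a standard-library stand-in rather than the project's spaCy matcher code: the lexicon, the suffix pattern, and the `weak_label` helper are all made up for illustration. In spaCy, comparable patterns would more likely be loaded into a component such as the `EntityRuler`.

```python
import re

# Hypothetical mini-lexicon for exact (case-insensitive) matches.
LEXICON = {"aspirin": "Chemical", "headache": "Disease"}

# Hypothetical regex rule: words with common disease suffixes.
REGEX_PATTERNS = [
    (re.compile(r"\b\w+(?:itis|emia|osis)\b", re.IGNORECASE), "Disease"),
]


def weak_label(text):
    """Return non-overlapping (start, end, label) spans found by the rules."""
    spans = []
    for term, label in LEXICON.items():
        for m in re.finditer(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            spans.append((m.start(), m.end(), label))
    for pattern, label in REGEX_PATTERNS:
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))

    # Resolve overlaps by keeping longer spans first, then earlier ones.
    spans.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for s in spans:
        if all(s[1] <= k[0] or s[0] >= k[1] for k in kept):
            kept.append(s)
    return sorted(kept)


text = "Aspirin did not prevent hepatitis or headache."
for start, end, label in weak_label(text):
    print(label, text[start:end])
```

Labels produced this way are noisy by design; they only need to be good enough to give the statistical model a starting point.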
3. Active Learning Loop
The project also includes an Active Learning Loop. I trained the model iteratively and, after each round, queried the samples it was least certain about for manual labeling. Concentrating annotation effort on these uncertain samples reduced the amount of manual annotation required.
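A generic version of such a loop is sketched below using pool-based uncertainty sampling with mean per-token entropy as the score. This is one common choice, not necessarily the criterion used in the project. The `train` and `annotate` callables are hypothetical placeholders, and the sketch assumes the trained model can be called on a sample to return one probability distribution per token; how those probabilities are obtained from a spaCy pipeline is left open here.

```python
import math


def token_entropy(probs):
    """Shannon entropy of one token's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sentence_uncertainty(token_distributions):
    """Mean per-token entropy; higher means the model is less sure."""
    if not token_distributions:
        return 0.0
    total = sum(token_entropy(d) for d in token_distributions)
    return total / len(token_distributions)


def query_most_uncertain(pool, predict_proba, k):
    """Pick the k pool samples with the highest uncertainty score."""
    ranked = sorted(
        pool,
        key=lambda s: sentence_uncertainty(predict_proba(s)),
        reverse=True,
    )
    return ranked[:k]


def active_learning_loop(labeled, pool, train, annotate, rounds, k):
    """Alternate between training and labeling the most uncertain samples.

    train(labeled) -> model callable: sample -> per-token distributions
    annotate(sample) -> labeled example (a human annotator in practice)
    Both callables are hypothetical placeholders.
    """
    labeled, pool = list(labeled), list(pool)
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        batch = query_most_uncertain(pool, model, k)
        labeled.extend(annotate(s) for s in batch)
        pool = [s for s in pool if s not in batch]
        model = train(labeled)
    return model, labeled


# Toy demonstration with fixed, made-up probabilities.
fake_probs = {"a": [[0.5, 0.5]], "b": [[0.99, 0.01]], "c": [[0.7, 0.3]]}
print(query_most_uncertain(["a", "b", "c"], fake_probs.get, 2))
```

The annotation budget is controlled by `rounds` and `k`: only `rounds * k` samples are ever sent to the annotator, instead of the whole pool.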
4. Model Training & Optimization
The model uses spaCy’s transition-based NER component. I used en_core_web_sm as the base pipeline and fine-tuned it on the specialized medical corpus.
Performance Results
The model was evaluated using a strict token-level matching criterion.
| Metric | Score |
|---|---|
| Precision | 0.5819 |
| Recall | 0.3580 |
| F1-Score | 0.4433 |
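Note: Recall is conservative: the model recovers only about 36% of the annotated entity tokens. Precision is higher, with about 58% of the tokens it labels as entities being correct. The model thus trades recall for precision, and precision is a crucial requirement in medical diagnostics.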
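For reference, the sketch below shows one plausible reading of strict token-level scoring: a predicted token counts as correct only if its label exactly matches the gold label. The `token_prf` and `f1_score` helpers and the `"O"` outside tag are assumptions, and the project's actual evaluation script may differ. The last lines check that the reported F1-Score is the harmonic mean of the reported Precision and Recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_prf(gold, pred, outside="O"):
    """Token-level precision, recall, and F1 with exact label matching.

    gold and pred are equal-length lists of per-token labels.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1
        else:
            if p != outside:
                fp += 1  # predicted an entity label that is wrong
            if g != outside:
                fn += 1  # missed or mislabeled a gold entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up labels for illustration.
gold = ["Chemical", "O", "Disease", "Disease", "O"]
pred = ["Chemical", "O", "O", "Chemical", "Disease"]
print(token_prf(gold, pred))

# Consistency check on the reported scores.
print(round(f1_score(0.5819, 0.3580), 4))  # 0.4433
```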
Complete Analysis Notebook
To view the full Python implementation, data parsing logic, and execution logs, please visit the technical notebook page.
The complete methodology, data analysis, and model performance metrics are detailed in the full technical report.