Medical Named Entity Recognition (NER)

Applied NLP using spaCy & Active Learning

Project Overview

This project identifies biomedical entities, specifically Chemicals and Diseases, in the BioCreative-V CDR Corpus. Using spaCy, I developed a pipeline that combines rule-based matching with statistical model training to extract these entities from unstructured text.

View Project on GitHub | Open in Colab | Dataset (GitHub - BioCreative-V CDR Corpus)


Methodology

The project was executed in four critical phases:

1. Data Preparation

Parsed the corpus files, which are in PubTator format, and merged the training and development sets into a single pool of annotated documents for the model.
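The parsing step can be sketched as follows. This is a minimal illustration and not the project's actual code: it assumes the usual PubTator layout (a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated annotation lines, with documents separated by blank lines) and that offsets count over the title, one separator character, and the abstract. The `Document` class, `parse_pubtator` function, and the two toy records are hypothetical names and data made up for this example.

```python
import re
from dataclasses import dataclass, field

# Matches "PMID|t|..." (title) and "PMID|a|..." (abstract) lines.
HEADER = re.compile(r"^(\d+)\|([ta])\|(.*)$")


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    # (start, end, label) character offsets into `text`
    entities: list = field(default_factory=list)

    @property
    def text(self):
        # Assumption: offsets run over title + one separator + abstract.
        return f"{self.title} {self.abstract}"


def parse_pubtator(raw):
    """Parse PubTator-formatted text into a list of Document objects."""
    docs = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        doc = None
        for line in block.splitlines():
            header = HEADER.match(line)
            if header:
                pmid, kind, body = header.groups()
                if doc is None:
                    doc = Document(pmid=pmid)
                if kind == "t":
                    doc.title = body
                else:
                    doc.abstract = body
                continue
            fields = line.split("\t")
            # Entity lines: PMID, start, end, mention, type, ...
            # Relation lines (second field "CID") are skipped.
            if (doc is not None and len(fields) >= 5
                    and fields[1].isdigit() and fields[2].isdigit()):
                doc.entities.append((int(fields[1]), int(fields[2]), fields[4]))
        if doc is not None:
            docs.append(doc)
    return docs


# Toy records for illustration only; they are not taken from the corpus.
TRAIN_SAMPLE = (
    "1|t|Aspirin and headache\n"
    "1|a|Aspirin relieved the headache.\n"
    "1\t0\t7\tAspirin\tChemical\tD001241\n"
    "1\t12\t20\theadache\tDisease\tD006261\n"
    "1\tCID\tD001241\tD006261\n"
)
DEV_SAMPLE = (
    "2|t|Lithium toxicity\n"
    "2|a|Lithium caused tremor.\n"
    "2\t0\t7\tLithium\tChemical\tD008094\n"
)

# Merging the two sets is plain list concatenation.
corpus = parse_pubtator(TRAIN_SAMPLE) + parse_pubtator(DEV_SAMPLE)
```

Keeping entities as character offsets makes it easy to check each annotation against the text, for example `corpus[0].text[12:20]` should give back the annotated mention.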

2. Rule-Based Matching (Weak Labeling)

Before training, I implemented a collection of patterns to identify entities through exact matches and regex. This served as a “weak supervision” layer to bootstrap the learning process.
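As a rough illustration of this weak-labeling idea, the sketch below combines exact term lookup with regex rules using only the standard library. The project itself built its patterns with spaCy, so this is a dependency-free analogue and not the original implementation; the lexicon entries, the suffix rule, and the `weak_label` function are all assumptions chosen for the example.

```python
import re

# Hypothetical mini-lexicon for exact matches; a real one might be
# collected from the annotated training mentions.
EXACT_TERMS = {"aspirin": "Chemical", "hypertension": "Disease"}

# Hypothetical regex rule: words ending in common disease suffixes.
REGEX_RULES = [
    (re.compile(r"\b\w+(?:itis|emia|pathy)\b", re.IGNORECASE), "Disease"),
]


def weak_label(text):
    """Return sorted (start, end, label) spans found by the rules."""
    spans = set()
    for term, label in EXACT_TERMS.items():
        # Exact match on word boundaries, case-insensitive.
        for m in re.finditer(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            spans.add((m.start(), m.end(), label))
    for pattern, label in REGEX_RULES:
        for m in pattern.finditer(text):
            spans.add((m.start(), m.end(), label))
    return sorted(spans)


labels = weak_label("Aspirin may worsen gastritis.")
```

Rules like these are noisy by design: they produce cheap initial labels that the statistical model can later refine.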

3. Active Learning Loop

The Active Learning Loop is one of the more advanced parts of this project. I trained the model iteratively and, after each round, queried the samples it was least certain about, so that manual annotation went to the examples most likely to improve the model and less labeling was needed overall.
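The loop can be sketched with least-confidence sampling, one common query strategy; the source does not say which strategy was used, so treat this as an assumption. The `train`, `confidence`, and `annotate` callables are hypothetical placeholders for model training, per-sample confidence scoring, and manual labeling, and how a confidence score is obtained from the NER model is left abstract here.

```python
def least_confident(scores, k):
    """Indices of the k samples with the lowest confidence scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:k]


def active_learning_loop(labeled, pool, train, confidence, annotate,
                         rounds=3, batch_size=2):
    """Train, query the least confident pool samples, label, retrain."""
    labeled, pool = list(labeled), list(pool)
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        scores = [confidence(model, sample) for sample in pool]
        picked = set(least_confident(scores, batch_size))
        # Send the queried samples to the annotator and move them over.
        labeled.extend(annotate(pool[i]) for i in sorted(picked))
        pool = [s for i, s in enumerate(pool) if i not in picked]
        model = train(labeled)
    return model, labeled, pool


# Toy run: each "sample" is its own confidence score, the "model" is
# just the number of labeled items, and annotation attaches a dummy tag.
model, labeled, remaining = active_learning_loop(
    labeled=[],
    pool=[0.9, 0.1, 0.5, 0.3],
    train=lambda data: len(data),
    confidence=lambda model, sample: sample,
    annotate=lambda sample: (sample, "annotated"),
    rounds=1,
    batch_size=2,
)
```

In the toy run the two lowest-scoring samples are queried first, which is the behavior that concentrates annotation effort on uncertain examples.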

4. Model Training & Optimization

The model was built on spaCy’s transition-based NER component. I used en_core_web_sm as the base pipeline and fine-tuned it on the CDR corpus.
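One preparation step that fine-tuning of this kind typically needs is turning annotations into non-overlapping character spans, since spaCy's entity annotations cannot overlap. The sketch below is a dependency-free illustration of that step and an assumption about the pipeline, not code from the project. It mirrors the "prefer the longest span" behavior of spaCy's `spacy.util.filter_spans`, but on plain offset tuples; `drop_overlaps` and `to_training_example` are hypothetical helper names.

```python
def drop_overlaps(spans):
    """Keep the longest spans; skip any span overlapping one already kept."""
    kept = []
    # Longest first, ties broken by earliest start.
    for start, end, label in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def to_training_example(text, spans):
    """Build a (text, annotations) pair with an "entities" offset list."""
    return text, {"entities": drop_overlaps(spans)}


example = to_training_example(
    "acetylsalicylic acid and migraine",
    [(0, 15, "Chemical"), (0, 20, "Chemical"), (25, 33, "Disease")],
)
```

A pair in this shape can then be handed to spaCy, for instance through `Example.from_dict(nlp.make_doc(text), annotations)` in spaCy v3, before calling the training update.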


Performance Results

The model was evaluated using a strict token-level matching criterion.

Metric      Score
Precision   0.5819
Recall      0.3580
F1-Score    0.4433

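Note: The model is conservative. Recall is low, with only about 36% of gold entity tokens found, while precision is higher, with about 58% of predicted entity tokens correct. Favoring precision over recall suits medical text, where false positives are costly, though at this level the predictions still need review.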
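For reference, the sketch below shows one plausible reading of strict token-level scoring: a predicted token counts as correct only if its label exactly matches the gold label at the same position. The exact evaluation code is in the notebook, so the `token_prf` function and the `"O"` outside tag are assumptions. The last line checks that the reported F1 is the harmonic mean of the reported precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def token_prf(gold, pred, outside="O"):
    """Token-level precision, recall, and F1 over aligned tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if p != outside and g == p)
    predicted = sum(1 for p in pred if p != outside)
    actual = sum(1 for g in gold if g != outside)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example: one of two predicted entity tokens is correct,
# and one of three gold entity tokens is recovered.
gold = ["Chemical", "O", "Disease", "Disease"]
pred = ["Chemical", "O", "O", "Chemical"]
p, r, f = token_prf(gold, pred)

# Consistency check against the reported scores.
reported_f1 = round(f1_score(0.5819, 0.3580), 4)  # 0.4433
```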


Complete Analysis Notebook

To view the full Python implementation, data parsing logic, and execution logs, please visit the technical notebook page.

Tip: Project Documentation

The complete methodology, data analysis, and model performance metrics are detailed in the full technical report.
