Empowering Indian Healthcare AI with High-Quality Data

At Eka Care, we believe that advancing healthcare in India requires AI solutions built specifically for Indian populations. To support this vision, we are proud to share comprehensive, anonymized healthcare datasets with the research community.

Available Datasets

Eka-Medical-ASR Evaluation Dataset

A curated collection of 3,900+ English and Hindi medical speech recordings designed to benchmark and improve speech-to-text performance for Indian healthcare use cases.
Built with diverse accents and real clinical terminology, this dataset enables robust evaluation of AI-powered medical scribe and voice-enabled healthcare applications.
AIKosh | HuggingFace
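
Speech-to-text benchmarks are commonly scored with word error rate (WER). The sketch below is a minimal, dependency-free illustration of that metric, not the scoring script shipped with this dataset; the function name and the lowercase, whitespace-split normalization are assumptions, and the dataset documentation may prescribe a different metric or normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace tokenization is adequate normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words gives a WER of 0.25.
print(word_error_rate("take metformin twice daily", "take metformin daily"))
```

Because tokenization is whitespace-based, the same function applies to Hindi transcripts written in Devanagari, although medical abbreviations and numerals may need extra normalization.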

Medical Records Parsing Validation Set

A curated dataset of 288 de-identified lab reports and prescriptions designed to evaluate AI systems that extract structured data from unstructured medical documents.
Expert annotations, combined with rubric-based LLM evaluation, support clinically accurate benchmarking across diverse Indian healthcare document formats.
AIKosh | HuggingFace

EkaCare Medical History Summarisation

A curated set of 58 real-world medical cases designed to evaluate AI systems in generating concise, clinically relevant summaries of patient histories.
Expert-defined rubrics and reference summaries ensure objective assessment of key developments, historical context, and critical care insights from the most recent six months of medical data.
AIKosh | HuggingFace

Clinical Note Generation Dataset

A multilingual dataset of 156+ transcribed doctor–patient conversations designed to evaluate AI systems that convert medical dialogues into structured, entity-level clinical records.
Expert-annotated ground truth JSON and rubric-based LLM evaluation ensure accurate benchmarking of structured note generation for EHR-ready medical documentation.
AIKosh | HuggingFace
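
Alongside rubric-based evaluation, entity-level output can also be compared to ground truth with precision, recall, and F1. The sketch below assumes a hypothetical note layout, a dictionary mapping each entity type to a list of strings; the actual JSON schema is defined in the dataset documentation and may differ, as may the matching rule (exact match after lowercasing is an assumption).

```python
def entity_f1(predicted: dict, reference: dict) -> dict:
    """Exact-match precision, recall and F1 over (entity type, value) pairs."""

    def flatten(note: dict) -> set:
        # Hypothetical schema: {"symptoms": ["fever", ...], "medications": [...]}
        return {
            (entity_type, value.strip().lower())
            for entity_type, values in note.items()
            for value in values
        }

    pred, ref = flatten(predicted), flatten(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative values only; not taken from the dataset.
reference_note = {"symptoms": ["fever", "cough"], "medications": ["paracetamol"]}
predicted_note = {"symptoms": ["Fever"], "medications": ["paracetamol", "azithromycin"]}
print(entity_f1(predicted_note, reference_note))
```

Exact matching is strict: a clinically equivalent paraphrase counts as an error, which is one reason rubric-based scoring is useful as a complement.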

Eka-IndicMTEB

A multilingual medical embedding benchmark with 2,532 doctor-verified queries across 8 Indic languages, aligned to SNOMED CT for concept-level evaluation.
Designed to benchmark and improve cross-lingual medical retrieval and semantic search systems across India’s diverse linguistic landscape.
AIKosh | HuggingFace
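
Concept-level retrieval of this kind is often summarized as recall@k: the share of queries whose correct concept appears among the k nearest concept embeddings. The sketch below is a minimal illustration under stated assumptions; the function names, the toy two-dimensional vectors, and the placeholder concept labels "C1" and "C2" are hypothetical and do not come from the benchmark, whose official metrics are described in its documentation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, gold_ids, concept_vecs, k=1):
    """Fraction of queries whose gold concept is among the top-k matches."""
    if not query_vecs:
        raise ValueError("at least one query is required")
    hits = 0
    for query_vec, gold_id in zip(query_vecs, gold_ids):
        # Rank every concept by similarity to the query, best first.
        ranked = sorted(
            concept_vecs,
            key=lambda concept_id: cosine(query_vec, concept_vecs[concept_id]),
            reverse=True,
        )
        hits += gold_id in ranked[:k]
    return hits / len(query_vecs)


# Toy embeddings; in practice these come from the model under evaluation.
concepts = {"C1": [1.0, 0.0], "C2": [0.0, 1.0]}
queries = [[0.9, 0.1], [0.8, 0.2]]
gold = ["C1", "C2"]
print(recall_at_k(queries, gold, concepts, k=1))  # second query misses at k=1
```

For a cross-lingual comparison, the same function can be run once per language subset so that per-language scores sit side by side.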

NidaanKosh

A Comprehensive Laboratory Investigation Dataset of 100,000 Indian Subjects with 6.8 Million+ Readings

100,000 Indian Subjects
6.8 Million+ Laboratory Readings
Covers common biomarkers and laboratory values specific to Indian populations
AIKosh | HuggingFace

Spandan

A Large Photoplethysmography (PPG) Signal Dataset of 1 Million+ Indian Subjects

1 Million+ Indian Subjects
Raw PPG signals captured from diverse demographic groups across India
Essential for developing accurate cardiovascular monitoring algorithms for Indian populations
AIKosh | HuggingFace
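
A basic cardiovascular measure derived from PPG is heart rate, estimated from the spacing between pulse peaks. The sketch below is a simplified illustration run on a synthetic sine wave rather than on Spandan data; the 100 Hz sampling rate, the function name, and the 0.3 s minimum peak gap are assumptions, and raw recordings would normally need filtering and motion-artifact handling first.

```python
import math


def estimate_heart_rate(signal, sampling_hz, min_gap_s=0.3):
    """Estimate beats per minute from the mean interval between pulse peaks."""
    mean_level = sum(signal) / len(signal)
    min_gap = int(min_gap_s * sampling_hz)  # refractory period in samples
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        # Keep local maxima above the mean that are not too close together.
        if is_local_max and signal[i] > mean_level:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to estimate heart rate")
    mean_interval_s = (peaks[-1] - peaks[0]) / (len(peaks) - 1) / sampling_hz
    return 60.0 / mean_interval_s


# Synthetic 10-second pulse wave at 1.2 Hz (72 beats per minute).
SAMPLING_HZ = 100  # assumed rate; check the dataset documentation
synthetic = [math.sin(2 * math.pi * 1.2 * t / SAMPLING_HZ) for t in range(1000)]
print(round(estimate_heart_rate(synthetic, SAMPLING_HZ)))
```

Averaging over all detected peaks keeps the estimate stable when individual peak positions are off by a sample.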

Why These Datasets Matter

Bridging the Data Gap

Most healthcare AI models developed today are trained on Western datasets, which may not accurately represent the characteristics of Indian patients. These datasets address this critical gap.

Enabling Homegrown Innovation

With access to high-quality Indian healthcare data, researchers and developers can build AI solutions tailored specifically to Indian healthcare challenges.

Advancing Healthcare Equity

By democratizing access to these datasets, we aim to support broader participation in healthcare AI development across India.

How to Use These Datasets

Access: All datasets are hosted on the India AI AIKosh platform
Documentation: Each dataset includes comprehensive documentation describing data structure, collection methodology, and suggested applications
Community: Join our researcher community forum to connect with others using these datasets

Ethics & Privacy

All datasets have been rigorously anonymized following industry best practices and ethical guidelines. No personally identifiable information is included in any dataset.

Research Collaboration

We welcome collaboration opportunities with academic institutions, research organizations, and industry partners. If you’re using our datasets for your research:

Please cite our datasets in your publications
Share your findings with our research community
Consider opportunities for joint research initiatives

Contact our research team

These datasets are provided for research and development purposes. For terms of use and licensing information, please refer to the documentation included with each dataset.
