Empowering Indian Healthcare AI with High-Quality Data

At Eka Care, we believe that advancing healthcare in India requires AI solutions built specifically for Indian populations. To support this vision, we are proud to share comprehensive, anonymized healthcare datasets with the research community.

Available Datasets

Eka-Medical-ASR Evaluation Dataset

  • A curated collection of 3,900+ English and Hindi medical speech recordings designed to benchmark and improve speech-to-text performance for Indian healthcare use cases.
  • Built with diverse accents and real clinical terminology, this dataset enables robust evaluation of AI-powered medical scribe and voice-enabled healthcare applications.
  • AIKosh | HuggingFace
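
One common way to benchmark speech-to-text output against reference transcripts is word error rate (WER). This page does not say which metric the dataset's own evaluation uses, so the following is only a minimal sketch under that assumption; the function name and the example sentences are illustrative and are not taken from the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercase + whitespace tokenization. Real medical
    transcripts may need normalization of numerals, units, and drug names.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substituted word out of five.
print(word_error_rate("patient has type two diabetes",
                      "patient has type to diabetes"))  # 0.2
```

Averaging this score over all recordings, or reporting it separately for English and Hindi, gives a single number for comparing speech-to-text systems.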

Medical Records Parsing Validation Set

  • A curated dataset of 288 de-identified lab reports and prescriptions designed to evaluate AI systems that extract structured data from unstructured medical documents.
  • Expert-annotated with rubric-based LLM evaluation, it ensures clinically accurate benchmarking across diverse Indian healthcare document formats.
  • AIKosh | HuggingFace

EkaCare Medical History Summarisation

  • A curated set of 58 real-world medical cases designed to evaluate AI systems in generating concise, clinically relevant summaries of patient histories.
  • Expert-defined rubrics and reference summaries ensure objective assessment of key developments, historical context, and critical care insights from the most recent six months of medical data.
  • AIKosh | HuggingFace

Clinical Note Generation Dataset

  • A multilingual dataset of 156+ transcribed doctor–patient conversations designed to evaluate AI systems that convert medical dialogues into structured, entity-level clinical records.
  • Expert-annotated ground truth JSON and rubric-based LLM evaluation ensure accurate benchmarking of structured note generation for EHR-ready medical documentation.
  • AIKosh | HuggingFace

Eka-IndicMTEB

  • A multilingual medical embedding benchmark with 2,532 doctor-verified queries across 8 Indic languages, aligned to SNOMED CT for concept-level evaluation.
  • Designed to benchmark and improve cross-lingual medical retrieval and semantic search systems across India’s diverse linguistic landscape.
  • AIKosh | HuggingFace
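
Because each query is aligned to a concept, a retrieval system can be scored by checking whether the correct concept appears among its top-ranked results. The sketch below computes recall@k under that assumption. The page does not state the benchmark's official metric, and the query IDs and concept IDs used here ("q1", "C1", and so on) are hypothetical placeholders, not real SNOMED CT codes.

```python
from typing import Dict, List


def recall_at_k(gold: Dict[str, str], ranked: Dict[str, List[str]], k: int) -> float:
    """Fraction of queries whose gold concept appears in the top-k results.

    gold:   query ID -> the single correct concept ID (assumed one per query)
    ranked: query ID -> concept IDs returned by the system, best first
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold:
        raise ValueError("gold must contain at least one query")
    hits = sum(
        1
        for query_id, concept_id in gold.items()
        # A query with no system output counts as a miss.
        if concept_id in ranked.get(query_id, [])[:k]
    )
    return hits / len(gold)


# Hypothetical placeholder IDs, for illustration only.
gold = {"q1": "C1", "q2": "C2", "q3": "C3"}
ranked = {
    "q1": ["C1", "C9"],  # correct at rank 1
    "q2": ["C7", "C2"],  # correct at rank 2
    "q3": ["C8", "C5"],  # correct concept not retrieved
}
print(recall_at_k(gold, ranked, k=2))
```

Computing the score per language, rather than only overall, would show whether a model's retrieval quality holds up across all 8 languages.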

NidaanKosh

A Comprehensive Laboratory Investigation Dataset of 100,000 Indian Subjects with 6.8 Million+ Readings
  • 100,000 Indian Subjects
  • 6.8 Million+ Laboratory Readings
  • Covers common biomarkers and laboratory values specific to Indian populations
  • AIKosh | HuggingFace

Spandan

A Large Photoplethysmography (PPG) Signal Dataset of 1 Million+ Indian Subjects
  • 1 Million+ Indian Subjects
  • Raw PPG signals captured from diverse demographic groups across India
  • Essential for developing accurate cardiovascular monitoring algorithms for Indian populations
  • AIKosh | HuggingFace

Why These Datasets Matter

Bridging the Data Gap

Most healthcare AI models developed today are trained on Western datasets, which may not accurately represent the characteristics of Indian patients. These datasets address this critical gap.

Enabling Homegrown Innovation

With access to high-quality Indian healthcare data, researchers and developers can build AI solutions tailored specifically to Indian healthcare challenges.

Advancing Healthcare Equity

By democratizing access to these datasets, we aim to support broader participation in healthcare AI development across India.

How to Use These Datasets

  1. Access: All datasets are hosted on the India AI AIKosh platform and are also listed on HuggingFace
  2. Documentation: Each dataset includes comprehensive documentation describing data structure, collection methodology, and suggested applications
  3. Community: Join our researcher community forum to connect with others using these datasets

Ethics & Privacy

All datasets have been rigorously anonymized following industry best practices and ethical guidelines. No personally identifiable information is included in any dataset.

Research Collaboration

We welcome collaboration opportunities with academic institutions, research organizations, and industry partners. If you’re using our datasets for your research:
  • Please cite our datasets in your publications
  • Share your findings with our research community
  • Consider opportunities for joint research initiatives

Contact our research team

These datasets are provided for research and development purposes. For terms of use and licensing information, please refer to the documentation included with each dataset.