Empowering Indian Healthcare AI with High-Quality Data
At Eka Care, we believe that advancing healthcare in India requires AI solutions built specifically for Indian populations. To support this vision, we are proud to share comprehensive, anonymized healthcare datasets with the research community.
Available Datasets
Eka-Medical-ASR Evaluation Dataset
- A curated collection of 3,900+ English and Hindi medical speech recordings designed to benchmark and improve speech-to-text performance for Indian healthcare use cases.
- Built with diverse accents and real clinical terminology, this dataset enables robust evaluation of AI-powered medical scribe and voice-enabled healthcare applications.
- AIKosh | HuggingFace
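Speech-to-text benchmarks of this kind are commonly scored with word error rate (WER). The sketch below is a minimal illustration and not Eka Care's evaluation code: the function name, the lowercase whitespace tokenization, and the sample sentences are all assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization for a demo.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word out of five.
print(word_error_rate("patient has type two diabetes",
                      "patient has type to diabetes"))  # 0.2
```

For Hindi or code-mixed recordings, whitespace tokenization may be too coarse, so a character-level variant of the same calculation is a common alternative.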
Medical Records Parsing Validation Set
- A curated dataset of 288 de-identified lab reports and prescriptions designed to evaluate AI systems that extract structured data from unstructured medical documents.
- Expert-annotated with rubric-based LLM evaluation, it ensures clinically accurate benchmarking across diverse Indian healthcare document formats.
- AIKosh | HuggingFace
EkaCare Medical History Summarisation
- A curated set of 58 real-world medical cases designed to evaluate AI systems in generating concise, clinically relevant summaries of patient histories.
- Expert-defined rubrics and reference summaries ensure objective assessment of key developments, historical context, and critical care insights from the most recent six months of medical data.
- AIKosh | HuggingFace
Clinical Note Generation Dataset
- A multilingual dataset of 156+ transcribed doctor–patient conversations designed to evaluate AI systems that convert medical dialogues into structured, entity-level clinical records.
- Expert-annotated ground truth JSON and rubric-based LLM evaluation ensure accurate benchmarking of structured note generation for EHR-ready medical documentation.
- AIKosh | HuggingFace
Eka-IndicMTEB
- A multilingual medical embedding benchmark with 2,532 doctor-verified queries across 8 Indic languages, aligned to SNOMED CT for concept-level evaluation.
- Designed to benchmark and improve cross-lingual medical retrieval and semantic search systems across India’s diverse linguistic landscape.
- AIKosh | HuggingFace
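Concept-level retrieval of this kind is often summarized with recall@k: the share of queries whose correct concept appears among the k nearest concept embeddings. The sketch below assumes embeddings are already computed as plain lists of floats. The function names, the toy vectors, and the placeholder concept identifiers are hypothetical and are not taken from the dataset or from SNOMED CT.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_k(query_vecs: List[Sequence[float]],
                gold_ids: List[str],
                concept_vecs: Dict[str, Sequence[float]],
                k: int) -> float:
    """Fraction of queries whose gold concept is among the top-k by cosine."""
    if not query_vecs or len(query_vecs) != len(gold_ids):
        raise ValueError("need one gold concept id per query, and at least one query")
    hits = 0
    for query, gold in zip(query_vecs, gold_ids):
        # Rank every concept id by similarity to this query, best first.
        ranked = sorted(concept_vecs,
                        key=lambda cid: cosine(query, concept_vecs[cid]),
                        reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy 2-dimensional embeddings with placeholder concept ids.
concepts = {"concept_a": [1.0, 0.0], "concept_b": [0.0, 1.0], "concept_c": [1.0, 1.0]}
queries = [[0.9, 0.1], [0.1, 0.9]]
print(recall_at_k(queries, ["concept_a", "concept_b"], concepts, k=1))  # 1.0
```

Scoring each language separately with the same function shows where cross-lingual retrieval degrades.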
NidaanKosh
A Comprehensive Laboratory Investigation Dataset of 100,000 Indian Subjects with 6.8 Million+ Readings
- 100,000 Indian Subjects
- 6.8 Million+ Laboratory Readings
- Covers common biomarkers and laboratory values specific to Indian populations
- AIKosh | HuggingFace
Spandan
A Large Photoplethysmography (PPG) Signal Dataset of 1 Million+ Indian Subjects
- 1 Million+ Indian Subjects
- Raw PPG signals captured from diverse demographic groups across India
- Essential for developing accurate cardiovascular monitoring algorithms for Indian populations
- AIKosh | HuggingFace
Why These Datasets Matter
Bridging the Data Gap
Most healthcare AI models developed today are trained on Western datasets, which may not accurately represent the characteristics of Indian patients. These datasets address this critical gap.
Enabling Homegrown Innovation
With access to high-quality Indian healthcare data, researchers and developers can build AI solutions tailored specifically to Indian healthcare challenges.
Advancing Healthcare Equity
By democratizing access to these datasets, we aim to support broader participation in healthcare AI development across India.
How to Use These Datasets
- Access: All datasets are hosted on the India AI AIKosh platform
- Documentation: Each dataset includes comprehensive documentation describing data structure, collection methodology, and suggested applications
- Community: Join our researcher community forum to connect with others using these datasets
Ethics & Privacy
All datasets have been rigorously anonymized following industry best practices and ethical guidelines. No personally identifiable information is included in any dataset.
Research Collaboration
We welcome collaboration opportunities with academic institutions, research organizations, and industry partners. If you’re using our datasets for your research:
- Please cite our datasets in your publications
- Share your findings with our research community
- Consider opportunities for joint research initiatives
These datasets are provided for research and development purposes. For terms of use and licensing information, please refer to the documentation included with each dataset.

