INTRODUCTION
The rapid advancement of large language models (LLMs) has notably impacted various fields, including education, business, scientific research, and healthcare. LLM-based generative artificial intelligence (AI) models, such as OpenAI’s ChatGPT, Google’s Gemini, and Meta’s Llama, utilize vast datasets and deep learning–based algorithms to generate human-like responses.1-3 These models have demonstrated remarkable capabilities in natural-language understanding, content generation, data analysis, and decision support, making them increasingly valuable in academic research, professional communication, customer service, and knowledge management.4-7 As AI technology continues to evolve, the potential of LLMs to enhance efficiency, improve access to information, and support complex problem solving across multiple domains is being increasingly explored.
Among the available generative AI models, ChatGPT has garnered considerable attention with regard to medical and rehabilitative applications. It has been widely employed for summarizing research papers, answering medical queries, and even providing preliminary diagnostic insights.8-10 In physical therapy, ChatGPT has proven its ability to deliver information regarding musculoskeletal disorders, rehabilitation techniques, and clinical guidelines, making it a valuable resource for clinicians, students, and educators. For example, a study analyzing ChatGPT’s responses regarding shoulder impingement syndrome found that ChatGPT could provide definitions, risk factors, symptoms, and treatment options, including rehabilitation exercises.11 However, the same study also highlighted ChatGPT’s tendency to present biased or potentially inaccurate medical information, reinforcing the need for human oversight. Similarly, in orthopedic education, ChatGPT has been employed to simplify patient education materials associated with rotator cuff injuries, improving accessibility while maintaining medical accuracy.12 In sports rehabilitation, ChatGPT has been integrated into patient support systems, allowing individuals to ask questions regarding treatment plans, exercise modifications, and recovery strategies while receiving real-time, personalized feedback on their rehabilitation progress.13 Furthermore, academic physical-therapy programs have begun leveraging ChatGPT to assist in curriculum development, streamline research documentation, and create case-based learning materials, highlighting its growing role in education and professional training.14 These applications underscore ChatGPT’s expanding role in physical therapy and rehabilitation and highlight the need for further validation to ensure clinical accuracy and alignment with evidence-based practices.
Despite the potential benefits of generative AI in healthcare, notable challenges remain. AI-generated responses can contain inaccurate information, hallucinated facts, and biases, raising concerns about clinical accuracy and trustworthiness.15 Although ChatGPT and similar LLMs process vast datasets, they do not always provide responses consistent with evidence-based medical literature and may struggle with specialized terminology and clinical reasoning, which are essential for effective patient care.16 Moreover, the high computational costs associated with LLMs pose sustainability concerns.17 Training and deploying these models require energy-intensive processes, leading to high operational costs and a negative environmental impact.18 The need for continuous real-time processing during the operation of medical applications further amplifies these challenges, making it crucial to develop highly efficient and sustainable AI solutions before their widespread adoption in healthcare.19
As AI-driven medical applications have become more prevalent, the demand for cost-effective and computationally efficient alternatives to proprietary models such as ChatGPT has increased. One such emerging model is DeepSeek, a mixture-of-experts (MoE) LLM designed to offer high-performance language processing at reduced computational cost.20,21 DeepSeek employs multi-head latent attention (MLA) and the DeepSeekMoE architecture, which improve inference efficiency and reduce overall training expenses. Unlike ChatGPT, which requires extensive computational resources, DeepSeek achieves competitive performance with only 2.788 million GPU hours of training, making it a more affordable option for AI-driven medical applications.21 DeepSeek has undergone supervised fine-tuning and reinforcement learning to enhance its consistency with human-like responses, thereby improving its applicability across diverse domains.21 However, despite these advancements, DeepSeek has not been rigorously validated for clinical use in physical therapy and rehabilitation sciences. Unlike specialized models explicitly designed for healthcare applications, DeepSeek has not been systematically assessed for the clinical accuracy, relevance, and reliability of its responses regarding musculoskeletal diagnosis and treatment. This lack of formal validation raises concerns about its suitability for providing evidence-based medical guidance in physical therapy. Given these uncertainties, a direct performance comparison with ChatGPT is necessary to determine whether DeepSeek can serve as a viable tool for medical and rehabilitative applications.
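For readers unfamiliar with the MoE approach, the following toy Python sketch illustrates the general idea of top-k expert routing that underlies architectures such as DeepSeekMoE: only a few experts are evaluated per token, which is what keeps inference cheaper than a dense model of comparable total size. The dimensions and routing rule are purely illustrative and do not reflect DeepSeek’s actual configuration.

```python
# Conceptual sketch of top-k mixture-of-experts routing (illustrative only;
# sizes and gating rule are not DeepSeek's actual configuration).
import numpy as np

rng = np.random.default_rng(0)
num_experts, hidden_dim, top_k = 8, 16, 2

token = rng.standard_normal(hidden_dim)                                # one token's hidden state
router = rng.standard_normal((num_experts, hidden_dim))                # router (gating) weights
experts = rng.standard_normal((num_experts, hidden_dim, hidden_dim))   # toy expert layers

logits = router @ token
chosen = np.argsort(logits)[-top_k:]                                   # route to the top-k experts only
weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()        # softmax over the chosen experts

# Only the chosen experts are evaluated; the rest are skipped entirely.
output = sum(w * (experts[i] @ token) for w, i in zip(weights, chosen))
print(f"evaluated {top_k}/{num_experts} experts; output shape {output.shape}")
```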
The current study aims to compare the performances of ChatGPT and DeepSeek in generating responses related to physical therapy and rehabilitation sciences. The research focuses on evaluating accuracy, clinical relevance, and readability to determine whether low-cost LLMs such as DeepSeek can be viable alternatives to more established AI models in medical education and clinical practice. By conducting a structured comparative analysis, this study addresses the strengths and limitations of these AI models in providing evidence-based physical-therapy knowledge. The findings will provide valuable insights to AI developers, healthcare professionals, and educators, guiding the integration of generative AI into medical training and patient care. Ultimately, this research will contribute to the growing discourse on the role of AI in healthcare and help shape future advancements in medical AI applications.
METHODS
This study was designed as a technical evaluation employing a comparative qualitative analysis of AI-generated responses from two generative language models, ChatGPT and DeepSeek, in line with established frameworks for systematically and context-sensitively appraising new technologies in healthcare.22,23 Figure 1 outlines the study process. Specifically, the study compared AI-generated responses from the two models in musculoskeletal rehabilitation, integrating domain-expert evaluations to assess each model’s depth, accuracy, and clinical relevance.
On January 31, 2025, AI-generated responses were collected from ChatGPT o1 and DeepSeek with the R1 functionality activated (Table 1). Both models were tested under identical conditions, without external fine-tuning or prompt modifications. This approach ensured comparison of the models’ baseline responses to medical questions related to musculoskeletal sciences and rehabilitation.
Six questions were selected based on their relevance to musculoskeletal sciences and rehabilitation. These questions, which covered musculoskeletal functions, movement impairments, clinical assessments, and postoperative management (Table 2), were derived from clinical scenarios and key biomechanical principles frequently encountered in physical-therapy practice.24 Each question was input into each model, after which the generated responses were collected without modification.
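As an illustration of how such a comparison could be reproduced programmatically, the following Python sketch submits the same single-turn prompts to both models without system prompts or parameter tuning. The study itself collected responses through each model’s standard interface; the model identifiers, the DeepSeek endpoint, and the placeholder question strings below are assumptions for illustration, not details taken from the study.

```python
# Illustrative sketch only: submit identical single-turn prompts to both models.
# Model names, the DeepSeek base URL, and the question placeholders are assumptions.
from openai import OpenAI

QUESTIONS = [
    "<question 1 from Table 2>",
    "<question 2 from Table 2>",
    # ... remaining questions from Table 2
]

def collect_responses(client: OpenAI, model: str) -> list[str]:
    """Send each question as a single user message and return the raw responses."""
    responses = []
    for question in QUESTIONS:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],  # no system prompt, no fine-tuning
        )
        responses.append(completion.choices[0].message.content)
    return responses

chatgpt_client = OpenAI()  # reads OPENAI_API_KEY from the environment
deepseek_client = OpenAI(api_key="<DEEPSEEK_API_KEY>", base_url="https://api.deepseek.com")

chatgpt_answers = collect_responses(chatgpt_client, "o1")                   # assumed identifier
deepseek_answers = collect_responses(deepseek_client, "deepseek-reasoner")  # assumed identifier
```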
Responses were analyzed based on the following six predefined criteria (Table 3): (1) accuracy, the extent to which the response aligned with established medical and biomechanical knowledge; (2) coherence, the logical structure and flow of information within the response; (3) fluency, the clarity and readability of the language used; (4) reasoning ability, the depth of biomechanical analysis and logical explanation; (5) justification, the presence of supporting details, evidence, or rationale for the response; and (6) medical suitability, the relevance of the response for clinical and educational use in rehabilitation and physical therapy. These evaluation criteria were established by integrating methodological and conceptual frameworks drawn from recent investigations into AI-driven medical assessment and content analysis.25-27
Qualitative analysis according to evaluation criteria was performed by a single evaluator with more than 5 years of clinical experience in musculoskeletal therapy and more than 10 years of research experience in that field. Each response generated by ChatGPT and DeepSeek was evaluated using a structured five-point rating scale across six predefined criteria. The analysis process involved identifying notable differences in the scope and detail of content coverage, assessing the depth of biomechanical reasoning and clinical relevance, comparing the clarity and logical flow of each explanation, and recording observations on each model’s strengths and limitations in addressing domain-specific inquiries.
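A minimal sketch of how such structured ratings could be recorded and summarized is shown below; the criterion names follow Table 3, while the example scores are placeholders rather than the study’s actual ratings (those are reported in Table 5).

```python
# Minimal sketch for recording five-point ratings across the six criteria
# and summarizing them per model. Scores shown here are placeholders.
from statistics import mean

CRITERIA = ["accuracy", "coherence", "fluency", "reasoning", "justification", "medical_suitability"]

def validate(rating: dict[str, int]) -> dict[str, int]:
    """Ensure every criterion is present and scored on the five-point scale."""
    assert set(rating) == set(CRITERIA), "all six criteria must be rated"
    assert all(1 <= score <= 5 for score in rating.values()), "scores must be 1-5"
    return rating

# ratings[model][question_id] -> {criterion: score}
ratings: dict[str, dict[int, dict[str, int]]] = {"ChatGPT": {}, "DeepSeek": {}}
ratings["ChatGPT"][1] = validate({c: 5 for c in CRITERIA})   # placeholder scores
ratings["DeepSeek"][1] = validate({c: 4 for c in CRITERIA})  # placeholder scores

for model, per_question in ratings.items():
    per_criterion = {c: mean(q[c] for q in per_question.values()) for c in CRITERIA}
    print(model, per_criterion)
```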
RESULTS
The responses generated by ChatGPT and DeepSeek to each question are comparatively summarized in Table 4, and the evaluation results for each criterion, rated on the five-point scale, are presented in Table 5. The full responses are provided in Supplementary File 1.
ChatGPT generated highly detailed and precise explanations, incorporating anatomical terminology, physiological principles, and clinical implications. It elaborated on complex biomechanical processes and included phase transitions, specific muscle activations, and pathological implications. Meanwhile, DeepSeek presented accurate and concise responses that summarized key concepts without comprehensive analysis. Although both models delivered factually correct information, the explanations of ChatGPT were more comprehensive than those of DeepSeek, whereas DeepSeek’s responses were optimized for quick understanding.
ChatGPT structured its responses in a hierarchical manner, progressing logically from basic definitions to clinical applications. Each section followed a clear sequence, enabling a smooth flow of information. Meanwhile, DeepSeek presented its responses as bulleted lists, which enhanced readability but occasionally resulted in fragmented information lacking connections between related concepts.
Both models exhibited high fluency in generating natural-sounding responses. However, the language used in ChatGPT’s responses resembled that used in academic or medical literature, rendering them more suitable for healthcare professionals and researchers. Meanwhile, DeepSeek used simpler language with a more direct communication style, making its content more accessible to nonspecialists, such as fitness professionals, patients, and general readers.
ChatGPT demonstrated strong reasoning ability, particularly in biomechanical explanations and clinical applications. It frequently expounded on cause–effect relations, compensatory mechanisms, and assessment methodologies, providing comprehensive responses regarding diagnostic and therapeutic approaches. Although DeepSeek was able to identify key elements of movement dysfunctions and rehabilitation approaches, it did not elaborate on clinical reasoning or the underlying biomechanical principles as extensively as ChatGPT.
ChatGPT consistently included strong justifications, referencing clinical assessments, rehabilitation protocols, and evidence-based treatment guidelines. It also provided rationale for each intervention and explained the biomechanical mechanisms behind specific conditions. Meanwhile, DeepSeek delivered short and practical answers but often lacked justification or supporting details for its statements; although it covered relevant information, it did not provide as much rationale as ChatGPT for clinical assessments or treatments.
ChatGPT’s responses, which contained detailed discussions, clinical assessments, and treatment strategies, were highly suitable for medical and rehabilitation professionals, catering to medical practitioners, physical therapists, and sports scientists who require detailed and research-backed explanations. However, DeepSeek’s responses were more suitable for general audiences, fitness trainers, and individuals seeking practical takeaways without excessive technical details; its focus on clarity and conciseness made it more suitable for nonmedical professionals.
DISCUSSION
The current study comparatively evaluated the performances of ChatGPT and DeepSeek in generating responses related to musculoskeletal sciences and rehabilitation. Overall, our findings showed that ChatGPT demonstrated superior ability to provide detailed, clinically relevant responses that incorporated anatomical terminology, physiological principles, and evidence-based rehabilitation strategies. Its responses were structured and offered hierarchical explanations that aligned with established medical education frameworks. In contrast, DeepSeek produced responses that were notably more concise and computationally efficient. This brevity allowed for quick retrieval of essential facts but often lacked depth in biomechanical explanations and justification of its recommendations based on underlying anatomical or physiological principles.
Prior research has extensively explored the application of AI in medical education, diagnostics, and rehabilitation, highlighting both the advantages and challenges of integrating LLMs into healthcare and professional training.28-30 Previous studies have demonstrated that LLMs can enhance clinical decision-making by rapidly synthesizing vast amounts of medical knowledge and providing structured, evidence-based responses.31,32 Compared to traditional clinical decision-support systems, which primarily function as rule-based algorithms, LLMs such as ChatGPT offer a more dynamic and context-sensitive approach by integrating multimodal data and providing explanations that go beyond rigid protocol adherence.33,34 However, despite these advantages, concerns regarding AI-generated hallucinations remain a notable limitation.35,36 Prior studies investigating AI reliability in clinical decision support have identified instances wherein models fabricate non-existent conditions, misinterpret medical guidelines, or provide inaccurate citations.35,36 Compared to rule-based expert systems, which strictly adhere to predefined medical guidelines, LLMs can occasionally generate plausible-sounding but incorrect information due to their probabilistic nature. This limitation underscores the need for human oversight in AI-assisted medical decision-making.
ChatGPT’s detailed explanations and biomechanical reasoning suggest its potential use in medical education and professional training. AI-driven educational tools enhance student engagement and understanding of complex physiological processes, particularly in musculoskeletal sciences.37,38 Unlike traditional educational tools, which often rely on static content delivery, ChatGPT enables interactive and adaptive learning experiences, allowing students to engage in case-based reasoning and receive context-sensitive explanations.39 This capability renders it particularly valuable in clinical training, where real-time feedback and exposure to diverse patient scenarios are crucial for skill development. Furthermore, as AI-driven models continue to evolve, their integration into medical curricula can complement existing pedagogical approaches, bridging the gap between theoretical knowledge and practical application.
From a resource efficiency perspective, however, the computational demands of ChatGPT raise concerns about its sustainability and scalability in healthcare settings. The energy demands of LLMs are quite substantial, with recent studies indicating that inference now surpasses training in energy consumption, which could have notable environmental impacts.40 For instance, serving a single ChatGPT prompt generates >4 g of CO2 equivalent emissions, corresponding to >20 times the carbon footprint of a typical web search.40 Moreover, depending on the GPU platform and batch size, LLM inference can exhibit notable trade-offs between latency, energy efficiency, and total carbon emissions, necessitating optimized AI infrastructure to mitigate environmental impact.40 In contrast, DeepSeek’s optimized computational efficiency could potentially be more energy-efficient, particularly for applications wherein concise, fact-based responses may be prioritized, such as telemedicine consultations, patient education, and preliminary assessments.20 DeepSeek is designed to leverage the MoE architecture and MLA strategies, which can considerably reduce inference costs while maintaining high performance across reasoning tasks.20 Moreover, DeepSeek’s reinforcement learning-driven models, such as DeepSeek-R1, have the potential to exhibit faster response times and lower hardware requirements, making them possible alternatives for resource-constrained environments, including rural healthcare settings and mobile health applications.21 Considering the characteristics of these language models, we believe that a combined approach integrating DeepSeek’s computational efficiency with ChatGPT’s advanced clinical reasoning capabilities may be necessary. Such a model could enable dynamic switching between highly detailed reasoning and rapid, low-power responses, potentially enhancing real-world applicability while mitigating the environmental impact of AI-driven healthcare solutions.
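As a rough illustration of the scale involved, the following back-of-the-envelope calculation uses only the per-prompt emissions figure cited earlier in this paragraph (>4 g CO2e per prompt, roughly 20 times a typical web search); the assumed daily prompt volume is hypothetical and serves purely to show how quickly inference emissions accumulate.

```python
# Back-of-the-envelope arithmetic using only the figures cited above (ref. 40).
# The daily prompt volume is a hypothetical assumption for illustration.
grams_per_prompt = 4.0                      # ~4 g CO2e per served prompt
grams_per_search = grams_per_prompt / 20    # implied ~0.2 g CO2e per web search
prompts_per_day = 1_000_000                 # hypothetical telehealth/clinic workload

daily_kg = grams_per_prompt * prompts_per_day / 1000
print(f"~{daily_kg:,.0f} kg CO2e per day for {prompts_per_day:,} prompts")
print(f"implied footprint of a single web search: {grams_per_search:.1f} g CO2e")
```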
This research has some limitations. First, the study was conducted by a single evaluator without the participation of multiple experts, and thus an objective reliability assessment such as inter-rater agreement could not be performed. In addition, a blind procedure to reduce bias was not implemented. The evaluation itself was qualitative, relying on expert analysis rather than quantitative metrics, which may introduce subjectivity. Future research should involve multiple evaluators and incorporate objective reliability measures to strengthen the validity of the findings. Additionally, this study assessed the AI models using relatively simple, information-based questions. To further explore AI’s potential in clinical decision-making, future studies should incorporate complex patient scenarios that reflect real-world clinical reasoning processes. Furthermore, this study focused primarily on physical therapy; expanding the dataset to include broader rehabilitation topics could provide a more comprehensive evaluation of AI performance across medical disciplines. Further studies should also explore the integration of AI feedback mechanisms to enhance response accuracy, allowing models to learn from clinician interactions and continuously refine their outputs. Addressing these limitations will enhance the reliability and real-world applicability of AI in rehabilitation sciences, ultimately improving patient care and medical education.
CONCLUSIONS
The current study compared the abilities of ChatGPT and DeepSeek to provide responses related to musculoskeletal sciences and rehabilitation. Overall, our results showed that ChatGPT provided detailed, structured, and clinically relevant explanations, emphasizing its utility for medical professionals and educators. In contrast, DeepSeek, although concise and computationally efficient, offered quick, easy-to-understand responses that lacked depth in biomechanical reasoning, highlighting its suitability for nonspecialists. ChatGPT excels in clinical reasoning, whereas DeepSeek is more efficient at rapid information retrieval. A hybrid approach combining ChatGPT’s depth with DeepSeek’s efficiency could balance accuracy with resource sustainability and optimize AI applications in healthcare.