ChatGPT For Programming Education: AI Models Vs Human Tutors

Benchmarking state-of-the-art AI models against human experts reveals surprising results: GPT-4 shows remarkable progress but still struggles with grading feedback and task creation.

Generative AI and large language models (LLMs) have the potential to revolutionize programming education, and a new research paper titled “Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors” explores their effectiveness. The study conducted a comprehensive evaluation of two state-of-the-art models, ChatGPT (based on GPT-3.5) and GPT-4, comparing their performance with that of human tutors across various scenarios in programming education.

Tutoring: AI vs. Humans

The research paper highlights that previous studies in this area were limited in scope, either using outdated models or focusing on specific scenarios. Thus, there was a need for a systematic study to benchmark the performance of state-of-the-art models in a wide range of programming education scenarios.


In response to this need, the authors systematically evaluated two models, ChatGPT (based on GPT-3.5) and GPT-4, comparing their performance with that of human tutors across six programming education scenarios: program repair, hint generation, grading feedback, pair programming, contextualized explanation, and task creation.

Evaluation Setup

To evaluate the models’ performance, the study used expert-based annotations and focused on five introductory Python programming problems. Buggy programs from an online platform were utilized to simulate real-world scenarios and capture different types of bugs. The evaluation results revealed that GPT-4 significantly outperformed ChatGPT (based on GPT-3.5) and came close to the performance of human tutors in several scenarios.

However, the study also identified certain scenarios where GPT-4 struggled, particularly in grading feedback and task creation. In these scenarios, the performance of GPT-4 was considerably lower compared to human tutors. These findings highlight the need for further research and development to enhance the performance of these models in challenging scenarios.

The study’s evaluation setup included prompts provided to the models for interaction, as well as outputs generated for evaluation. The models were compared against human tutors, who were experienced in Python programming and teaching introductory programming classes. The evaluation process involved 25 instances, each comprising a problem and program, for a total of 125 instances across all scenarios.
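The paper specifies the exact prompts used for each scenario; as a purely illustrative sketch (the template and helper function below are hypothetical, not the paper's actual prompts), assembling a program-repair query for one (problem, program) evaluation instance might look like this:

```python
# Illustrative sketch only: this template and helper are hypothetical
# and do NOT reproduce the prompts used in the paper.

REPAIR_TEMPLATE = """You are a tutor for an introductory Python course.

Problem description:
{problem}

A student submitted this buggy program:
{program}

Provide a fixed version of the program, changing as little as possible."""

def build_repair_prompt(problem: str, buggy_program: str) -> str:
    """Fill the template for one (problem, program) evaluation instance."""
    return REPAIR_TEMPLATE.format(problem=problem, program=buggy_program)

# Example instance: a classic off-by-one bug in summing 1..n.
problem = "Write sum_to_n(n) that returns the sum of integers from 1 to n."
buggy = "def sum_to_n(n):\n    return sum(range(n))  # bug: excludes n"

prompt = build_repair_prompt(problem, buggy)
print(prompt.splitlines()[0])
```

The model's output for each such instance would then be annotated by the expert tutors against the study's quality metrics.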

Results & Future

The results of the study demonstrated that both GPT-4 and ChatGPT achieved high performance, with GPT-4 performing better overall. However, when compared to human tutors, GPT-4 still had room for improvement. The study provided aggregated results for various metrics, highlighting the performance gaps between the models and human tutors across different problems.

While the study offers valuable insights into the capabilities of generative AI models for programming education, it also acknowledges certain limitations. The evaluation involved a small number of human tutors, and future studies should aim to include a larger pool of experts. Additionally, the study focused exclusively on introductory Python programming, leaving room for similar research in other programming languages and domains. Exploring multilingual settings and including student-based assessments are also potential avenues for future work.

Learner is the Winner 

In conclusion, the benchmarking study showcased the promising potential of generative AI and large language models in programming education. GPT-4 outperformed ChatGPT (based on GPT-3.5) and approached the performance of human tutors in various scenarios. However, there are still areas where improvements can be made, particularly in grading feedback and task creation. Future work should address these limitations and explore additional avenues for research and development in this field.

Reference: https://doi.org/10.48550/arXiv.2306.17156
