ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring
Date: March 2024
Kim, Y., Mozer, R., Miratrix, L. W., & Al-Adeimi, S. (2024, March)
This study investigates the efficacy and accuracy of Large Language Models (LLMs) compared with tree-based Machine Learning (ML) algorithms for grading student essays. Using essays from the Catalyzing Comprehension through Discussion and Debate project, it evaluates both a categorical outcome (opinion classification) and a continuous outcome (quality assessment) against a human-scored standard. LLMs with varied prompts and levels of fine-tuning are compared with ML algorithms trained on an extensive set of statistical text features. The findings, based on RMSE, R-squared, accuracy, and kappa metrics, indicate that LLMs, specifically ChatGPT 3.5-Turbo, excel at categorizing essay opinions but struggle with continuous grading, even when fine-tuned. In contrast, the ML models, though less adept at language understanding, outperform the LLMs on quality assessment. These results highlight each approach's strengths and limitations in educational settings and offer insights for improving automated evaluation of student writing. The work is relevant to educators, researchers, and policymakers seeking to use technology for more efficient and accurate essay grading.
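For readers unfamiliar with the four evaluation metrics named above, the sketch below shows one conventional way to compute them when comparing model scores against a human-scored standard. This is not the authors' code; it assumes scikit-learn is available, and the arrays are purely illustrative placeholders for human and model scores.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error, r2_score, accuracy_score, cohen_kappa_score,
)

# Hypothetical human gold-standard scores and model predictions.
human_quality = np.array([3.0, 4.0, 2.0, 5.0, 3.0])  # continuous quality ratings
model_quality = np.array([2.5, 4.2, 2.8, 4.6, 3.1])  # e.g., LLM or ML output

human_opinion = np.array([0, 1, 1, 0, 1])  # categorical opinion labels
model_opinion = np.array([0, 1, 0, 0, 1])

# Continuous outcome: RMSE and R-squared.
rmse = np.sqrt(mean_squared_error(human_quality, model_quality))
r2 = r2_score(human_quality, model_quality)

# Categorical outcome: accuracy and Cohen's kappa
# (kappa corrects raw accuracy for chance agreement).
acc = accuracy_score(human_opinion, model_opinion)
kappa = cohen_kappa_score(human_opinion, model_opinion)

print(f"RMSE={rmse:.3f}  R2={r2:.3f}  accuracy={acc:.2f}  kappa={kappa:.2f}")
```

Lower RMSE and higher R-squared indicate better agreement on the continuous quality outcome; higher accuracy and kappa indicate better agreement on the categorical opinion outcome.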