ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring


Kim, Y., Mozer, R., Miratrix, L. W., Al-Adeimi, S. (2024, August)

We compare the efficacy and accuracy of Large Language Models (LLMs) with those of tree-based Machine Learning (ML) algorithms in the context of student essay grading. Using essays on iPad usage in schools and censorship in libraries, we evaluate both essay opinion (categorical) and writing quality (continuous) against a human-scored standard. LLMs, with various prompting and fine-tuning techniques, are compared to ML algorithms trained on extensive statistical features. Our findings show that while LLMs have potential, particularly in essay classification, tree-based ML methods currently offer superior accuracy in assessing writing quality, irrespective of the types of outcomes. In contrast, most ChatGPT versions outperformed the majority of ML methods in essay classification, indicating their strength in detecting the presence of certain aspects. Our findings demonstrate the importance of prompting and fine-tuning techniques and highlight the strengths and limitations of LLM and ML approaches in automated essay scoring. This study is useful for educators, researchers, and policymakers looking to use technology for more efficient and accurate essay grading.