Evaluating ChatGPT4o as a teacher assessment aid: Reliability and feedback in EFL speaking classrooms
Abstract
This study examines the potential of ChatGPT4o as a teacher assessment aid in EFL speaking classrooms, focusing on its reliability in holistic and analytic scoring and on the usefulness of its qualitative feedback. Thirty EFL speech samples from Chinese university students were evaluated by both ChatGPT4o and four experienced EFL speaking teachers. The results showed that while ChatGPT4o consistently provided detailed feedback across the domains of accuracy, fluency, and complexity, its scoring reliability was slightly lower than that of the human assessors. The teachers expressed a range of views on adopting ChatGPT4o, acknowledging its ability to deliver comprehensive feedback but also raising concerns about its accessibility, cost, and lack of contextual understanding. Despite these limitations, the study suggests that ChatGPT4o can be a valuable complementary tool for EFL speaking assessment, particularly for enhancing feedback delivery and scoring consistency. However, it is recommended that ChatGPT4o be used alongside human judgment to ensure a balanced and effective assessment approach. These findings have important implications for EFL teachers considering the integration of AI tools into their classroom practices, emphasizing the need for careful implementation and attention to local contextual factors.
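The abstract reports scoring reliability but does not specify how it was estimated. In rater-reliability research on speaking assessment, one standard approach is generalizability theory applied to a fully crossed persons × raters design. The Python sketch below is a hypothetical illustration under that assumption, not the study's actual analysis: the function name g_study, the simulated 30 × 5 score matrix (four teachers plus ChatGPT4o treated as a fifth rater), and the 1-9 scale are all invented for the example.

```python
import numpy as np

def g_study(scores: np.ndarray):
    """Variance components and G coefficients for a fully crossed
    persons x raters design with one score per cell.

    scores: 2-D array, rows = persons (speech samples), columns = raters.
    """
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition.
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Variance components (negative estimates truncated at zero).
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # persons (true-score variance)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)  # raters (severity differences)
    var_pr = ms_pr                          # person x rater interaction + error

    # Relative coefficient (rank-ordering decisions) and absolute
    # coefficient (phi; criterion-referenced decisions) for n_r raters.
    g_rel = var_p / (var_p + var_pr / n_r)
    g_abs = var_p / (var_p + (var_r + var_pr) / n_r)
    return g_rel, g_abs

# Hypothetical data: 30 speech samples scored 1-9 by five raters
# (four teachers plus ChatGPT4o as a fifth rater).
rng = np.random.default_rng(0)
ability = rng.normal(6.0, 1.0, size=(30, 1))
scores = np.clip(np.round(ability + rng.normal(0.0, 0.7, size=(30, 5))), 1, 9)

g_rel, g_abs = g_study(scores)
print(f"relative G = {g_rel:.2f}, absolute G (phi) = {g_abs:.2f}")
```

Recomputing the coefficients with the AI rater's column dropped (e.g. g_study(scores[:, :4])) shows how much the fifth rater raises or lowers overall dependability, which is the kind of human-versus-AI comparison the abstract summarizes.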