Evaluating ChatGPT4o as a teacher assessment aid: Reliability and feedback in EFL speaking classrooms
Abstract
This study examines the potential of ChatGPT4o as a teacher assessment aid in EFL speaking classrooms, focusing on its reliability in holistic and analytic scoring and on the usefulness of its qualitative feedback. Thirty EFL speech samples from Chinese university students were evaluated by both ChatGPT4o and four experienced EFL speaking teachers. The results showed that while ChatGPT4o consistently provided detailed feedback across the domains of accuracy, fluency, and complexity, its scoring reliability was slightly lower than that of the human assessors. The teachers expressed a range of views on adopting ChatGPT4o, acknowledging its ability to deliver comprehensive feedback but also raising concerns about its accessibility, cost, and lack of contextual understanding. Despite these limitations, the study suggests that ChatGPT4o can be a valuable complementary tool for EFL speaking assessment, particularly for enhancing feedback delivery and scoring consistency. However, it is recommended that ChatGPT4o be used alongside human judgment to ensure a balanced and effective assessment approach. These findings have important implications for EFL teachers considering the integration of AI tools into their classroom practices, emphasizing the need for careful implementation and attention to local contextual factors.
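The abstract reports scoring reliability but does not specify how it was estimated. In rater-reliability research on speaking assessment, one standard approach is generalizability theory applied to a fully crossed persons × raters design. The Python sketch below is a hypothetical illustration under that assumption, not the study's actual analysis: the function name g_study, the simulated 30 × 5 score matrix (four teachers plus ChatGPT4o treated as a fifth rater), and the 1-9 scale are all invented for the example.

```python
import numpy as np

def g_study(scores: np.ndarray):
    """Variance components and G coefficients for a fully crossed
    persons x raters design with one score per cell.

    scores: 2-D array, rows = persons (speech samples), columns = raters.
    """
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition.
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Variance components (negative estimates truncated at zero).
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # persons (true-score variance)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)  # raters (severity differences)
    var_pr = ms_pr                          # person x rater interaction + error

    # Relative coefficient (rank-ordering decisions) and absolute
    # coefficient (phi; criterion-referenced decisions) for n_r raters.
    g_rel = var_p / (var_p + var_pr / n_r)
    g_abs = var_p / (var_p + (var_r + var_pr) / n_r)
    return g_rel, g_abs

# Hypothetical data: 30 speech samples scored 1-9 by five raters
# (four teachers plus ChatGPT4o as a fifth rater).
rng = np.random.default_rng(0)
ability = rng.normal(6.0, 1.0, size=(30, 1))
scores = np.clip(np.round(ability + rng.normal(0.0, 0.7, size=(30, 5))), 1, 9)

g_rel, g_abs = g_study(scores)
print(f"relative G = {g_rel:.2f}, absolute G (phi) = {g_abs:.2f}")
```

Recomputing the coefficients with the AI rater's column dropped (e.g. g_study(scores[:, :4])) shows how much the fifth rater raises or lowers overall dependability, which is the kind of human-versus-AI comparison the abstract summarizes.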