Evaluating the reliability and validity of ChatGPT-5 in assessing CFL and NC students’ Chinese writing: A preliminary study


Rong Wang
Jinyan Huang
Ziying Qin
Yuehan Wang

Abstract

This study examined the reliability and validity of OpenAI’s ChatGPT-5 for evaluating students’ Chinese writing in Chinese-as-a-foreign-language (CFL) and native Chinese (NC) contexts, using generalizability (G-) theory. The data consisted of 36 HSK-6 writing samples, all scored by ChatGPT-5. G-study results showed that persons accounted for 93.54% of the score variance for the CFL group and 80.63% for the NC group, indicating that ChatGPT-5 effectively distinguished between levels of writing proficiency. D-study results showed that ChatGPT-5 achieved high reliability with a single rating (G-coefficient = .95 for CFL; .81 for NC), exceeding the .80 benchmark for both student groups. In addition, ChatGPT-5 generated substantial qualitative feedback, especially in the language domain, which is particularly useful for CFL learners, although no significant differences were found for feedback on content or organization. Overall, ChatGPT-5 demonstrates sufficient reliability and validity as a supplementary assessment tool whose consistent scoring and relevant feedback can support students’ writing development, and it can serve as a valid tool for classroom-based assessment of CFL writing.
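For readers less familiar with generalizability theory, the short Python sketch below shows how a D-study generalizability coefficient for a single rater follows from the variance components in a simple person-by-rater (p × r) design. The variance components used here are hypothetical illustrations, not the study’s GENOVA estimates; they are chosen only to mirror the pattern reported above, where persons account for most of the score variance.

# Illustrative D-study computation for a person-by-rater (p x r) design.
# The variance components below are hypothetical, not the study's GENOVA output.

def g_coefficient(var_person: float, var_residual: float, n_raters: int = 1) -> float:
    """Relative G-coefficient: universe-score variance over itself plus relative error."""
    return var_person / (var_person + var_residual / n_raters)

# Hypothetical components in which persons dominate the total variance,
# mirroring the >80% person variance reported in the abstract.
var_person = 0.90            # universe-score (person) variance
var_person_by_rater = 0.10   # person-by-rater interaction / residual variance

for n in (1, 2):
    print(f"n_raters = {n}: G = {g_coefficient(var_person, var_person_by_rater, n):.2f}")

Under these illustrative numbers, a single rating already yields a coefficient of .90, which shows why a dominant person facet allows single-rater D-study coefficients to exceed the conventional .80 benchmark, as reported above.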



