Does AI peer assessment work in L2 writing? Investigating evidence from CFL classrooms
Abstract
This study evaluated whether ChatGPT-4o and DeepSeek-R1 can enhance peer assessment in Chinese-as-a-foreign-language (CFL) writing classrooms, comparing the scoring reliability and the actionability of their feedback with those of human peer assessors. Thirty-two CFL learners from six universities in China each condensed a 1000-word narrative into a 400-word essay, mirroring HSK-6 writing assessment conditions. Ten peer assessors, ChatGPT-4o, and DeepSeek-R1 evaluated the writing samples. A univariate generalizability theory (G-theory) analysis of score variability and reliability showed that ChatGPT-4o and DeepSeek-R1 produced highly reliable and consistent scores, outperforming the human peer assessors. Qualitative analysis of the feedback showed that both AI tools offered more detailed and actionable feedback than human peers. These findings suggest that AI tools such as ChatGPT-4o and DeepSeek-R1 can enhance assessment in CFL writing classrooms by providing reliable scores and actionable feedback. Implications for CFL learners and their Chinese instructors are discussed.
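For readers unfamiliar with G-theory, the sketch below illustrates how variance components and reliability coefficients are typically estimated in the simplest case, a fully crossed persons-by-raters design. It is a minimal illustration under stated assumptions, not the study's actual analysis: the abstract does not specify the design or software used, the function names `variance_components` and `g_coefficients` are hypothetical, and the score matrix is invented toy data.

```python
from statistics import mean


def variance_components(scores):
    """Estimate variance components for a crossed persons x raters design.

    scores[p][r] is the score that rater r gave to person (essay) p.
    Assumes every rater scored every person (no missing cells).
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(scores[p][r] for p in range(n_p)) for r in range(n_r)]

    # Mean squares from the two-way ANOVA without replication.
    ms_p = n_r * sum((m - grand) ** 2 for m in row_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in col_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - row_means[p] - col_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    return {
        "person": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


def g_coefficients(vc, n_raters):
    """Return (relative G coefficient, absolute Phi coefficient)."""
    rel_error = vc["residual"] / n_raters
    abs_error = (vc["rater"] + vc["residual"]) / n_raters

    def ratio(error):
        total = vc["person"] + error
        return vc["person"] / total if total > 0 else float("nan")

    return ratio(rel_error), ratio(abs_error)


if __name__ == "__main__":
    # Invented toy data: 3 essays (rows) scored by 2 raters (columns).
    toy_scores = [[1, 2], [3, 4], [5, 6]]
    vc = variance_components(toy_scores)
    g, phi = g_coefficients(vc, n_raters=2)
    print(vc, g, phi)
```

In this framing, a larger person variance component relative to the rater and residual components means that score differences mostly reflect differences among writers rather than among raters, which is what a high G coefficient summarizes.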
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.