The Effectiveness of the Mantel-Haenszel Log Odds Ratio Method in Detecting Differential Item Functioning Across Different Sample Sizes and Test Lengths Using Real Data Analysis
DOI:
https://doi.org/10.35516/edu.v51i3.6755

Keywords:
Mantel-Haenszel, Log Odds Ratio, DIF, Real Data, PISA test, Tenth-grade

Abstract
Objectives: This study aims to determine the effectiveness of the Mantel-Haenszel Log Odds Ratio method in detecting Differential Item Functioning (DIF) across gender, while considering variations in sample size and test length. Using real data, the study draws on a sample of tenth-grade students in Jordan who participated in the 2018 PISA International Mathematics Test.
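For context, the statistic evaluated in this study is the Mantel-Haenszel common odds ratio and its logarithm. The formulation below follows Holland and Thayer (1986) and Dorans and Holland (1992), both cited in the reference list; the notation is the standard one from that literature, not taken from this article:

$$\hat{\alpha}_{MH} = \frac{\sum_{k} A_k D_k / N_k}{\sum_{k} B_k C_k / N_k}, \qquad \hat{\beta}_{MH} = \ln \hat{\alpha}_{MH},$$

where, at each level $k$ of the matching total score, $A_k$ and $B_k$ count reference-group examinees who answered the studied item correctly and incorrectly, $C_k$ and $D_k$ are the corresponding focal-group counts, and $N_k = A_k + B_k + C_k + D_k$. Under no DIF, $\hat{\beta}_{MH} \approx 0$; ETS practice rescales the statistic to $\Delta_{MH} = -2.35\,\hat{\beta}_{MH}$ for flagging items.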
Methods: The study employed an experimental methodology with three levels of sample size (342, 200, and 100) and three levels of test length (30, 20, and 10 items). The DDFS program was run nine times, once for each of the nine scenarios formed by crossing the sample-size and test-length levels.
Results: The results indicate that variations in sample size and test length significantly affect the Mantel-Haenszel (MH) method. Specifically, the MH method's ability to detect DIF items improved as sample size increased while test length was held constant. Conversely, its efficacy declined as test length increased while sample size was held fixed.
Conclusion: The study recommends a large sample size and a short test length for effective detection of DIF items with the MH method.
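To make the statistic concrete, here is a minimal Python sketch that computes the MH log odds ratio for a single dichotomous item. It is an expository re-implementation under standard assumptions, not the DDFS program used in the study; the function and variable names (mh_log_odds_ratio, item, group, matching) are illustrative.

import numpy as np

def mh_log_odds_ratio(item, group, matching):
    """Mantel-Haenszel log odds ratio (beta_MH) for one 0/1-scored item.

    item     : array of 0/1 responses to the studied item
    group    : array with 1 = reference group, 0 = focal group
    matching : matching variable, typically the total (or rest) test score
    """
    num = 0.0  # running sum over score strata of A_k * D_k / N_k
    den = 0.0  # running sum over score strata of B_k * C_k / N_k
    for k in np.unique(matching):
        s = matching == k
        a = np.sum((group[s] == 1) & (item[s] == 1))  # reference, correct
        b = np.sum((group[s] == 1) & (item[s] == 0))  # reference, incorrect
        c = np.sum((group[s] == 0) & (item[s] == 1))  # focal, correct
        d = np.sum((group[s] == 0) & (item[s] == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    if num == 0 or den == 0:
        return float("nan")  # degenerate strata: estimate undefined
    return float(np.log(num / den))  # 0 = no DIF; sign shows favored group

In ETS practice the result is rescaled to Delta_MH = -2.35 * beta_MH, and items are classified as showing negligible, moderate, or large DIF at |Delta_MH| thresholds of roughly 1 and 1.5; those conventions come from the cited literature (Holland & Thayer, 1986; Dorans & Holland, 1992), not from this abstract.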
References
Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67-91. https://doi.org/10.1111/j.1745-3984.1992.tb00368.x
Alomari, H., Akour, M. M., & Al Ajlouni, J. (2023). The effect of sample size on differential item functioning and differential distractor functioning in multiple-choice items. Psychology Hub, 40(2), 17–24. https://doi.org/10.13133/2724-2943/17992
Arıkan, Ç., Uğurlu, S., & Atar, B. (2016). A DIF and bias study by using MIMIC, SIBTEST, Logistic Regression, and Mantel-Haenszel methods. Hacettepe University Journal of Education, 31(1), 34-52. https://doi.org/10.16986/HUJE.2015014226
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items (Vol. 4). SAGE Publications.
Dorans, N. J., & Holland, P. W. (1992). DIF detection and description: Mantel-Haenszel and standardization. ETS Research Report Series, 1992(1), i-40. https://doi.org/10.1002/j.2333-8504.1992.tb01440.x
Eom, M. (2008). Underlying factors of MELAB listening construct. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6, 77–94.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278-295. https://doi.org/10.1177/0146621605275728
Gao, X. (2019). A comparison of six DIF detection methods [Unpublished master's thesis]. University of Connecticut Graduate School. https://digitalcommons.lib.uconn.edu/gs_theses/1411
Gu, K. (2023). Washback effects of IELTS test on teachers' adoption of teaching materials in the classroom in China. International Journal on Social & Education Sciences (IJonSES), 5(2). https://doi.org/10.46328/ijonses.513
Holland, P. W., & Thayer, D. T. (1986). Differential item functioning and the Mantel-Haenszel procedure. ETS Research Report Series, 1986(2), i-24. https://doi.org/10.1002/j.2330-8516.1986.tb00186.x
Ihlenfeldt, S. D., & Rios, J. A. (2023). A meta-analysis on the predictive validity of English language proficiency assessments for college admissions. Language Testing, 40(2), 276-299. https://doi.org/10.1177/02655322221112364
Kabasakal, K. A., Arsan, N., Gök, B., & Kelecioğlu, H. (2014). Comparing performances (Type I error and power) of IRT likelihood ratio, SIBTEST, and Mantel-Haenszel methods in the determination of differential item functioning. Educational Sciences: Theory & Practice, 14(6), 2186-2193. https://doi.org/10.12738/estp.2014.6.2165
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4), 719-748. https://doi.org/10.1093/jnci/22.4.719
Marôco, J. (2021). Portugal: The PISA Effects on Education. In: Crato, N. (eds) Improving a Country’s Education. Springer, Cham. https://doi.org/10.1007/978-3-030-59031-4_8
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13(2), 127-143.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297-334. https://doi.org/10.1177/014662169301700401
Münch, R., & Wieczorek, O. (2023). Improving schooling through effective governance? The United States, Canada, South Korea, and Singapore in the struggle for PISA scores. Comparative Education, 59(1), 59-76. https://doi.org/10.1080/03050068.2022.2138176
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20(3), 257-274. https://doi.org/10.1177/014662169602000306
Park, G. (2008). Differential Item Functioning on an English Listening Test across Gender. TESOL Quarterly, 42(1), 115-123.
Penfield, R. D. (2010). DDFS: Differential distractor functioning software. Applied Psychological Measurement, 34(8), 646-647. https://doi.org/10.1177/0146621610375690
Penfield, R. D., & Camilli, G. (2006). Differential item functioning and item bias. In Handbook of Statistics (Vol. 26, pp. 125-167). https://doi.org/10.1016/S0169-7161(06)26005-X
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x
Taylor, C. S., & Lee, Y. (2012). Gender DIF in reading and mathematics tests with mixed item formats. Applied Measurement in Education, 25(3), 246-280. https://doi.org/10.1080/08957347.2012.687650
Aryadoust, V., Goh, C. C. M., & Kim, L. O. (2011). An investigation of differential item functioning in the MELAB listening test. Language Assessment Quarterly, 8(4), 361–385. https://doi.org/10.1080/15434303.2011.628632
Wagner, E. (2004). A construct validation study of the extended listening sections of the ECPE and MELAB. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 2, 1–23.
Wall, D., & Horák, T. (2006). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 1, the baseline study. ETS Research Report Series, 2006(1), i-199. https://doi.org/10.1002/j.2333-8504.2006.tb02024.x
Wall, D., & Horák, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, coping with change. ETS Research Report Series, 2008(2), i-105. https://doi.org/10.1002/j.2333-8504.2008.tb02123.x
Williams, S. (1997). The unbiased anchor: Bridging the gap between DIF and item bias. Applied Measurement in Education, 10(3), 253-267. https://doi.org/10.1207/s15324818ame1003_4
License
Copyright (c) 2024 Dirasat: Educational Sciences

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Accepted 2024-05-30
Published 2024-09-15
