11 References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Bradshaw, L. (2016). Diagnostic classification models. In A. A. Rupp & J. Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applications (1st ed., pp. 297–327). John Wiley & Sons. https://doi.org/10.1002/9781118956588.ch13

Camilli, G., & Shepard, L. A. (1994). Method for identifying biased test items (4th ed.). SAGE Publications, Inc.

Chapelle, C. A. (2021). Argument-Based Validation in Testing and Assessment. SAGE Publications, Inc.

Chen, J., Torre, J. de la, & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123–140. https://doi.org/10.1111/j.1745-3984.2012.00185.x

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551–558. https://doi.org/10.1016/0895-4356(90)90159-M

Clark, A. K., Karvonen, M., Swinburne Romine, R., & Kingston, N. (2018). Teacher use of score reports for instructional decision-making: Preliminary findings. National Council on Measurement in Education Annual Meeting, New York, NY. https://dynamiclearningmaps.org/sites/default/files/documents/presentations/NCME_2018_Score_Report_Use_Findings.pdf

Clark, A. K., Kobrin, J., & Hirt, A. (2022). Educator perspectives on instructionally embedded assessment (Research Synopsis No. 22-01). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/IE_Focus_Groups_project_brief.pdf

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. https://doi.org/10.1037//0033-2909.112.1.155

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555

Dynamic Learning Maps Consortium. (2022). 2021–2022 Technical Manual—Year-End Model. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.

Dynamic Learning Maps Consortium. (2023a). 2022–2023 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.

Dynamic Learning Maps Consortium. (2023b). Accessibility Manual 2022–2023. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.

Dynamic Learning Maps Consortium. (2023c). Educator Portal User Guide. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.

Dynamic Learning Maps Consortium. (2023d). Test Administration Manual 2022–2023. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.

Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43, 543–549. https://doi.org/10.1016/0895-4356(90)90158-L

Henson, R., & Douglas, J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29(4), 262–277. https://doi.org/10.1177/0146621604272623

Henson, R., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210. https://doi.org/10.1007/s11336-008-9089-5

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power raters using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329–349. https://doi.org/10.1207/S15324818AME1404_2

Johnson, M. S., & Sinharay, S. (2018). Measures of agreement to assess attribute-level classification accuracy and consistency for cognitive diagnostic assessments. Journal of Educational Measurement, 55(4), 635–664. https://doi.org/10.1111/jedm.12196

Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000

Karvonen, M., Wakeman, S. Y., Browder, D. M., Rogers, M. A. S., & Flowers, C. (2011). Academic curriculum for students with significant cognitive disabilities: Special education teacher perspectives a decade after IDEA 1997 [Research Report]. National Alternate Assessment Center. https://files.eric.ed.gov/fulltext/ED521407.pdf

Kobrin, J., Clark, A. K., & Kavitsky, E. (2022). Exploring educator perspectives on potential accessibility gaps in the Dynamic Learning Maps alternate assessment (Research Synopsis No. 22-02). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/Accessibility_Focus_Groups_project_brief.pdf

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310

O’Leary, S., Lund, M., Ytre-Hauge, T. J., Holm, S. R., Naess, K., Dalland, L. N., & McPhail, S. M. (2014). Pitfalls in the use of kappa when interpreting agreement between multiple raters in reliability studies. Physiotherapy, 100, 27–35. https://doi.org/10.1016/j.physio.2013.08.002

Pontius, R. G., Jr., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32(15), 4407–4429. https://doi.org/10.1080/01431161.2011.552923

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x

Templin, J., & Bradshaw, L. (2013). Measuring the reliability of diagnostic classification model examinee estimates. Journal of Classification, 30(2), 251–275. https://doi.org/10.1007/s00357-013-9129-4

Thompson, W. J., Clark, A. K., & Nash, B. (2019). Measuring the reliability of diagnostic mastery classifications at multiple levels of reporting. Applied Measurement in Education, 32(4), 298–309. https://doi.org/10.1080/08957347.2019.1660345

Thompson, W. J., Nash, B., Clark, A. K., & Hoover, J. C. (2023). Using simulated retests to estimate the reliability of diagnostic assessment systems. Journal of Educational Measurement, 60(3), 455–475. https://doi.org/10.1111/jedm.12359

Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF [Working Paper]. University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science.