3 Assessment Design and Development

Chapter 3 of the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022) describes assessment design and development procedures. This chapter provides an overview of updates to assessment design and development for 2022–2023. The chapter first describes the design of English language arts (ELA) reading and writing testlets, as well as mathematics testlets. The chapter then provides an overview of 2022–2023 item writers’ characteristics, followed by the 2022–2023 development of ELA texts and the 2022–2023 external review of items, testlets, and texts based on criteria for content, bias, and accessibility. The chapter concludes by presenting evidence of item quality, including a summary of field-test data analysis and associated reviews, a summary of the pool of operational testlets available for administration, and an evaluation of differential item functioning (DIF).

3.1 Assessment Structure

The DLM Alternate Assessment System uses learning maps, which are highly connected representations of how academic skills are acquired as reflected in the research literature, as the basis for assessment. Nodes in the maps represent specific knowledge, skills, and understandings in ELA and mathematics, as well as important foundational skills that provide an understructure for academic skills. The maps go beyond traditional learning progressions to include multiple pathways by which students develop content knowledge and skills.

To organize the highly complex learning maps, four broad claims were developed for ELA and mathematics and then subdivided into nine conceptual areas. For a complete description, see Chapter 2 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022). Claims are overt statements of what students are expected to learn and be able to demonstrate as a result of mastering skills within a very large neighborhood of the map. Conceptual areas are nested within claims and comprise multiple conceptually related content standards and the nodes that support and extend beyond the standards. The claims and conceptual areas apply to all grades in the DLM system.

Essential Elements (EEs) are specific statements of knowledge and skills, analogous to alternate or extended content standards. The EEs were developed by linking to the grade-level expectations identified in the Common Core State Standards. The purpose of the EEs is to build a bridge from the Common Core State Standards to academic expectations for students with the most significant cognitive disabilities.

Testlets are the basic units of measurement in the DLM system. Testlets are short, instructionally relevant measures of student knowledge, skills, and understandings. Each testlet is made up of three to nine assessment items. Assessment items are developed based on nodes at the five linkage levels for each EE. Each testlet measures one EE at one linkage level, with the exception of writing testlets. See Chapter 4 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022) for a description of writing testlets. The Target linkage level reflects the grade-level expectation aligned directly to the EE. For each EE, small collections of nodes that represent critical junctures on the path toward the grade-level expectation are identified earlier in the map. Nodes are also identified beyond the Target, at the Successor level, to give students an opportunity to grow toward the grade-level expectations that apply to students without significant cognitive disabilities.

There are three levels below the Target and one level beyond the Target, as listed below; a brief illustrative sketch of this structure follows the list.

  1. Initial Precursor
  2. Distal Precursor
  3. Proximal Precursor
  4. Target
  5. Successor
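
The linkage-level structure described above can be illustrated with a short sketch. This is purely illustrative and not part of the DLM system; the class names, fields, and the example EE identifier are hypothetical, and the only constraints encoded are those stated in this section (five linkage levels per EE and three to nine items per testlet).

```python
from dataclasses import dataclass
from enum import IntEnum


class LinkageLevel(IntEnum):
    """The five linkage levels at which each Essential Element (EE) is assessed."""
    INITIAL_PRECURSOR = 1
    DISTAL_PRECURSOR = 2
    PROXIMAL_PRECURSOR = 3
    TARGET = 4      # grade-level expectation aligned directly to the EE
    SUCCESSOR = 5   # extends beyond the grade-level expectation


@dataclass
class Testlet:
    """A short measure of one EE at one linkage level (writing testlets differ)."""
    essential_element: str        # e.g., a hypothetical ID such as "ELA.EE.RL.3.1"
    linkage_level: LinkageLevel
    item_ids: list[str]           # each testlet contains three to nine items

    def __post_init__(self) -> None:
        if not 3 <= len(self.item_ids) <= 9:
            raise ValueError("Each testlet is made up of three to nine items.")
```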

3.2 Testlet and Item Writing

This section describes information pertaining to item writing and item writer demographics for the 2022–2023 year. For a complete summary of item and testlet development procedures, see Chapter 3 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022).

3.2.1 2023 Testlet and Item Writing

Item development for 2022–2023 focused on replenishing and increasing the pool of testlets in all content areas. Items were developed by external item writers and internal test development staff. External item writers focused on writing computer-delivered testlets. Due to the highly templated nature of the teacher-administered testlets, all ELA and mathematics teacher-administered testlets were developed internally by test development staff. A total of 326 computer-delivered testlets were produced by 64 item writers.

3.2.1.1 Item Writers

Item writers were selected from the Accessible Teaching, Learning, and Assessment Systems (ATLAS) MemberClicks database. The database is a profile-based recruitment tool hosted in MemberClicks and includes individuals recruited via the DLM governance board and social media, individuals who have previously participated in item writing, and individuals who created profiles via the “sign up to participate in DLM events” link on the DLM homepage. Interested individuals create and update their participant profile. Participant profiles include demographic, education, and work experience data.

A total of 695 individuals with profiles in the ATLAS MemberClicks database were initially invited to participate in 2023 item writing. Minimum eligibility criteria included at least 1 year of teaching experience, teaching in a DLM state, and experience with the DLM alternate assessment. Prior DLM event participation, subject matter expertise, population expertise, and the distribution of experience across grade bands were also considered in selection and assignment to a subject area. Of the 695 individuals initially invited to participate, 64 registered, completed advance training, and committed to attend the workshop. All 64 registered item writers attended both days of the training event and completed at least rounds 1 and 2 of item writing. Of these item writers, 34 developed ELA testlets and 30 developed mathematics testlets.

The demographics for the item writers are presented in Table 3.1. The median and range of years of teaching experience are shown in Table 3.2. Of the item writers who responded to the question, the median years of experience in pre-K–12 was 15 years for item writers of ELA and mathematics. Item writers equally represented Grades 3–8, with slightly less representation of high school. See Table 3.3 for a summary of item writers’ grade-level teaching experience.

Table 3.1: Demographics of the Item Writers
n %
Gender
Female 60 93.8
Male   3   4.7
Preferred to self-describe   1   1.6
Race
White 59 92.2
African American   5   7.8
Hispanic ethnicity
Non-Hispanic 59 92.2
Hispanic   1   1.6
Chose not to disclose   4   6.2

Table 3.2: Item Writers’ Years of Teaching Experience
Teaching experience n Median Range
Pre-K–12 57 15.0 3–32
English language arts 55 13.0 1–30
Mathematics 55 13.0 1–30

Table 3.3: Item Writers’ Grade-Level Teaching Experience
Grade level n %
Grade 3 35 54.7
Grade 4 37 57.8
Grade 5 37 57.8
Grade 6 36 56.2
Grade 7 34 53.1
Grade 8 36 56.2
High school 22 34.4
Note. Item writers could indicate multiple grade levels.

The 64 item writers represented a highly qualified group of professionals with both content and special education perspectives. The degrees held by item writers are shown in Table 3.4. All item writers held at least a bachelor’s degree. The majority of the item writers (n = 56; 88%) also held a master’s degree, for which the most common field of study was special education (n = 33; 59%).

Table 3.4: Item Writers’ Degree Type (N = 64)
Degree n %
Bachelor’s degree 64 100.0
Education 20   31.2
Special education 22   34.4
Other 18   28.1
Missing   4     6.2
Master’s degree 56 87.5
Education 17 30.4
Special education 33 58.9
Other   6 10.7

Item writers reported a range of experience working with students with different disabilities, as summarized in Table 3.5. Item writers collectively had the most experience working with students with multiple disabilities (n = 57; 89%), a significant cognitive disability (n = 55; 86%), or other health impairments (n = 54; 84%).

Table 3.5: Item Writers’ Experience With Disability Categories
Disability category n %
Blind/low vision 32 50.0
Deaf/hard of hearing 30 46.9
Emotional disability 46 71.9
Mild cognitive disability 45 70.3
Multiple disabilities 57 89.1
Orthopedic impairment 32 50.0
Other health impairment 54 84.4
Significant cognitive disability 55 85.9
Specific learning disability 51 79.7
Speech impairment 45 70.3
Traumatic brain injury 29 45.3
No disability   2   3.1
Note. Item writers could select multiple categories.

The professional roles reported by the 2022–2023 item writers are shown in Table 3.6. While item writers had a range of professional roles, they were primarily classroom educators.

Table 3.6: Professional Roles of Item Writers
Role n %
Building administrator   1   1.6
Classroom educator 45 70.3
District staff   5   7.8
Instructional coach   6   9.4
State education agency   3   4.7
University faculty/staff   1   1.6
Other   3   4.7

Item writers came from 17 different states and the District of Columbia. The geographic areas of the institutions in which item writers taught or held a position are reported in Table 3.7. Within the survey, rural was defined as a population of fewer than 2,000 inhabitants, suburban was defined as a city of 2,000–50,000 inhabitants, and urban was defined as a city of more than 50,000 inhabitants.

Table 3.7: Institution Geographic Areas for Item Writers
Geographic area n %
Rural 26 40.6
Suburban 19 29.7
Urban 19 29.7

3.2.1.2 Item-Writing Process

The item-writing process for 2022–2023 began with item writers completing three advance training modules: a DLM overview module and two content-specific modules (either ELA or mathematics). In January 2023, item writers and staff gathered in Kansas City for an on-site item-writing workshop. During this workshop, item writers received additional training and worked on producing and peer reviewing two computer-delivered testlets. Following the on-site workshop, item writers continued producing and peer reviewing computer-delivered testlets virtually via a secure online platform through April 2023.

Between January 2023 and April 2023, staff developed items for teacher-administered testlets internally because of the highly templated nature of these testlets.

Item writers produced a total of 326 computer-delivered testlets (176 in ELA and 150 in mathematics). DLM staff produced a total of 72 teacher-administered testlets (31 in ELA and 41 in mathematics).

3.2.2 External Reviews

3.2.2.1 Items and Testlets

The purpose of external reviews of items and testlets is to evaluate whether items and testlets measure the intended content, are accessible, and are free of bias or sensitive content. Panelists use external review criteria established for DLM alternate assessments to rate each item and testlet as accept, revise, or reject. External review panelists provide recommendations for revise ratings and explanations for reject ratings. The test development team uses collective feedback from the panelists to inform decisions about items and testlets prior to field-testing.

The content reviews of items and testlets for 2022–2023 were conducted during 2-day virtual meetings. The accessibility and bias/sensitivity reviews for 2022–2023 were conducted during 3-day virtual meetings.

3.2.2.1.1 Overview of Review Process

Panelists were selected from the ATLAS MemberClicks database based on predetermined qualifications for each panel type. Panelists were assigned to content, accessibility, or bias and sensitivity panels based on their qualifications.

In 2023, there were 33 panelists. Of those, seven were ELA panelists and six were mathematics panelists. There were also 16 accessibility panelists and four bias and sensitivity panelists who reviewed items and testlets from all subjects.

Prior to participating in the virtual panel meetings, panelists completed an advance training course that included an External Review Procedures module and a module that specifically aligned to their assigned panel type. The content modules were subject-specific, while the accessibility and bias and sensitivity modules were universal for both subjects. After each module, panelists completed a quiz and were required to score 80% or higher to continue to the next stage of training.

Following the completion of advance training, panelists met with their panels in a virtual environment. Each panel was led by an ATLAS facilitator and co-facilitator. Facilitators provided additional training on the platform and the criteria used to review items and testlets. Panelists began their reviews with a calibration collection of two testlets to calibrate their expectations for the review. Following the calibration collection, panelists reviewed collections of items and testlets independently. Once all panelists completed the review, facilitators used a discussion protocol known as the Rigorous Item Feedback protocol to discuss any items or testlets that were rated either revise or reject by a panelist and to obtain collective feedback about those items and testlets. The Rigorous Item Feedback protocol helps facilitators elicit detailed, substantive feedback from panelists and record feedback in a uniform fashion. Following the discussion, panelists were given another collection of items and testlets to review. This process was repeated until all collections of items and testlets were reviewed. Collections ranged from 8 to 17 testlets, depending on the panel type. Content panels had fewer testlets per collection, and bias and sensitivity and accessibility panels had more testlets per collection.

3.2.2.1.2 External Reviewers

The demographics for the external reviewers are presented in Table 3.8. The median and range of years of teaching experience are shown in Table 3.9. The median years of experience for external reviewers was 17 years in pre-K–12, 13.5 years in ELA, and 13 years in mathematics. External reviewers represented all grade levels, with slightly greater representation for Grades 6–8 and high school. See Table 3.10 for a summary.

Table 3.8: Demographics of the External Reviewers
n %
Gender
Female 29 87.9
Male   4 12.1
Race
White 29 87.9
African American   2   6.1
American Indian   1   3.0
Chose not to disclose   1   3.0
Hispanic ethnicity
Non-Hispanic 30 90.9
Hispanic   1   3.0
Chose not to disclose   2   6.1

Table 3.9: External Reviewers’ Years of Teaching Experience
Teaching experience Median Range
Pre-K–12 17.0 8–39
English language arts 13.5 1–39
Mathematics 13.0 1–31

Table 3.10: External Reviewers’ Grade-Level Teaching Experience
Grade level n %
Grade 3 17 51.5
Grade 4 15 45.5
Grade 5 21 63.6
Grade 6 26 78.8
Grade 7 24 72.7
Grade 8 23 69.7
High school 22 66.7
Note. Reviewers could indicate multiple grade levels.

The 33 external reviewers represented a highly qualified group of professionals. The level of degree and most common types of degrees held by external reviewers are shown in Table 3.11. A majority (n = 24; 73%) also held a master’s degree, for which the most common field of study was special education (n = 14; 58%).

Table 3.11: External Reviewers’ Degree Type
Degree n %
Bachelor’s degree 33 100.0
Education 10   30.3
Special education 10   30.3
Other 11   33.3
Missing   2     6.1
Master’s degree 24 72.7
Education   4 16.7
Special education 14 58.3
Other   6 25.0

Most external reviewers had experience working with students with disabilities (88%), and 82% had experience with the administration of alternate assessments. The difference in percentages suggests that some external reviewers who worked with students with disabilities may not have administered alternate assessments.

External reviewers reported a range of experience working with students with different disabilities, as summarized in Table 3.12. External reviewers collectively had the most experience working with students with a specific learning disability (n = 26; 79%), a mild cognitive disability (n = 25; 76%), or multiple disabilities (n = 24; 73%).

Table 3.12: External Reviewers’ Experience With Disability Categories
Disability category n %
Blind/low vision 14 42.4
Deaf/hard of hearing 10 30.3
Emotional disability 24 72.7
Mild cognitive disability 25 75.8
Multiple disabilities 24 72.7
Orthopedic impairment 10 30.3
Other health impairment 20 60.6
Significant cognitive disability 21 63.6
Specific learning disability 26 78.8
Speech impairment 22 66.7
Traumatic brain injury 11 33.3
Note. Reviewers could select multiple categories.

Panelists had varying experience teaching students with the most significant cognitive disabilities. ELA and mathematics panelists had a median of 14 years of experience teaching students with the most significant cognitive disabilities, with a minimum of 7 years and a maximum of 31 years of experience.

The professional roles reported by the 2022–2023 reviewers are shown in Table 3.13. While the reviewers had a range of professional roles, they were primarily classroom educators.

Table 3.13: Professional Roles of External Reviewers
Role n %
Classroom educator 30 90.9
District staff   1   3.0
Instructional coach   1   3.0
Other   1   3.0

ELA and mathematics panelists were from eight different states. The geographic areas of the institutions in which reviewers taught or held a position are reported in Table 3.14. Within the survey, rural was defined as a population of fewer than 2,000 inhabitants, suburban was defined as a city of 2,000–50,000 inhabitants, and urban was defined as a city of more than 50,000 inhabitants.

Table 3.14: Institution Geographic Areas for Content Panelists
Geographic area n %
Rural 20 60.6
Urban   9 27.3
Suburban   4 12.1

3.2.2.1.3 Results of External Reviews

For ELA, the percentage of items and testlets rated as “accept” across panels and rounds of review ranged from 70% to 96% and from 52% to 87%, respectively. The percentage of items and testlets rated as “revise” across panels and rounds of review ranged from 4% to 30% and from 13% to 48%, respectively. No ELA items or testlets were recommended for rejection.

For mathematics, the percentage of items and testlets rated as “accept” ranged from 37% to 95% and from 47% to 85%, respectively, across panels and rounds of review. The percentage of items and testlets rated as “revise” across panels and rounds of review ranged from 5% to 63% and from 15% to 53%, respectively. No mathematics items or testlets were recommended for rejection.

3.2.2.1.4 Test Development Team Decisions

Because each item and testlet is examined by three distinct panels, ratings were compiled across panels, following the process described in Chapter 3 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022). The test development team reviews the collective feedback provided by panelists for each item and testlet. Once the test development team has viewed each item and testlet and considered the feedback provided by the panelists, it assigns one of the following decisions to each one: (a) accept as is; (b) minor revision (a pattern of minor concerns that will be addressed); (c) major revision needed; (d) reject; and (e) more information needed.

The ELA test development team accepted 72% of testlets and 93% of items as is. Of the items and testlets that were revised, only minor changes (e.g., minor rewording but concept remained unchanged) were required, as opposed to major changes (e.g., stem or response option replaced). The ELA test development team made 17 minor revisions to items and did not reject any testlets.

The mathematics test development team accepted 40% of testlets and 27% of items as is. Of the items and testlets that were revised, most required major changes (e.g., stem or response option replaced) as opposed to minor changes (e.g., minor rewording but concept remained unchanged). The mathematics test development team made 36 minor revisions and 184 major revisions to items and did not reject any testlets.

Most of the items and testlets reviewed will be field-tested during the fall 2023 or spring 2024 windows.

3.2.2.2 External Review of ELA Texts

The purpose of the external review of texts is to ensure that texts represent the intended content, are accessible, are free of bias and sensitivity concerns, and include appropriate imagery. Panelists use external review criteria to rate texts as accept, revise, or reject and provide recommendations for revise ratings or explanations for reject ratings. The ELA test development team uses the collective feedback from the panelists to inform decisions about texts and images before they are used in item and testlet development.

External review of texts for 2022–2023 took place during 2-day virtual panel meetings. During the virtual meetings, facilitators led the feedback discussions and recorded decisions. Senior members of the DLM leadership team used their industry experience to help the ELA team revise established text criteria to better reflect current assessment knowledge and practice. The following criteria were added to the text content criteria for panelists to use during their review of the texts.

  • Consistency of Text Elements: Images do not contain features that contradict aspects of the text or other images in the text.
    • Characters: Images of the same person represent a character consistently within and across text collections. Images of characters in texts align with the characterization in the source books (age, race, historical context).
    • Objects/Settings: Images of the same objects and settings are used consistently within texts. Images used to present objects focus on a single object and do not suggest any element that conflicts with how the objects are described in the text.
  • Node Support: The text and images provide a sufficient opportunity to assess the associated node(s). Items that elicit evidence of the targeted cognition for the typical range of test takers could be written using the text.

The following criterion was added to the text bias and sensitivity criteria for panelists to use during their review of texts.

  • Diversity: Where applicable, there is fair representation of diversity in race, ethnicity, gender, disability, and family composition.
    • Informational texts: Images in informational texts should include depictions of gender, racial, disability, and ethnic diversity where appropriate and possible. If it is impossible to show diversity in a single image, diversity should be demonstrated across images, both within and across texts.

3.2.2.2.1 Recruitment, Training, Panel Meetings, and Results

Panelists were selected from the ATLAS MemberClicks database based on predetermined qualifications for each panel type. Panelists were assigned to content, accessibility, or bias and sensitivity panels based on their qualifications.

For 2022–2023, 46 panelists who had experience with ELA content and/or experience with students with significant cognitive disabilities were recruited to participate. Of the 46 panelists, 37 (80%) were primarily classroom teachers, two (5%) were retired teachers, and seven (15%) had other roles (i.e., instructional coach, item developer, administration). Panelists had varying experience teaching students with the most significant cognitive disabilities, with a median of 10 years of experience, a minimum of 2 years of experience, and a maximum of 35 years of experience. Of panelists who disclosed their ethnicity, 45 (98%) were non-Hispanic and one (2%) was Hispanic. Among the panelists, 44 (96%) were female and two (4%) were male. Panelists taught or worked in a mix of rural (n = 26, 57%), suburban (n = 11, 24%), and urban (n = 9, 19%) locations. Panelists represented 10 partner states, including four Instructionally Embedded model states and six Year-End model states.

Panelists individually completed the advance training module prior to the panel meetings. Additional training on the structure and process of consensus discussions, panel-specific criteria, and available resources was provided at the beginning of the panel meeting. During the panel meetings, panelists were given time to read and take notes about a text. Once all panelists had completed the review, facilitators used the Rigorous Item Feedback protocol, which was originally developed for item review and adapted to text review with the aid of an external consultant. The co-facilitator recorded consensus ratings and recommendations for revision on text rating sheets. In cases where panelists recommended revisions, texts were revised to enhance language clarity, cohere with images, or better align with the text criteria. All texts were either accepted based on reviews or accepted after revisions were made (Table 3.15).

Table 3.15: 2022–2023 Summary of English Language Arts Text Panel Ratings (Across All Panels) and Final Decisions
Ratings Panel recommendation Final staff decision
Accept as is   1   3
Revise 29 27
Reject   0   0

3.3 Evidence of Item Quality

Each year, testlets are added to and removed from the operational pool to maintain a pool of high-quality testlets. The following sections describe evidence of item quality, including evidence supporting field-test testlets available for administration, a summary of the operational pool, and evidence of DIF.

3.3.1 Field-Testing

During 2022–2023, DLM field-test testlets were administered to evaluate item quality before promoting testlets to the operational pool. Adding testlets to the operational pool allows for multiple testlets to be available in the instructionally embedded and spring assessment windows. This means that teachers can assess the same EE and linkage level multiple times in the instructionally embedded window, if desired, and it reduces item exposure for the EEs and linkage levels that are assessed most frequently. Additionally, deepening the operational pool allows testlets to be evaluated for retirement in instances in which other testlets show better performance.

In this section, we describe the field-test testlets administered in 2022–2023 and the associated review activities. A summary of prior field-test events can be found in Chapter 3 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022).

3.3.1.1 Description of Field Tests Administered in 2022–2023

The Instructionally Embedded and Year-End assessment models share a common item pool, and testlets field-tested during the instructionally embedded assessment window may eventually be promoted for use in the spring assessment window.

Testlets were made available for field-testing based on the availability of field-test content for each EE and linkage level.

During the spring assessment window, field-test testlets were administered to each student after completion of the operational assessment. A field-test testlet was assigned for an EE that was assessed during the operational assessment, at a linkage level equal to or adjacent to the linkage level of the operational testlet.
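
As a concrete illustration of this assignment rule, the following sketch selects a field-test testlet for the same EE at a linkage level equal to or adjacent to the operationally assessed level. The function names and the pool data structure are hypothetical; the operational DLM assignment logic is not reproduced here.

```python
# Linkage levels ordered from lowest to highest.
LINKAGE_LEVELS = [
    "Initial Precursor",
    "Distal Precursor",
    "Proximal Precursor",
    "Target",
    "Successor",
]


def eligible_levels(operational_level: str) -> list[str]:
    """Return the linkage levels equal or adjacent to the operationally assessed level."""
    idx = LINKAGE_LEVELS.index(operational_level)
    return LINKAGE_LEVELS[max(idx - 1, 0) : idx + 2]


def assign_field_test_testlet(ee: str, operational_level: str,
                              field_test_pool: dict[tuple[str, str], list[str]]) -> str | None:
    """Pick a field-test testlet for the same EE at an equal or adjacent linkage level.

    `field_test_pool` maps (EE, linkage level) to available field-test testlet IDs;
    None is returned when no field-test content is available for the EE.
    """
    for level in eligible_levels(operational_level):
        available = field_test_pool.get((ee, level))
        if available:
            return available[0]
    return None
```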

Table 3.16 summarizes the number of field-test testlets available during 2022–2023. A total of 392 field-test testlets were available across grades, subjects, and windows.

Table 3.16: 2022–2023 Field-Test Testlets, by Subject
Grade   Instructionally embedded window: English language arts (n)   Instructionally embedded window: Mathematics (n)   Spring assessment window: English language arts (n)   Spring assessment window: Mathematics (n)
3 11 7 15 17
4   3 7 11 17
5   2 4   9 15
6   5 3 12 11
7   4 7 14 15
8 10 6 18 14
9   7 7 19 14
10   7 8 19 14
11   5 3 10 12
12   5 3 10 12

Table 3.17 presents the demographic breakdown of students completing field-test testlets in ELA and mathematics in 2022–2023. Consistent with the DLM population, approximately 67% of students completing field-test testlets were male, approximately 57% were White, and approximately 75% were non-Hispanic. The vast majority of students completing field-test testlets were not English-learner eligible or monitored. The students completing field-test testlets were split across the four complexity bands, with most students assigned to Band 1 or Band 2. See Chapter 4 of this manual for a description of the student complexity bands.

Table 3.17: Demographic Summary of Students Participating in Field Tests
Demographic group   English language arts (n)   English language arts (%)   Mathematics (n)   Mathematics (%)
Gender
Male 43,260 67.5 45,400 67.6
Female 20,776 32.4 21,733 32.3
Nonbinary/undesignated       70   0.1       72   0.1
Race
White 36,670 57.2 38,467 57.2
African American 13,490 21.0 14,149 21.1
Two or more races   7,430 11.6   7,787 11.6
Asian   3,906   6.1   4,072   6.1
American Indian   2,057   3.2   2,156   3.2
Native Hawaiian or Pacific Islander      399   0.6      409   0.6
Alaska Native      154   0.2      165   0.2
Hispanic ethnicity
Non-Hispanic 47,883 74.7 50,092 74.5
Hispanic 16,223 25.3 17,113 25.5
English learner (EL) participation
Not EL eligible or monitored 59,715 93.2 62,579 93.1
EL eligible or monitored   4,391   6.8   4,626   7.0
English language arts complexity band
Foundational   9,397 14.7   9,836 14.6
Band 1 26,226 40.9 27,314 40.6
Band 2 22,839 35.6 23,986 35.7
Band 3   5,644   8.8   6,069   9.0
Mathematics complexity band
Foundational   9,723 15.2 10,140 15.1
Band 1 26,016 40.6 27,056 40.3
Band 2 23,641 36.9 24,938 37.1
Band 3   4,726   7.4   5,071   7.5
Note. See Chapter 4 of this manual for a description of student complexity bands.

Participation in field-testing was not required, but educators were encouraged to administer all available testlets to their students. Field-test participation rates for ELA and mathematics in the spring assessment window are shown in Table 3.18. Note that because the Instructionally Embedded and Year-End models share an item pool, participation numbers are combined across all states. In total, 62% of students in ELA and 66% of students in mathematics completed at least one field-test testlet. In the spring assessment window, 94% of field-test testlets had a sample size of at least 20 students.

Table 3.18: 2022–2023 Field-Test Participation, by Subject and Window
Spring assessment window
Subject n %
English language arts 64,106 62.4
Mathematics 67,205 65.6

3.3.1.2 Results of Item Analysis

All flagged items are reviewed by test development teams following field-testing. Items are flagged if they meet either of the following statistical criteria (a sketch of these checks appears after the list):

  • The item is too challenging, as indicated by a p-value less than .35. This value was selected as the threshold for flagging because most DLM assessment items offer three response options, so a value less than .35 may indicate less than chance selection of the correct response option.

  • The item is significantly easier or harder than other items assessing the same EE and linkage level, as indicated by a weighted standardized difference greater than two standard deviations from the mean p-value for that EE and linkage level combination.
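
A minimal sketch of these two flagging checks is shown below, assuming dichotomously scored responses grouped by EE and linkage level. Because the exact weighting scheme for the standardized difference is not restated here, the sketch weights each item's p-value by its sample size; the function name is hypothetical and the code is illustrative only.

```python
import numpy as np


def flag_items(item_p: dict[str, float], item_n: dict[str, int],
               p_floor: float = 0.35, z_crit: float = 2.0) -> set[str]:
    """Flag items within a single EE/linkage level combination.

    An item is flagged if its proportion correct (p-value) falls below `p_floor`
    (roughly chance performance for three response options) or if its
    sample-size-weighted standardized difference from the mean p-value for the
    EE and linkage level exceeds `z_crit` standard deviations.
    """
    items = list(item_p)
    p = np.array([item_p[i] for i in items])
    n = np.array([item_n[i] for i in items])

    weighted_mean = np.average(p, weights=n)
    weighted_sd = np.sqrt(np.average((p - weighted_mean) ** 2, weights=n))

    flagged = set()
    for item, p_i in zip(items, p):
        too_hard = p_i < p_floor
        discrepant = weighted_sd > 0 and abs(p_i - weighted_mean) / weighted_sd > z_crit
        if too_hard or discrepant:
            flagged.add(item)
    return flagged
```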

Figure 3.1 and Figure 3.2 summarize the p-values for items that met the minimum sample size threshold of 20. Most items fell above the .35 threshold for flagging. In ELA, 597 items (95%) were above the .35 flagging threshold. In mathematics, 617 items (84%) were above the .35 flagging threshold. Test development teams for each subject reviewed 33 items (5%) for ELA and 120 items (16%) for mathematics that were below the threshold.

Figure 3.1: p-values for English Language Arts 2022–2023 Field-Test Items

This figure contains a histogram displaying p-value on the x-axis and the number of English language arts field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.2: p-values for Mathematics 2022–2023 Field-Test Items

This figure contains a histogram displaying p-value on the x-axis and the number of mathematics field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Items in the DLM assessments are designed and developed to be fungible (i.e., interchangeable) within each EE and linkage level, meaning field-test items should perform consistently with the operational items measuring the same EE and linkage level. To evaluate whether field-test items perform similarly to operational items measuring the same EE and linkage level, standardized difference values were also calculated for the field-test items. Figure 3.3 and Figure 3.4 summarize the standardized difference values for items field-tested during the instructionally embedded window for ELA and mathematics, respectively. Most items fell within two standard deviations of the mean for the EE and linkage level. Items beyond the threshold were reviewed by test development teams for each subject.

Figure 3.3: Standardized Difference Z-Scores for English Language Arts 2022–2023 Field-Test Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of English language arts field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.4: Standardized Difference Z-Scores for Mathematics 2022–2023 Field-Test Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of mathematics field test items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

A total of 32 ELA testlets (27%) and 61 mathematics testlets (44%) had at least one item flagged due to their p-value and/or standardized difference value. Test development teams reviewed all flagged items and their context within the testlet to identify possible reasons for the flag and to determine whether an edit was likely to resolve the issue.

Of the 85 ELA testlets that were not flagged, one (1%) was edited and reassigned to the field-test pool for content-based reasons (e.g., changes to item wording), 81 (95%) were promoted as is to the operational pool, and three (4%) were sent back to the field-test pool with no edits for additional data collection to obtain item difficulty estimates based on larger samples. Of the 32 ELA testlets that were flagged, three (9%) were edited and reassigned to the field-test pool and 29 (91%) were sent back to the field-test pool with no edits for additional data collection. Of the 78 mathematics testlets that were not flagged, two (3%) were edited and reassigned to the field-test pool for content-based reasons, 65 (83%) were promoted as is to the operational pool, and 11 (14%) were rejected and retired. Of the 61 mathematics testlets that were flagged, seven (11%) were edited and reassigned to the field-test pool, 42 (69%) were promoted as is to the operational pool to maintain pool depth given testlet retirement, and 12 (20%) were rejected and retired.

In addition to these reviews, field-test items were reviewed for DIF following the same procedures used for items in the operational pool (see Section 3.3.3 of this manual). A total of three field-test items in ELA and six field-test items in mathematics were flagged for nonnegligible DIF. Of the three flagged ELA items reviewed by the test development team as part of item review, one was assigned to be edited and field-tested again and two were sent back to the field-test pool with no edits. All six flagged mathematics items reviewed by the test development team were sent back to the field-test pool with no edits.

3.3.1.3 Field-Test Data Review

Data collected during each field test are compiled, and statistical flags are implemented ahead of test development team review. Flagging criteria serve as a source of evidence for test development teams in evaluating item quality; however, final judgments are content based, taking into account the testlet as a whole, the underlying nodes in the DLM maps that the items were written to assess, and pool depth.

Review of field-test data occurs annually during February and March. This review includes data from the immediately preceding instructionally embedded and spring assessment windows. That is, the review in February and March of 2023 includes field-test data collected during the spring 2022 assessment window and the instructionally embedded assessment window of 2022–2023. Data that were collected during the 2023 spring assessment window will be reviewed in February and March of 2024, with results included in the 2023–2024 technical manual update.

Test development teams for each subject classified each reviewed item into one of four categories:

  1. No changes made to item. Test development team decided item can go forward to operational assessment.
  2. Test development team identified concerns that required modifications. Modifications were clearly identifiable and were likely to improve item performance.
  3. Test development team identified concerns that required modifications. The content was worth preserving rather than rejecting. Item review may not have clearly pointed to specific modifications that were likely to improve the item.
  4. Rejected item. Test development team determined the item was not worth revising.

For an item to be accepted as is, the test development team had to determine that the item was consistent with DLM item-writing guidelines and that the item was aligned to the node. An item or testlet was rejected completely if it was inconsistent with DLM item-writing guidelines, if the EE and linkage level were covered by other testlets that had better-performing items, or if there was no clear content-based revision to improve the item. In some instances, a decision to reject an item resulted in the rejection of the testlet, as well.

Common reasons for flagging an item for modification included items that were misaligned to the node, distractors that could be argued as partially correct, or unnecessary complexity in the language of the stem. After reviewing flagged items, the test development team looked at all items classified into Category 3 or Category 4 within the testlet to help determine whether to retain or reject the testlet. Here, the test development team could elect to keep the testlet (with or without revision) or reject it. If a revision was needed, it was assumed the testlet needed field-testing again. The entire testlet was rejected if the test development team determined the flagged items could not be adequately revised.

3.3.2 Operational Assessment Items for 2022–2023

There were several updates to the pool of operational items for 2022–2023: 72 testlets were promoted to the operational pool from field-testing in 2021–2022, including 47 ELA testlets and 25 mathematics testlets. Additionally, two testlets (<1% of all testlets) were retired due to model fit. For a discussion of the model-based retirement process, see Chapter 5 of this manual.

Testlets were made available for operational testing in 2022–2023 based on the 2021–2022 operational pool and the promotion of testlets field-tested during 2021–2022 to the operational pool following their review. Table 3.19 summarizes the total number of operational testlets for 2022–2023. In total, there were 945 operational testlets available. This total included 342 EE/linkage level combinations (192 ELA, 150 mathematics) for which both a general version and a version for students who are blind or visually impaired or read braille were available.

Operational assessments were administered during the spring assessment window. A total of 1,451,585 test sessions were administered during both assessment windows. One test session is one testlet taken by one student. Only test sessions that were complete at the close of each testing window counted toward the total sessions.

Table 3.19: 2022–2023 Operational Testlets, by Subject (N = 945)
Grade English language arts (n) Mathematics (n)
3 70   44
4 74   47
5 71   51
6 70   45
7 71   39
8 64   45
9–10 64 123
11–12 67 *
* In mathematics, high school is banded in Grades 9–11.

3.3.2.1 Educator Perception of Assessment Content

Each year, test administrators are asked two questions about their perceptions of the assessment content; participation in the test administrator survey is described in Chapter 4 of this manual. Table 3.20 describes their responses in 2022–2023. Questions pertained to whether the DLM assessments measured important academic skills and reflected high expectations for their students.

Test administrators generally responded that content reflected high expectations for their students (86% agreed or strongly agreed) and measured important academic skills (78% agreed or strongly agreed). While the majority of test administrators agreed with these statements, 14%–22% disagreed. DLM assessments represent a departure from the breadth of academic skills assessed by many states’ previous alternate assessments. Given the short history of general curriculum access for this population and the tendency to prioritize the instruction of functional academic skills (Karvonen et al., 2011), test administrators’ responses may reflect awareness that DLM assessments contain challenging content. However, test administrators were divided on its importance in the educational programs of students with the most significant cognitive disabilities. Feedback from focus groups with educators about score reports reflected similar variability in educator perceptions of assessment content (Clark et al., 2018, 2022).

Table 3.20: Educator Perceptions of Assessment Content
Statement   Strongly disagree (n)   Strongly disagree (%)   Disagree (n)   Disagree (%)   Agree (n)   Agree (%)   Strongly agree (n)   Strongly agree (%)
Content measured important academic skills and knowledge for this student. 3,653 8.0 6,568 14.3 27,351 59.6   8,319 18.1
Content reflected high expectations for this student. 1,949 4.3 4,296   9.4 27,461 60.3 11,862 26.0

3.3.2.2 Psychometric Properties of Operational Assessment Items for 2022–2023

The proportion correct (p-value) was calculated for all operational items to summarize information about item difficulty.

Figure 3.5 and Figure 3.6 show the distribution of p-values for operational items in ELA and mathematics, respectively. To prevent items with small sample sizes from potentially skewing the results, the sample size cutoff for inclusion in the p-value plots was 20. In total, 15 items (<1% of all items) were excluded due to small sample size: nine ELA items (1% of all ELA items) and six mathematics items (<1% of all mathematics items). In general, ELA items were easier than mathematics items, as evidenced by more items in the higher p-value ranges.

Figure 3.5: p-values for English Language Arts 2022–2023 Operational Items

A histogram displaying p-value on the x-axis and the number of English language arts operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.6: p-values for Mathematics 2022–2023 Operational Items

A histogram displaying p-value on the x-axis and the number of mathematics operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Items in the DLM assessments are designed and developed to be fungible (i.e., interchangeable) within each EE and linkage level, meaning that the items are expected to function identically to the other items measuring the same EE and linkage level. To evaluate the fungibility assumption, standardized difference values were calculated for all operational items, with a student sample size of at least 20 required to compare the p-value for the item to all other items measuring the same EE and linkage level. If an item is fungible with the other items measuring the same EE and linkage level, the item is expected to have a nonsignificant standardized difference value. The standardized difference values provide one source of evidence of internal consistency.

Figure 3.7 and Figure 3.8 summarize the distributions of standardized difference values for operational items in ELA and mathematics, respectively. Across EEs and linkage levels, 99% of ELA items and 99% of mathematics items fell within two standard deviations of the mean for the items measuring the same EE and linkage level. As additional data are collected and decisions are made regarding item pool replenishment, test development teams will consider item standardized difference values, along with item misfit analyses, when determining which items and testlets are recommended for retirement.

Figure 3.7: Standardized Difference Z-Scores for English Language Arts 2022–2023 Operational Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of English language arts operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.8: Standardized Difference Z-Scores for Mathematics 2022–2023 Operational Items

This figure contains a histogram displaying standardized difference on the x-axis and the number of mathematics operational items on the y-axis.

Note. Items with a sample size less than 20 were omitted.

Figure 3.9 summarizes the distributions of standardized difference values for operational items by linkage level. Most items fell within two standard deviations of the mean of all items measuring the respective EE and linkage level, and the distributions are consistent across linkage levels.

Figure 3.9: Standardized Difference Z-Scores for 2022–2023 Operational Items by Linkage Level

This figure contains a histogram displaying standardized difference on the x-axis and the number of operational items on the y-axis. The histogram has a separate row for each linkage level.

Note. Items with a sample size less than 20 were omitted.

3.3.3 Evaluation of Item-Level Bias

DIF identifies instances where test items are more difficult for some groups of examinees despite these examinees having similar knowledge and understanding of the assessed concepts (Camilli & Shepard, 1994). DIF analyses can uncover internal inconsistency if particular items are functioning differently in a systematic way for identifiable subgroups of students (American Educational Research Association et al., 2014). While identification of DIF does not always indicate a weakness in the test item, it can point to construct-irrelevant variance, posing considerations for validity and fairness.

3.3.3.1 Method

DIF analyses followed the same procedure used in previous years and examined race in addition to gender. Analyses included data from 2015–2016 through 2021–2022 to flag items for evidence of DIF. (DIF analyses are conducted on the sample of data used to update the model calibration, which uses data through the previous operational assessment; see Chapter 5 of this manual for more information.) Items were selected for inclusion in the DIF analyses based on minimum sample-size requirements for the three gender subgroups (female, male, and nonbinary/undesignated) and for race subgroups: African American, Alaska Native, American Indian, Asian, multiple races, Native Hawaiian or Pacific Islander, and White.

The DLM student population is unbalanced in both gender and race. The number of female students responding to items is smaller than the number of male students by a ratio of approximately 1:2, and nonbinary/undesignated students make up less than 0.1% of the DLM student population. Similarly, the number of non-White students responding to items is smaller than the number of White students by a ratio of approximately 1:2. Therefore, on advice from the DLM Technical Advisory Committee, the threshold for including an item in DIF analysis requires that the focal group (i.e., the historically disadvantaged group) must have at least 100 students responding to the item. The threshold of 100 was selected to balance the need for a sufficient sample size in the focal group with the relatively low number of students responding to many DLM items. Writing items were excluded from the DIF analyses described here because they include nonindependent response options.

Consistent with previous years, additional criteria were included to prevent estimation errors. Items with an overall proportion correct (p-value) greater than .95 or less than .05 were removed from the analyses. Items for which the p-value for one gender or racial group was greater than .97 or less than .03 were also removed from the analyses.
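
A short sketch of these inclusion screens is shown below, assuming the item-level counts and proportions correct have already been computed; the function name and argument layout are hypothetical.

```python
def include_in_dif(focal_n: int, item_p: float, subgroup_ps: list[float],
                   min_focal_n: int = 100) -> bool:
    """Apply the DIF inclusion screens described above for one item and comparison.

    `subgroup_ps` holds the proportion correct for each subgroup in the
    comparison (e.g., the reference-group and focal-group p-values).
    """
    if focal_n < min_focal_n:                           # focal-group sample size
        return False
    if item_p > 0.95 or item_p < 0.05:                  # overall item p-value screen
        return False
    if any(p > 0.97 or p < 0.03 for p in subgroup_ps):  # subgroup p-value screen
        return False
    return True
```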

For each item, logistic regression was used to predict the probability of a correct response, given group membership and performance in the subject. Specifically, the logistic regression equation for each item included a matching variable composed of the student’s total linkage levels mastered in the subject of the item and a group membership variable, with the reference group (i.e., males for gender, White for race) coded as 0 and the focal group (i.e., females or nonbinary/undesignated for gender; African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, or two or more races for race) coded as 1. An interaction term was included to evaluate whether nonuniform DIF was present for each item (Swaminathan & Rogers, 1990); the presence of nonuniform DIF indicates that the item functions differently because of the interaction between total linkage levels mastered and the student’s group (i.e., gender or racial group). When nonuniform DIF is present, the group with the highest probability of a correct response to the item differs along the range of total linkage levels mastered; thus, one group is favored at the low end of the spectrum and the other group is favored at the high end.

Three logistic regression models were fitted for each item:

\[\begin{align} \text{M}_0\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} \tag{3.1} \\ \text{M}_1\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G \tag{3.2} \\ \text{M}_2\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G + \beta_3\text{X}G\tag{3.3} \end{align}\]

where \(\pi_i\) is the probability of a correct response to item i, \(\text{X}\) is the matching criterion, \(G\) is a dummy coded grouping variable (0 = reference group, 1 = focal group), \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(\beta_2\) is the group-specific parameter, and \(\beta_3\) is the interaction term.

Because of the number of items evaluated for DIF, Type I error rates were susceptible to inflation. The incorporation of an effect-size measure can be used to distinguish practical significance from statistical significance by providing a metric of the magnitude of the effect of adding group and interaction terms to the regression model.

For each item, the change in the Nagelkerke pseudo \(R^2\) measure of effect size was captured, from \(M_0\) to \(M_1\) or \(M_2\), to account for the effect of the addition of the group and interaction terms to the equation. All effect-size values were reported using both the Zumbo and Thomas (1997) and Jodoin and Gierl (2001) indices for reflecting a negligible (also called A-level DIF), moderate (B-level DIF), or large effect (C-level DIF). The Zumbo and Thomas thresholds for classifying DIF effect size are based on Cohen’s (1992) guidelines for identifying a small, medium, or large effect. The thresholds for each level are .13 and .26; values less than .13 have a negligible effect, values between .13 and .26 have a moderate effect, and values of .26 or greater have a large effect. The Jodoin and Gierl thresholds are more stringent, with lower threshold values of .035 and .07 to distinguish between negligible, moderate, and large effects.
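
The model comparison in Equations (3.1) through (3.3) can be sketched as follows. This is a simplified, illustrative implementation rather than the operational DLM code: for a single item it fits the three nested logistic regressions with statsmodels, computes the change in Nagelkerke pseudo-\(R^2\) when the group and interaction terms are added, and classifies the change using the Jodoin and Gierl (2001) thresholds. The column names in the assumed data frame (correct, X, G) are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def nagelkerke_r2(fit, n: int) -> float:
    """Nagelkerke pseudo-R2 for a fitted statsmodels logit result."""
    cox_snell = 1 - np.exp((2 / n) * (fit.llnull - fit.llf))
    max_possible = 1 - np.exp((2 / n) * fit.llnull)
    return cox_snell / max_possible


def jodoin_gierl_level(delta_r2: float) -> str:
    """Classify a change in pseudo-R2: A = negligible, B = moderate, C = large."""
    if delta_r2 < 0.035:
        return "A"
    return "B" if delta_r2 < 0.07 else "C"


def dif_analysis(data: pd.DataFrame) -> dict:
    """Fit M0, M1, and M2 for one item and summarize uniform and nonuniform DIF.

    `data` contains one row per student with columns:
    correct (0/1), X (total linkage levels mastered), G (0 = reference, 1 = focal).
    """
    n = len(data)
    m0 = smf.logit("correct ~ X", data).fit(disp=False)            # matching only
    m1 = smf.logit("correct ~ X + G", data).fit(disp=False)        # uniform DIF
    m2 = smf.logit("correct ~ X + G + X:G", data).fit(disp=False)  # nonuniform DIF

    r2 = {name: nagelkerke_r2(fit, n) for name, fit in (("M0", m0), ("M1", m1), ("M2", m2))}
    uniform_delta = r2["M1"] - r2["M0"]
    nonuniform_delta = r2["M2"] - r2["M0"]

    return {
        "uniform_delta_r2": uniform_delta,
        "uniform_level": jodoin_gierl_level(uniform_delta),
        "nonuniform_delta_r2": nonuniform_delta,
        "nonuniform_level": jodoin_gierl_level(nonuniform_delta),
    }
```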

3.3.3.2 DIF Results

Using the above criteria for inclusion, 2,769 (87%) items were selected for at least one gender group comparison, and 2,570 (81%) items were selected for at least one racial group comparison.

The number of items evaluated by grade and subject for gender ranged from five in grade 11–12 ELA to 275 in grades 9–10 mathematics. Because students taking DLM assessments represent three possible gender groups, there are up to two comparisons that can be made for each item, with the male group as the reference group and each of the other two groups (i.e., female, nonbinary/undesignated) as the focal group. Across all items, this results in 6,340 possible comparisons. Using the inclusion criteria specified above, 2,769 (44%) item and focal group comparisons were selected for analysis. All 2,769 items were evaluated for the female focal group.

The number of items evaluated by grade and subject for race ranged from 39 in grades 9–10 ELA to 235 in grades 9–10 mathematics. Because students taking DLM assessments represent seven possible racial groups (see Chapter 7 of this manual for a summary of participation by race and other demographic variables), there are up to six comparisons that can be made for each item, with the White group as the reference group and each of the other six groups (i.e., African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, two or more races) as the focal group. Across all items, this results in 19,020 possible comparisons. Using the inclusion criteria specified above, 7,040 (37%) item and focal group comparisons were selected for analysis. Overall, 178 items were evaluated for one racial focal group, 980 items were evaluated for two racial focal groups, 746 items were evaluated for three racial focal groups, and 666 items were evaluated for four racial focal groups. One racial focal group and the White reference group were used in each comparison. Table 3.21 shows the number of items that were evaluated for each racial focal group. Across all gender and race comparisons, sample sizes for each comparison ranged from 237 to 12,525 for gender and from 431 to 10,333 for race.

Table 3.21: Number of Items Evaluated for Differential Item Functioning for Each Race
Focal group Items (n)
African American 2,568
American Indian    671
Asian 1,407
Two or more races 2,394

Table 3.22 and Table 3.23 show the number and percentage of subgroup combinations that did not meet each inclusion criteria for gender and race, respectively, by subject and the linkage level the items assess. A total of 401 items were not included in the DIF analysis for gender for any of the subgroups. Of the 3,571 item and focal group comparisons that were not included in the DIF analysis for gender, 3,478 (97%) had a focal group sample size of less than 100, 67 (2%) had an item p-value greater than .95, and 26 (1%) had a subgroup p-value greater than .97. A total of 600 items were not included in the DIF analysis for race for any of the subgroups. Of the 11,980 item and focal group comparisons that were not included in the DIF analysis for race, 11,499 (96%) had a focal group sample size of less than 100, 178 (1%) had an item p-value greater than .95, and 303 (3%) had a subgroup p-value greater than .97. The majority of nonincluded comparisons come from ELA for both gender (n = 2,000; 56%) and race (n = 6,796; 57%).

Table 3.22: Comparisons Not Included in Differential Item Functioning Analysis for Gender, by Subject and Linkage Level
Subject and linkage level   Sample size (n)   Sample size (%)   Item proportion correct (n)   Item proportion correct (%)   Subgroup proportion correct (n)   Subgroup proportion correct (%)
English language arts
Initial Precursor 390 20.2   0   0.0   0   0.0
Distal Precursor 472 24.4   0   0.0   0   0.0
Proximal Precursor 441 22.8   0   0.0   2   9.5
Target 278 14.4 10 21.3   4 19.0
Successor 351 18.2 37 78.7 15 71.4
Mathematics
Initial Precursor 309 20.0   0   0.0   0   0.0
Distal Precursor 271 17.5   0   0.0   0   0.0
Proximal Precursor 330 21.3 12 60.0   1 20.0
Target 317 20.5   6 30.0   2 40.0
Successor 319 20.6   2 10.0   2 40.0

Table 3.23: Comparisons Not Included in Differential Item Functioning Analysis for Race, by Subject and Linkage Level
Subject and linkage level   Sample size (n)   Sample size (%)   Item proportion correct (n)   Item proportion correct (%)   Subgroup proportion correct (n)   Subgroup proportion correct (%)
English language arts
Initial Precursor 1,311 20.2     0   0.0   1   0.6
Distal Precursor 1,509 23.2     0   0.0 10   5.9
Proximal Precursor 1,482 22.8     0   0.0 37 21.9
Target    998 15.4   27 21.3 46 27.2
Successor 1,200 18.5 100 78.7 75 44.4
Mathematics
Initial Precursor    890 17.8     0   0.0   0   0.0
Distal Precursor    742 14.8     0   0.0 14 10.4
Proximal Precursor 1,076 21.5   32 62.7 28 20.9
Target 1,169 23.4   13 25.5 42 31.3
Successor 1,122 22.4     6 11.8 50 37.3

3.3.3.2.1 Uniform Differential Item Functioning Model

A total of 395 items for gender were flagged for evidence of uniform DIF. Additionally, 903 item and focal group combinations across 751 items for race were flagged for evidence of uniform DIF. Table 3.24 and Table 3.25 summarize the total number of combinations flagged for evidence of uniform DIF by subject and grade for gender and race, respectively. The percentage of combinations flagged for uniform DIF ranged from 8% to 21% for gender and from 8% to 23% for race.

Table 3.24: Combinations Flagged for Evidence of Uniform Differential Item Functioning for Gender
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 15 176   8.5 0
4 23 186 12.4 0
5 26 185 14.1 0
6 28 188 14.9 0
7 23 187 12.3 0
8 22 185 11.9 0
10   6   39 15.4 0
11 27 183 14.8 0
9–10 25 138 18.1 0
Mathematics
3 22 151 14.6 0
4 34 162 21.0 0
5 23 165 13.9 0
6 24 150 16.0 0
7 24 122 19.7 0
8 28 162 17.3 0
9 11 141   7.8 0
10 14 134 10.4 0
11 20 106 18.9 0

Table 3.25: Combinations Flagged for Evidence of Uniform Differential Item Functioning for Race
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 61 466 13.1 0
4 62 502 12.4 0
5 58 476 12.2 0
6 48 462 10.4 0
7 56 474 11.8 0
8 56 425 13.2 0
10 11   80 13.8 0
11 57 429 13.3 0
9–10 72 318 22.6 0
Mathematics
3 62 426 14.6 0
4 53 463 11.4 0
5 43 477   9.0 0
6 61 432 14.1 0
7 55 358 15.4 0
8 74 474 15.6 0
9 22 292   7.5 0
10 24 201 11.9 0
11 28 285   9.8 0

For gender, using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender term was added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender term was added to the regression equation.

The results of the DIF analyses for race were similar to those for gender. When using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the race term was added to the regression equation. Similarly, when using the Jodoin and Gierl (2001) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the race term was added to the regression equation.

3.3.3.2.2 Nonuniform Differential Item Functioning Model

A total of 452 items were flagged for evidence of nonuniform DIF for gender when both the gender and interaction terms were included in the regression equation. Additionally, 1,000 item and focal group combinations across 810 items were flagged for evidence of nonuniform DIF when both the race and interaction terms were included in the regression equation. Table 3.26 and Table 3.27 summarize the number of combinations flagged by subject and grade. The percentage of combinations flagged ranged from 8% to 60% for gender and from 10% to 24% for race.

Table 3.26: Items Flagged for Evidence of Nonuniform Differential Item Functioning for Gender
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 14 176   8.0 0
4 28 186 15.1 0
5 23 185 12.4 0
6 33 188 17.6 0
7 26 187 13.9 0
8 23 185 12.4 0
10   6   39 15.4 0
11 30 183 16.4 0
9–10 22 138 15.9 0
11–12   3     5 60.0 0
Mathematics
3 28 151 18.5 0
4 37 162 22.8 0
5 31 165 18.8 0
6 29 150 19.3 0
7 31 122 25.4 0
8 32 162 19.8 0
9 20 141 14.2 0
10 15 134 11.2 0
11 21 106 19.8 0

Table 3.27: Items Flagged for Evidence of Nonuniform Differential Item Functioning for Race
Grade Items flagged (n) Total items (N) Items flagged (%) Items with moderate or large effect size (n)
English language arts
3 54 466 11.6 0
4 67 502 13.3 0
5 73 476 15.3 0
6 45 462   9.7 0
7 53 474 11.2 0
8 61 425 14.4 1
10   9   80 11.2 0
11 66 429 15.4 0
9–10 66 318 20.8 0
Mathematics
3 58 426 13.6 0
4 80 463 17.3 0
5 46 477   9.6 0
6 61 432 14.1 0
7 87 358 24.3 0
8 88 474 18.6 0
9 28 292   9.6 0
10 24 201 11.9 0
11 34 285 11.9 0

Using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation.

The results of the DIF analyses for race were similar to those for gender. When using the Zumbo and Thomas (1997) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the race and interaction terms were added to the regression equation. Similarly, when using the Jodoin and Gierl (2001) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the race and interaction terms were added to the regression equation.

Information about the flagged items with a nonnegligible change in effect size after adding both the group and interaction terms is summarized in Table 3.28, where B indicates a moderate effect size and C indicates a large effect size. The test development team reviews all items flagged with a moderate or large effect size. In total, one combination had a large effect size. The \(\beta_3\text{X}G\) values in Table 3.28 indicate which group was favored at lower and higher numbers of linkage levels mastered. All combinations favored the focal group at lower numbers of total linkage levels mastered and the reference group at higher numbers of total linkage levels mastered.

Table 3.28: Combinations Flagged for Nonuniform Differential Item Functioning (DIF) With Moderate or Large Effect Size
Item ID Focal Grade EE \(\chi^2\) \(p\)-value \(\beta_2G\) \(\beta_3\text{X}G\) \(R^2\) Z&T* J&G*
English language arts
54118 Asian 8 ELA.EE.RL.8.3 10.74 .005 0.52 −0.05   .747 C C
Note. EE = Essential Element; \(\beta_2G\) = the coefficient for the group term (Equation (3.3)); Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
* Effect-size measure: A indicates evidence of negligible DIF, B indicates evidence of moderate DIF, and C indicates evidence of large DIF.

3.4 Conclusion

During 2022–2023, the test development teams conducted an on-site item-writing workshop with continued virtual item writing, as well as virtual external review events. Overall, 398 testlets were written for ELA and mathematics. Additionally, 30 new texts were written for ELA. Following external review, the test development teams accepted as is 72% of ELA testlets and 40% of mathematics testlets, with the remaining testlets revised rather than rejected. The ELA test development team also accepted 10% of the developed ELA texts as is, with the remainder accepted after revision. Of the content already in the operational pool, most items had p-values within two standard deviations of the mean p-value for their EE and linkage level, zero items were flagged for nonnegligible uniform DIF, and one item was flagged for nonnegligible nonuniform DIF. Field-testing in 2022–2023 focused on collecting data to refresh the operational pool of testlets.