5 Modeling

Chapter 5 of the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022) provides a complete description of the psychometric model used to calibrate and score the DLM assessments, including the psychometric background, the structure of the assessment system, suitability for diagnostic modeling, and a detailed summary of the procedures used to calibrate and score DLM assessments. This chapter provides a high-level summary of the model used to calibrate and score assessments, along with a summary of updated modeling evidence from the 2022–2023 administration year.

5.1 Psychometric Background

Learning maps, which are networks of sequenced learning targets, are at the core of the DLM assessments in English language arts and mathematics. Because of the underlying map structure and the goal of providing more fine-grained information beyond a single raw or scale score value, student results are reported as a profile of skill mastery. This profile is created using diagnostic classification modeling (e.g., Bradshaw, 2016), which draws on research in cognition and learning theory to provide information about student mastery of multiple skills measured by the assessment. Results are reported for each alternate content standard, called an Essential Element (EE), at the five levels of complexity (“linkage levels”) for which assessments are available: Initial Precursor, Distal Precursor, Proximal Precursor, Target, and Successor.

Each linkage level is calibrated separately for each EE using a log-linear cognitive diagnosis model (LCDM; Henson et al., 2009). Linkage levels are estimated separately because of the administration design, in which it is uncommon for students to take testlets at multiple levels for an EE. Also, because items are developed to meet a precise cognitive specification (see Chapter 3 of the 2021–2022 Technical Manual—Year-End Model; Dynamic Learning Maps Consortium, 2022), all master and nonmaster probability parameters for items measuring a linkage level are assumed to be equal. That is, all items are assumed to be fungible, or exchangeable, within a linkage level.
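
Under this single-attribute, fungible parameterization, the LCDM item response function takes a simple logistic form. As a sketch (the notation here is illustrative): writing \(\lambda_0\) for the intercept, \(\lambda_1\) for the main effect (constrained to be positive so that masters are at least as likely as nonmasters to respond correctly), and \(\alpha \in \{0, 1\}\) for mastery status, the probability of a correct response to any item \(i\) measuring the linkage level is

\[
P(X_i = 1 \mid \alpha) = \frac{\exp(\lambda_0 + \lambda_1 \alpha)}{1 + \exp(\lambda_0 + \lambda_1 \alpha)}.
\]

Setting \(\alpha = 0\) yields the conditional probability of a correct response for nonmasters, and \(\alpha = 1\) yields the corresponding probability for masters.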

The DLM scoring model for the 2022–2023 administration was implemented as follows. Each linkage level within each EE was considered the latent variable to be measured (the attribute). Using diagnostic classification models (DCMs), a probability of mastery on a scale of 0 to 1 was calculated for each linkage level within each EE. Students were then classified into one of two classes for each linkage level of each EE: master or nonmaster. As described in Chapter 6 and Chapter 7 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022), a posterior probability of at least .8 was required for mastery classification. Consistent with the assumption of item fungibility, a single set of probabilities was estimated for all items within a linkage level. Finally, only a single structural parameter (\(\nu\)) was needed: the probability that a randomly selected student who is assessed on the linkage level is a master. In total, three parameters per linkage level are specified in the DLM scoring model: a fungible intercept, a fungible main effect, and the proportion of masters.
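
To make the classification step concrete, the following is a minimal sketch of how a posterior probability of mastery could be computed for one linkage level from the three estimated parameters. The function, parameter values, and response string are illustrative assumptions, not the operational implementation.

```python
import math

def posterior_mastery(responses, intercept, main_effect, nu):
    """Posterior probability that a student is a master of one linkage level,
    assuming fungible items (a single intercept and main effect shared by
    all items measuring the level)."""
    # Conditional probabilities of a correct response for each latent class
    p_nonmaster = 1 / (1 + math.exp(-intercept))
    p_master = 1 / (1 + math.exp(-(intercept + main_effect)))

    # Likelihood of the observed response string under each class
    lik_master = math.prod(p_master if x else 1 - p_master for x in responses)
    lik_nonmaster = math.prod(p_nonmaster if x else 1 - p_nonmaster for x in responses)

    # Bayes' theorem, with the structural parameter nu as the prior
    # probability that a student assessed on this level is a master
    return nu * lik_master / (nu * lik_master + (1 - nu) * lik_nonmaster)

# Hypothetical example: five items, four answered correctly
prob = posterior_mastery([1, 1, 0, 1, 1], intercept=-1.0, main_effect=2.5, nu=0.5)
is_master = prob >= 0.8  # posterior probability of at least .8 required for mastery
```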

Once the LCDM parameters have been calibrated, student mastery probabilities are obtained for each assessed linkage level, and these probabilities are used to determine the highest linkage level mastered for each EE. Although connections between linkage levels are not modeled empirically, they are used in the scoring procedures (see Section 7.3.1). In particular, if the LCDM determines a student has mastered a given linkage level within an EE, then the student is assumed to have mastered all lower levels within that EE.
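
As a simplified illustration of this rule, the sketch below derives the highest linkage level mastered for one EE from a set of model-based classifications. The operational scoring procedures include additional rules not shown here (see Section 7.3.1).

```python
LINKAGE_LEVELS = [
    "Initial Precursor",
    "Distal Precursor",
    "Proximal Precursor",
    "Target",
    "Successor",
]

def highest_level_mastered(mastered):
    """Return the highest linkage level mastered for one EE.

    mastered: dict mapping linkage level name -> LCDM-based classification.
    Because mastery of a level implies mastery of all lower levels, the
    highest classified level determines the student's reported mastery.
    """
    highest = None
    for level in LINKAGE_LEVELS:  # ordered lowest to highest
        if mastered.get(level, False):
            highest = level
    return highest

# Example: classified as a master at Initial and Proximal Precursor
profile = {"Initial Precursor": True, "Proximal Precursor": True}
print(highest_level_mastered(profile))  # "Proximal Precursor"
```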

5.2 Model Evaluation

Model fit and classification accuracy are critical to making valid inferences about student mastery. If the model used to calibrate and score the assessment does not fit the data well, results from the assessment may not accurately reflect what students know and can do. Model fit, also called absolute model fit (e.g., Chen et al., 2013), involves evaluating the alignment between the three parameters estimated for each linkage level and the observed item responses. Classification accuracy refers to how well the classifications represent the true underlying latent class. Because the accuracy of the assessment results (i.e., the classifications) is a prerequisite for the validity of inferences made from them, classification accuracy is perhaps the most crucial aspect of model evaluation from a practical and operational standpoint. Model fit and classification accuracy results from the 2022–2023 administration year are provided in the following sections.

For a complete description of the methods and process used to evaluate model fit and classification accuracy, see Chapter 5 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022).

5.2.1 Model Fit

Linkage levels were flagged for misfit if the adjusted posterior predictive p-value (ppp) was less than .05. Table 5.1 shows the percentage of models with acceptable model fit (i.e., \(ppp \geq .05\)) by linkage level. Across all linkage levels, 564 (69%) of the estimated models showed acceptable model fit. Misfit was not evenly distributed across the linkage levels: the lower linkage levels were flagged at a higher rate than the higher linkage levels. This is likely due to the greater diversity of the student population at the lower linkage levels (e.g., required supports, expressive communication behaviors), which may affect item response behavior.

To address the areas where misfit was detected, we are prioritizing test development for linkage levels flagged for misfit so that testlets contributing to misfit can be retired. For a description of item development practices, see Chapter 3 of the 2021–2022 Technical Manual—Year-End Model (Dynamic Learning Maps Consortium, 2022). We also plan to incorporate additional item quality statistics into the review of field test data to ensure that only items and testlets that conform to model expectations are promoted to the operational assessment. Overall, however, the fungible LCDMs appear to largely reflect the observed data. Model fit is evaluated on an annual basis and continues to improve over time as a result of adjustments to the pool of available content (i.e., improved item writing practices, retirement of testlets contributing to misfit). Finally, it should be noted that a linkage level flagged for model misfit may still have high classification accuracy, indicating that student mastery classifications can be accurate even in the presence of misfit.
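
As an illustration of this flagging criterion, the sketch below computes a posterior predictive p-value for one linkage level by comparing an observed discrepancy statistic against replicated data generated from posterior draws. The discrepancy statistic used here (the variance of students' total scores) and all names are assumptions for illustration; see Chapter 5 of the 2021–2022 Technical Manual—Year-End Model for the operational procedure.

```python
import numpy as np

rng = np.random.default_rng(2023)

def ppp(responses, draws):
    """Posterior predictive p-value for one linkage level.

    responses: (n_students, n_items) binary array of observed responses
    draws: list of posterior draws, each a dict with keys
           'p_master', 'p_nonmaster', and 'nu' (fungible parameters)
    """
    n_students, n_items = responses.shape
    observed = responses.sum(axis=1).var()  # discrepancy: variance of total scores
    extreme = 0
    for d in draws:
        # Sample each student's latent class from the base rate nu,
        # then simulate a replicated response matrix from that draw
        is_master = rng.random(n_students) < d["nu"]
        p_correct = np.where(is_master, d["p_master"], d["p_nonmaster"])
        replicated = rng.random((n_students, n_items)) < p_correct[:, None]
        if replicated.sum(axis=1).var() >= observed:
            extreme += 1
    return extreme / len(draws)  # flag the linkage level if ppp < .05
```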

Table 5.1: Number and Percentage of Models With Acceptable Model Fit for the 2022–2023 Administration Year (ppp ≥ .05)

Linkage level        English language arts, n (%)   Mathematics, n (%)
Initial Precursor    55 (57.3)                      11 (16.4)
Distal Precursor     58 (60.4)                      40 (59.7)
Proximal Precursor   76 (79.2)                      46 (68.7)
Target               82 (85.4)                      46 (68.7)
Successor            87 (90.6)                      63 (94.0)
Note. ppp = posterior predictive p-value; ppp ≥ .05 indicates acceptable model fit.

5.2.2 Classification Accuracy

Table 5.2 shows the number and percentage of models within each linkage level that demonstrated each category of classification accuracy. Across all estimated models, 612 linkage levels (75%) demonstrated at least fair classification accuracy. Results are fairly consistent across linkage levels, with no one level showing systematically higher or lower accuracy. As was the case for model misfit, linkage levels flagged for low classification accuracy are prioritized for test development.
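
One common model-based estimator of classification accuracy averages, across students, the probability that each student's classification matches the true latent class; the sketch below assumes this approach with the .8 mastery threshold described earlier, and may differ in detail from the operational method.

```python
def classification_accuracy(posterior_probs, threshold=0.8):
    """Estimate classification accuracy for one linkage level.

    posterior_probs: posterior mastery probabilities, one per student.
    A student is classified as a master when the posterior meets the
    threshold; the probability that this classification matches the true
    latent class is then p (if classified a master) or 1 - p (if not).
    """
    matches = sum(p if p >= threshold else 1 - p for p in posterior_probs)
    return matches / len(posterior_probs)

# Example: well-separated posteriors yield high estimated accuracy
print(classification_accuracy([0.95, 0.02, 0.88, 0.10]))  # 0.9275
```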

Table 5.2: Estimated Classification Accuracy by Linkage Level for the 2022–2023 Administration Year

                       Weak        Poor        Fair        Good        Very good   Excellent
Linkage level          (0.00–.54)  (.55–.82)   (.83–.88)   (.89–.94)   (.95–.98)   (.99–1.00)
English language arts
Initial Precursor      0 (0.0)     5 (5.2)     18 (18.8)   44 (45.8)   18 (18.8)   11 (11.5)
Distal Precursor       0 (0.0)     29 (30.2)   26 (27.1)   20 (20.8)   9 (9.4)     12 (12.5)
Proximal Precursor     0 (0.0)     43 (44.8)   29 (30.2)   14 (14.6)   6 (6.2)     4 (4.2)
Target                 0 (0.0)     29 (30.2)   37 (38.5)   18 (18.8)   6 (6.2)     6 (6.2)
Successor              0 (0.0)     27 (28.1)   23 (24.0)   25 (26.0)   15 (15.6)   6 (6.2)
Mathematics
Initial Precursor      0 (0.0)     0 (0.0)     14 (20.9)   42 (62.7)   11 (16.4)   0 (0.0)
Distal Precursor       0 (0.0)     24 (35.8)   24 (35.8)   15 (22.4)   3 (4.5)     1 (1.5)
Proximal Precursor     0 (0.0)     23 (34.3)   18 (26.9)   20 (29.9)   6 (9.0)     0 (0.0)
Target                 0 (0.0)     10 (14.9)   23 (34.3)   23 (34.3)   9 (13.4)    2 (3.0)
Successor              0 (0.0)     13 (19.4)   22 (32.8)   23 (34.3)   8 (11.9)    1 (1.5)
Note. Column ranges are classification accuracy values; cells show the number (percentage) of models in each category.

When absolute model fit and classification accuracy are examined in combination, linkage levels flagged for absolute model misfit often have high classification accuracy. Of the 251 linkage levels that were flagged for absolute model misfit, 230 (92%) showed fair or better classification accuracy. Thus, even when misfit is present, we can be confident in the accuracy of the mastery classifications. In total, 97% of linkage levels (n = 794) had acceptable absolute model fit and/or acceptable classification accuracy.

5.3 Calibrated Parameters

As stated previously in this chapter, the item parameters for diagnostic assessments are the conditional probabilities of masters and nonmasters providing a correct response. Because of the assumption of fungibility, parameters are estimated at the linkage level rather than for individual items, yielding one set of parameters for each of the 815 linkage levels across English language arts and mathematics. Parameters include a conditional probability of nonmasters providing a correct response and a conditional probability of masters providing a correct response. Across all linkage levels, the conditional probability that masters provide a correct response is generally expected to be high, while it is expected to be low for nonmasters. In addition to the item parameters, the psychometric model also includes a structural parameter, which defines the base rate of class membership for each linkage level. A summary of the operational parameters used to score the 2022–2023 assessment is provided in the following sections.

5.3.1 Probability of Masters Providing Correct Response

When items measuring each linkage level function as expected, students who have mastered the linkage level have a high probability of providing a correct response. Instances where masters have a low probability of providing correct responses may indicate that the linkage level does not measure what it is intended to measure or that students who have mastered the content select a response other than the key. These instances may result in students who have mastered the content providing incorrect responses and being incorrectly classified as nonmasters. This outcome has implications for the validity of inferences that can be made from results, including educators' use of results to inform instructional planning in the subsequent year.

Using the 2022–2023 operational calibration, Figure 5.1 depicts the conditional probability of masters providing a correct response to items measuring each of the 815 linkage levels. Because the point of maximum uncertainty is .50 (i.e., equal likelihood of mastery or nonmastery), masters should have a greater than 50% chance of providing a correct response. The results in Figure 5.1 demonstrate that the vast majority of the linkage levels (n = 799, 98%) performed as expected. Additionally, 93% of linkage levels (n = 760) had a conditional probability of masters providing a correct response over .60. Only a few of the linkage levels (n = 3, <1%) had a conditional probability of masters providing a correct response less than .40. Table 5.3 presents these three linkage levels, with the Successor linkage level being the most prevalent.

Figure 5.1: Probability of Masters Providing a Correct Response to Items Measuring Each Linkage Level for the 2022–2023 Administration Year

Table 5.3: Number and Percentage of Linkage Levels Where Masters Have a Conditional Probability Less Than .40 for the 2022–2023 Administration Year
Linkage level n (%)
Initial Precursor 0   (0.0)
Distal Precursor 0   (0.0)
Proximal Precursor 0   (0.0)
Target 1 (33.3)
Successor 2 (66.7)

5.3.2 Probability of Nonmasters Providing Correct Response

When items measuring each linkage level function as expected, nonmasters of the linkage level have a low probability of providing a correct response. Instances where nonmasters have a high probability of providing correct responses may indicate that the linkage level does not measure what it is intended to measure, or that the correct answers to items measuring the level are easily guessed. These instances may result in students who have not mastered the content providing correct responses and being incorrectly classified as masters.

Figure 5.2 summarizes the probability of nonmasters providing correct responses to items measuring each of the 815 linkage levels. There is greater variation in these probabilities than was observed for masters. While the majority of the linkage levels (n = 667, 82%) performed as expected, nonmasters sometimes had a greater than .50 chance of providing a correct response to items measuring the linkage level. Most of the linkage levels (n = 487, 60%) have a conditional probability of nonmasters providing a correct response less than .40; however, 48 (6%) have a conditional probability greater than .60, indicating that for these linkage levels, nonmasters are more likely than not to provide a correct response. This may indicate that the items (and, given the fungibility assumption, the linkage level as a whole, because a single intercept and main effect are estimated for all items within the level) were easily guessable or did not discriminate well between masters and nonmasters. Table 5.4 presents the 48 linkage levels with a conditional probability for nonmasters providing a correct response of greater than .60, with the Successor linkage level being the most prevalent.

Figure 5.2: Probability of Nonmasters Providing a Correct Response to Items Measuring Each Linkage Level for the 2022–2023 Administration Year

Table 5.4: Number and Percentage of Linkage Levels Where Nonmasters Have a Conditional Probability Greater Than .60 for the 2022–2023 Administration Year
Linkage level n (%)
Initial Precursor   0   (0.0)
Distal Precursor   2   (4.2)
Proximal Precursor   3   (6.2)
Target 16 (33.3)
Successor 27 (56.2)

5.3.3 Item Discrimination

The discrimination of a linkage level represents how well its items differentiate between masters and nonmasters. For diagnostic models, discrimination is assessed by comparing the conditional probabilities of masters and nonmasters providing a correct response. Highly discriminating linkage levels have a large difference between the conditional probabilities, with a maximum value of 1.00 (i.e., masters have a 100% chance of providing a correct response and nonmasters a 0% chance). Figure 5.3 shows the distribution of linkage level discrimination values. Overall, 71% of linkage levels (n = 578) have a discrimination greater than .40, indicating a large difference between the conditional probabilities (e.g., .75 for masters vs. .35 for nonmasters). However, 14 linkage levels (2%) had a discrimination of less than .10, indicating that masters and nonmasters tend to perform similarly on items measuring these linkage levels. Table 5.5 presents the 14 linkage levels with a discrimination of less than .10, with the Target linkage level being the most prevalent.
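
Given the fungible intercept and main effect calibrated for a linkage level, the discrimination described above can be computed directly from the two conditional probabilities; a minimal sketch with illustrative parameter values:

```python
import math

def discrimination(intercept, main_effect):
    """Difference between masters' and nonmasters' conditional
    probabilities of a correct response for one linkage level."""
    p_nonmaster = 1 / (1 + math.exp(-intercept))
    p_master = 1 / (1 + math.exp(-(intercept + main_effect)))
    return p_master - p_nonmaster

# Example: p_nonmaster ~= .18, p_master ~= .82, discrimination ~= .64
print(round(discrimination(-1.5, 3.0), 2))  # 0.64
```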

Figure 5.3: Difference Between Masters’ and Nonmasters’ Probability of Providing a Correct Response to Items Measuring Each Linkage Level for the 2022–2023 Administration Year

Table 5.5: Number and Percentage of Linkage Levels With Low Discrimination for the 2022–2023 Administration Year
Linkage level n (%)
Initial Precursor 0   (0.0)
Distal Precursor 2 (14.3)
Proximal Precursor 1   (7.1)
Target 6 (42.9)
Successor 5 (35.7)

5.3.4 Base Rate Probability of Class Membership

The base rate of class membership is the DCM structural parameter and represents the estimated proportion of students in each class for each EE and linkage level. A base rate close to .50 indicates that students assessed on a given linkage level are, a priori, equally likely to be masters or nonmasters. Conversely, a high or low base rate indicates that students testing on a linkage level are, a priori, more or less likely to be masters, respectively. Figure 5.4 depicts the distribution of the base rate probabilities. Overall, the distribution is roughly normal, with 70% of linkage levels (n = 567) exhibiting a base rate of mastery between .25 and .75. This indicates that students are most often assessed on linkage levels where they have an approximately equal likelihood of mastery. On the edges of the distribution, 107 linkage levels (13%) had a base rate of mastery less than .25, and 141 linkage levels (17%) had a base rate higher than .75. Among the linkage levels without an approximately equal likelihood of mastery, more had a high base rate than a low one, suggesting that students are somewhat more likely to be assessed on linkage levels they have mastered than on those they have not.

Figure 5.4: Base Rate of Linkage Level Mastery for the 2022–2023 Administration Year

5.4 Conclusion

In summary, the DLM modeling approach uses well-established research in Bayesian inference networks and diagnostic classification modeling to determine student mastery of the skills measured by the assessment. A DCM is estimated for each linkage level of each EE to determine the probability of student mastery. Owing to the conceptual approach used to construct DLM testlets, items within a linkage level are assumed to be fungible and are estimated with equivalent item probability parameters for masters and nonmasters. Analyses of the 2022–2023 calibration indicate that the estimated models generally have acceptable levels of absolute model fit and classification accuracy and that the estimated parameters are generally within optimal ranges. We use the results for model fit, classification accuracy, and estimated parameters to continually improve the DLM assessments (e.g., by improving item writing practices and retiring testlets contributing to misfit).