STEEP – Research

The STEEP model has a solid research base, with multiple studies published in top peer-reviewed scholarly journals. Hence, STEEP goes beyond being merely “research-based.” For background on the difference between peer-reviewed research and being merely research-based, click here. STEEP is a program that includes several components (e.g., intervention and progress monitoring). It was built by researchers on a solid foundation of research. Hence, in creating or selecting components, it was important that each part of the program have strong supporting research. However, the fact that each part is effective does not mean that the system as a whole will produce meaningful outcomes for students. Hence, the process as a whole has also been evaluated. Most of the studies reviewed below are published in peer-reviewed scholarly journals and meet the criteria for scientifically based research.

Review of Research Support for the STEEP RTI Program

In describing the research base for STEEP, the information will be divided into (a) the research support for the program as an integrated process and (b) the research support for the individual components.

Research on the Model as a Whole

Researchers have investigated and evaluated the STEEP model as a whole. This type of research is needed because it is possible to select the very best screening procedures, the very best progress monitoring procedures, and the best research-based intervention and still not produce good academic outcomes for students if the various components do not work well together.

Improving referral accuracy. The goal of RTI is to improve achievement. However, as a result of screening and improved achievement, the need for special education is reduced. With respect to reducing referrals, it is more precise to say that STEEP increases the accuracy of referrals. Amanda VanDerHeyden (University of California-Santa Barbara, see Note 1) and her colleagues (VanDerHeyden et al., 2003) studied students in the Southeastern U.S. and used a comprehensive assessment and intervention process to establish a “gold standard” as to whether a child truly did or did not have a problem. For the purpose of identifying students for special education, teacher referral was accurate 19% of the time in 406 cases, whereas the STEEP process was approximately 3 times more accurate than teacher referral and various other screening methods and tests. Given that teacher referral is so important in a traditional problem-solving model, the researchers concluded that data-based decision making involving universal screening plays an important role in determining who needs assistance. In particular, given the finding that teacher referral was accurate less than 20% of the time and given the importance assigned to teacher judgment, it seems important to take a closer look at a broader range of variables. In addition, this same study found that classroom context significantly and negatively impacted the accuracy of teacher referral. That is, teachers became much less accurate at identifying students who did and did not have a problem in both low-achieving and high-achieving classrooms (as compared with “normally” achieving classrooms), whereas STEEP maintained or achieved even greater accuracy across those contexts.

More recently, VanDerHeyden et al. (2007, Journal of School Psychology) studied students in the Western United States and showed reductions in referrals and improvements in achievement as STEEP was introduced sequentially, one school at a time, across 5 schools within one district. They also found that the quality of referrals increased. That is, fewer students were referred, and those who did not respond to the STEEP program were more likely to qualify for special education. The program had a generally positive effect with respect to disproportionality in terms of ethnicity and language proficiency. This paper was awarded best scholarly article of the year in 2008 by the Society for the Study of School Psychology.

Over-identification and disproportionality. VanDerHeyden and Witt (2005, School Psychology Review) examined the effect of STEEP in situations where there were either many high-achieving students or proportionately high numbers of low-achieving students. The findings indicated that teacher referral is markedly affected by the situation. For example, the “low” student in a high-achieving classroom may get referred even though that student is still in the normal range. Such students “stand out” to the teacher because they are low relative to high-performing peers. STEEP places an objective lens on the situation and is much more accurate regardless of context. A very interesting finding in this study was that minority children (who were primarily African American) were disproportionately represented as “low achievers” and fell into the bottom 25% of classes. However, the minority students were more likely to show rapid acceleration of learning when given a strong intervention. The researchers hypothesized that the quality of the intervention used may have been more in line with the needs of minority students than was their core curriculum.

Improving achievement in general education. VanDerHeyden and Burns (2005, Assessment for Effective Intervention) found that STEEP intervention procedures produced statistically significant gains in math performance for at-risk students. This study, along with VanDerHeyden et al. (2006), will be of interest to principals and teachers because the studies show the importance and relevance of RTI to general education. VanDerHeyden and Burns found that CBM assessment and intervention produced significant achievement gains in math and statistically significant improvements in scores on the Arizona state test. RTI is best viewed as an instructional model, and these studies show that RTI can produce gains for all students. A “side effect” of improved achievement is a reduced need for special education and a reduction in problems such as disproportionality and over-referral.

Research on the Components of STEEP

In addition to being evaluated as an integrated process, the various components of STEEP (screening, intervention and progress monitoring) have undergone separate testing. Each of the components will be discussed separately.

Universal Screening and Progress Monitoring

Universal screening and progress monitoring procedures have undergone extensive testing by a large number of researchers. These procedures rely on curriculum-based measurement (CBM), which has been in use for many years; hundreds of studies have supported its use in decision-making. Books by Shinn (1985) or Shapiro (1986) provide detailed reviews of this extensive literature.

Passage leveling. The STEEP benchmark assessment probes have been developed using a 3-step process. First, we purchased a research database containing 5 million words. The words came from an exhaustive study of several thousand books read by students at various grade levels. The study reported frequencies of all words used for each grade level. From the list of high frequency words for each grade level, words of high and medium frequency were selected. STEEP probes were then written with those words. Second, the probes were checked for readability. Spache readability was used for Grades 1-3 and Dale-Chall readability was used for Grades 4 and up. These readability procedures have been shown to be the best for the specific grades. Finally, the probes were teacher tested and researcher evaluated. This process produces a probe that has high generality. That is, because the words used are based upon the words students see every day, students’ ability to read STEEP probes is highly predictive of how well they will read every day. A probe that is too difficult lacks sensitivity and, in screening, will over-identify students as in need of intervention. Most forms of CBM are adequate but some are better than others. Reports of DIBELS, for example, indicate that the first Oral Reading Fluency probe that students encounter at First Grade has a Spache readability of 2.5 (meaning a middle of second grade reading level). If a probe is too difficult, students will read fewer words and more students will be identified as “at risk.” This means a school will identify more students as needing intervention, thus consuming more school resources.
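To make the readability check more concrete, the following is a minimal sketch of how a raw Dale-Chall score can be computed for a passage. It is not the STEEP leveling procedure itself. The function name, the simple tokenization, and the tiny familiar-word list are assumptions made for illustration; the published Dale-Chall formula relies on a list of roughly 3,000 familiar words, and the raw score is then mapped to a grade band using a published table that is not reproduced here.

```python
import re

def dale_chall_score(text, familiar_words):
    """Raw Dale-Chall score for `text`.

    `familiar_words` stands in for the published Dale-Chall list of
    familiar words; any word not in it is counted as "difficult".
    """
    # Rough sentence and word splitting (an assumption, not a standard tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence")

    difficult = sum(1 for w in words if w not in familiar_words)
    pct_difficult = 100 * difficult / len(words)
    avg_sentence_length = len(words) / len(sentences)

    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_length
    if pct_difficult > 5:
        score += 3.6365  # adjustment applied when difficult words exceed 5%
    return score

# Tiny illustrative word list (hypothetical; the real list is far longer).
familiar = {"the", "cat", "sat", "dog", "ran"}
easy = dale_chall_score("The cat sat. The dog ran.", familiar)
hard = dale_chall_score("The cat sat. The extraordinary dog ran.", familiar)
print(round(easy, 2), round(hard, 2))
```

A single unfamiliar word in a short passage moves the score sharply, which is one reason a formula score is best treated as a general estimate rather than a precise level.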

Readability. As noted above, each of the STEEP benchmark and progress monitoring passages is leveled using a readability formula. At one point we began to suspect that readability may not be the best method to level a passage and we became interested in studying readability formulas. Scott Ardoin led a study (Ardoin, Suldo, Witt, Aldrich, & McDonald, 2005) in which we evaluated the accuracy of many different methods for estimating passage readability. The research indicated that our concerns were justified in that many readability methods are not very accurate. We continue to use readability in constructing our passages; however, we have bolstered it with the methods described above. We also continue to look for improved methods of leveling a reading passage. The use of a readability formula is unlikely to ever yield an accurate estimate for any one student. This is because readability is more or less a norm-referenced concept. In general, most fourth grade students can read the word “horse.” However, a particular student may have difficulty with that word. Therefore, passages leveled with a readability formula will be appropriate “in general” but there will be some individual differences between students.

Three vs. One Benchmark Probe. With universal screening, some systems utilize a process whereby each student receives three benchmark probes and the median is taken. With STEEP, only one benchmark probe is used. This, combined with more efficient administration procedures, means that STEEP takes less than one-third the time of DIBELS and other procedures. The question, however, is whether valid results are still obtained using only one probe. A study by Scott Ardoin (University of South Carolina), Joe Witt and colleagues (which was awarded School Psychology Review article of the year by the Editorial Board for the journal in 2005) indicated that one probe yields equivalent results to three probes. Other major screening and progress monitoring products have subsequently converted to one probe for screening.

It should be mentioned, however, that one probe is not sufficient for progress monitoring. One probe results in too much “bounce” in the data, and this is very problematic when intervening and making important decisions about progress in the context of RTI. STEEP therefore incorporates three probes for progress monitoring.
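The contrast between the two scoring approaches can be sketched briefly. The function names and sample scores below are hypothetical, and the sketch assumes that three progress monitoring probes are summarized by their median, as in the three-probe screening systems described above; the text does not state how STEEP combines them.

```python
from statistics import median

def screening_score(probe_scores):
    """Screening uses a single benchmark probe, so the score is that probe's result."""
    if len(probe_scores) != 1:
        raise ValueError("screening expects exactly one probe score")
    return probe_scores[0]

def progress_monitoring_score(probe_scores):
    """Three probes summarized by the median (an assumed summary rule).

    The median ignores a single unusually high or low probe, which
    reduces "bounce" from one occasion to the next.
    """
    if len(probe_scores) != 3:
        raise ValueError("progress monitoring expects exactly three probe scores")
    return median(probe_scores)

# Hypothetical words-correct-per-minute scores for one student.
print(screening_score([55]))
print(progress_monitoring_score([42, 61, 48]))
```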

Can’t Do/Won’t Do Assessment. The can’t do/won’t do assessment has also undergone research and evaluation, and peer-reviewed research supporting its use has been published in scholarly journals. Duhon, Noell, Witt et al. (2004) found that the procedure identified the correct intervention to use in all cases. An earlier study by Noell, Gansle, Witt and colleagues (1998) yielded similar results. Additional information about these procedures is available in: Noell, G. H., Gansle, K. A., Witt, J. C., Whitmarsh, E. L., Freeland, J. T., LeFleur, L. H., Gilbertson, D. A., & Northup, J. (1998). Effects of contingent reward and instruction on oral reading performance at differing levels of passage difficulty. Journal of Applied Behavior Analysis, 31, 659-664; and Duhon, G. J., Noell, G. H., Witt, J. C., Freeland, J. T., Dufrene, B. A., & Gilbertson, D. N. (2004). Identifying academic skills and performance deficits: The experimental analysis of brief assessments of academic skills. School Psychology Review, 33, 429-443.

Progress Monitoring

The STEEP intervention progress monitoring system involves setting a goal, drawing an aimline and monitoring progress relative to standard decision rules. The intervention manual provides suggested progress goals for setting modest, reasonable and ambitious goals. These goals were based upon an integration of published studies pertaining to student progress. Once progress monitoring begins, the rate of progress for an individual student is evaluated relative to the aimline using standards established and recommended by Stan Deno of the University of Minnesota, as well as end-of-year percentile standards.
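The aimline idea can be illustrated with a short sketch. The aimline is taken to be the straight line from a baseline score to the goal score. The decision rule shown (flag the intervention for review when the four most recent scores all fall below the aimline) is an assumption chosen for illustration; the actual STEEP decision rules and goal values are those given in the intervention manual, and all names and numbers below are hypothetical.

```python
def aimline(baseline, goal, total_weeks):
    """Return a function giving the expected score at a given week.

    The aimline is the straight line from the baseline score (week 0)
    to the goal score (week `total_weeks`).
    """
    slope = (goal - baseline) / total_weeks
    return lambda week: baseline + slope * week

def needs_change(scores_by_week, expected, run_length=4):
    """Assumed rule: flag when the latest `run_length` scores are all below the aimline."""
    recent = scores_by_week[-run_length:]
    if len(recent) < run_length:
        return False  # not enough data points yet to apply the rule
    return all(score < expected(week) for week, score in recent)

# Hypothetical student: baseline of 40, goal of 70 after 10 weeks.
expected = aimline(baseline=40, goal=70, total_weeks=10)
observed = [(1, 44), (2, 45), (3, 46), (4, 47), (5, 48)]  # (week, score) pairs
print(needs_change(observed, expected))
```

In this example the student is improving, but more slowly than the aimline requires, so the assumed rule flags the intervention for review.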

Intervention Selection

Within RTI, there are two basic approaches: the problem solving approach and the standard protocol model. The problem solving model calls on the team, through discussion and brainstorming, to identify student needs and to determine an appropriate intervention. With the standard protocol model, each step, including the selection of an appropriate intervention, is guided by research-based decision rules. Using research to guide decisions means that you connect an academic problem with an intervention that is known to be effective for that problem. This increases the likelihood that the correct intervention is matched with each problem. The STEEP program uses a standard protocol method, which uses data to recommend a specific intervention to match the student’s unique needs. To implement a standard protocol for intervention selection, one needs an instructional model, an assessment that determines student status within the instructional model, and research that shows that students with a specific status in the instructional model improve more with specific interventions and improve less or not at all with other interventions. STEEP intervention selection is based upon an instructional model called the Instructional Hierarchy. Support for the STEEP standard protocol comes from research by Duhon (2006), who found that use of the intervention indicated by the standard protocol was markedly superior to using an intervention that was not matched via protocol to student needs. Numerous other supporting studies have been conducted by Ed Daly and his colleagues.

Intervention Fidelity

Intervention fidelity means that the intervention is implemented as intended. The STEEP program incorporates an implementation protocol to enhance fidelity. This protocol was created by Joe Witt who has published over 30 research studies on intervention implementation in schools. The protocol incorporates 3 practices that have been shown to improve implementation of school based interventions:

(a) an implementation protocol to help prevent problems,
(b) monitoring of fidelity using permanent products, and
(c) periodic review and performance management.

STEEP Outcomes

The most important outcome of RTI is to improve student achievement. Hence, the RTI process for a district must have a clear focus on student outcomes. There are many tools available to assist districts to conduct RTI. However, some tools focus only on screening and progress monitoring. Screening and progress monitoring are parts of RTI but they are merely assessment. There is an old saying that “Weighing a cow does not make it fatter.” Weighing does not help the cow to grow; eating grass does help the cow to grow. Similarly, assessment does not improve achievement; instruction improves achievement. Because districts selecting an RTI model need answers to the following questions, it was important to us to have research that allows us to say “YES” to each of them:
Does your RTI program improve achievement of students in general education?
Does your RTI program reduce the need for special education placement?
Does your RTI program help you select an appropriate intervention to respond to unique student instructional needs?
Does your RTI program have a positive effect on disproportionality in special education?
Is your RTI program reliable and valid, and does it increase the accuracy of the referral and placement process over traditional methods such as teacher referral as well as traditional screening and testing?
See above for a review of STEEP outcomes.

Research in Middle and High School Assessments


Reading Maze

The maze assessments have a small but growing body of published literature supporting their psychometric adequacy. Existing studies on other instruments provide indirect evidence of the validity of the STEEP tools. In addition, iSTEEP has conducted studies with our own published tools that provide direct evidence of psychometric adequacy. The studies indicate that the assessments meet or exceed accepted standards for reliability and validity. The assessments have benchmarks which derive from a national norming process and from ROC (receiver operating characteristic) analyses used to set benchmarks that maximize classification accuracy (i.e., the accuracy of classifying students as needing Tier 2 intervention).

The iSTEEP maze is a measure of basic reading performance and comprehension, which is administered to students individually or in groups. Maze requires a student to read text wherein words have been omitted. For maze, the student is asked to select, in a multiple-choice format, the one word that best completes the sentence. Students have three minutes to complete the maze. A student’s sentence maze score comprises the total correct words marked in three minutes. The maze items were created from grade level text and are designed to place moderate demands on comprehension. They require that the student be monitoring closely the meaning of what they read.

The reliability of the assessments has been evaluated using a variety of methods and procedures. For the maze assessment, reliability has been assessed using the test-retest, internal consistency and alternate form methods. The test-retest method calls for the same probe to be administered twice, and the results are then correlated. In the alternate form method, students are administered different forms within the same grade level, and the results of the two forms are correlated. In one study using the STEEP probes, 350 students (i.e., approximately 50 students at each grade level 6-12) were included in a study of both test-retest and alternate form reliability. This study was conducted with students in two states. In this study, students in Grades 6-12 were first administered six alternate forms of a grade-appropriate probe. These forms were all administered on the same day in 1-2 trials (i.e., for some students, because of attention span and/or fatigue, administration of the six forms was spread across two opportunities within the same day). The same six forms were re-administered on a second occasion within 5-7 days of the first occasion. This provided data on the same forms administered on different occasions and alternate forms administered on both the same occasion and different occasions. The median test-retest reliability (i.e., the correlation between same forms administered to the same students on different occasions) across the grades was .90. Median alternate form reliability was .87. Using Cronbach’s Coefficient Alpha, a measure of internal consistency, the median reliability coefficient was .92 across the various measures.
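For readers unfamiliar with these coefficients, the sketch below shows how they are typically computed. Test-retest and alternate form reliability are both Pearson correlations between two sets of scores from the same students; coefficient alpha compares the sum of the item variances with the variance of the total score. The function names and the sample scores are hypothetical and are not data from the study described above.

```python
from statistics import mean, variance

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(items):
    """Coefficient alpha.

    `items` is a list of columns, one list of scores per item,
    with students in the same order in every column.
    """
    k = len(items)
    totals = [sum(student) for student in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical maze scores for five students.
form_a_day1 = [10, 14, 18, 22, 26]
form_a_day2 = [11, 15, 17, 23, 27]   # same form, second occasion
form_b_day1 = [9, 15, 19, 20, 27]    # alternate form, same occasion

test_retest = pearson_r(form_a_day1, form_a_day2)
alternate_form = pearson_r(form_a_day1, form_b_day1)
print(round(test_retest, 2), round(alternate_form, 2))
```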

Validity studies on the STEEP probes have centered on the predictive and concurrent validity of the probes using external criterion measures. Generally the criterion has consisted of either a state accountability test or a well-known, nationally standardized achievement test such as the Woodcock Johnson.

Results of the validity studies support the validity of the maze probes. Using state tests as a criterion, median concurrent validity coefficients for grades 7-9 were in the middle to middle .60’s while the median predictive validity coefficient was in the low .60’s. When using standardized achievement tests as the criterion, median concurrent validity coefficients were in the mid .60’s. In one study of students in grades 9-12, median concurrent validity with a nationally standardized achievement test was .54.

Advanced Literacy

The iSTEEP Advanced Literacy assessment is a standards-based assessment that provides a rigorous appraisal of reading comprehension. All items are text based and derive from well-crafted texts and stories. The assessments require students to read and understand original sources “worth reading,” most from well-known authors. In addition to literary works, the assessment also includes published informational texts written by authorities from diverse areas such as biography, history, science, and American government. Texts were selected carefully to meet rigorous evidence-based standards.

A primary consideration in evaluating standards-based assessments is content validity. In an independent evaluation of content validity conducted by the Louisiana Department of Education, iSTEEP received a Tier 1 rating, indicating “Exemplifies Quality,” met all non-negotiable criteria, and received the best possible score on all indicators of superior quality. The review examined multiple indicators of the extent to which the assessment content was standards aligned and the assessment items fulfilled the scope and intent of the standards.

Benchmarks and Norms

To assist in the interpretation of the STEEP probes, benchmarks are provided which guide users in identifying students who may need intervention. While national norms are also available, using benchmarks for interpretation is usually more meaningful. Whereas norms help users evaluate performance relative to a normative group, the benchmarks are calibrated with reference to important criterion assessments such as performance on a state test.

The studies to derive benchmarks involved examining classification data from ROC (receiver operating characteristic) analysis. These types of studies also represent a type of validity study. With all types of validity, the following question is important: valid for what purpose? We seek validity for the purpose of screening, where the goal is to accurately “rule out” those students who don’t have a reading problem. We also seek respectable levels of sensitivity and specificity. To determine cut points with these goals in mind, we conducted a series of studies using ROC analysis. These studies required two sets of scores. First, STEEP benchmark screening scores were collected from universal screening. Second, state test scores were obtained for the same students along with the state standard for satisfactory/unsatisfactory performance. Cut scores were set in these studies by examining ROC curves, area under the curve (AUC), sensitivity/specificity, as well as negative and positive predictive power. Classification indices were calculated for several ranges of ORF cut scores to observe changes that occurred when scores were adjusted along the horizontal and vertical axes on the ROC curve. A priority was to minimize false negatives but with consideration to false positives. For example, it is possible to eliminate false negative errors in setting cut scores. However, in most cases, allowing a small false negative error rate allows for a marked reduction in false positive errors. In short, we wanted to minimize false negatives without causing false positives to become unwieldy.
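The trade-off described above can be illustrated with a small sketch that computes the classification indices for several candidate cut scores. Here a student is flagged as at risk when the screening score falls below the cut, and the “true” status is an unsatisfactory result on the criterion test. The selection rule (require a minimum sensitivity, then take the candidate with the highest specificity) and the 0.8 floor are assumptions for illustration, not the procedure used in the STEEP studies; the data and function names are hypothetical, and the sketch assumes the sample contains both satisfactory and unsatisfactory students.

```python
def classification_indices(scores, failed, cut):
    """Classification indices for one cut score.

    A student is flagged as at risk when score < cut. `failed` is True
    for students with an unsatisfactory result on the criterion test.
    """
    pairs = list(zip(scores, failed))
    tp = sum(1 for s, f in pairs if s < cut and f)        # flagged, truly at risk
    fn = sum(1 for s, f in pairs if s >= cut and f)       # missed (false negative)
    fp = sum(1 for s, f in pairs if s < cut and not f)    # flagged unnecessarily
    tn = sum(1 for s, f in pairs if s >= cut and not f)   # correctly ruled out
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else None,
        "npv": tn / (tn + fn) if tn + fn else None,
    }

def choose_cut(scores, failed, candidates, min_sensitivity=0.8):
    """Assumed rule: among cuts meeting the sensitivity floor, maximize specificity."""
    results = [(classification_indices(scores, failed, c), c) for c in candidates]
    eligible = [(idx, c) for idx, c in results if idx["sensitivity"] >= min_sensitivity]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[0]["specificity"])[1]

# Hypothetical screening scores and criterion outcomes for ten students.
scores = [8, 10, 12, 14, 15, 17, 19, 21, 23, 25]
failed = [True, True, True, False, True, False, False, False, False, False]

for cut in (13, 16, 20):
    print(cut, classification_indices(scores, failed, cut))
print(choose_cut(scores, failed, candidates=[13, 16, 20]))
```

In this toy data, raising the cut from 13 to 16 removes the one false negative at the cost of one false positive, while raising it further to 20 only adds false positives.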

The classification accuracy studies involved samples of approximately 100 students at each grade and were conducted using a nationally normed achievement test or state accountability test as a criterion. The results of the studies suggest the classification accuracy of the assessments was adequate for purposes of universal screening. Across the grades, the median index for specificity was .91 and the index for sensitivity was .72. Median area under the curve for ROC analysis was .72.

National norms have also been computed. Samples consisted of approximately 3200 students per grade across 12 states. Using the STEEP data system, it is possible to compute percentiles and other derived scores.


In summary, the reliability and validity of the STEEP maze probes are considered adequate for purposes of screening and generally exceed accepted standards. Classification accuracy is considered good, indicating that the STEEP probes can accomplish their primary function, which is to accurately classify students as needing or not needing intervention.

Middle and High School Math Assessments

For Grades 6-12, iSTEEP has 2 primary measures for mathematics: Advanced Numeracy and Math Concepts and Applications. The Advanced Numeracy assessment assesses 3-5 key skills for each grade. These skills are fundamental for the grade and serve as a research-based foundation for grades beyond the grade being assessed. The reliability and validity of the assessments have been researched by iSTEEP and by independent researchers. In test validation studies, iSTEEP found a version of the Advanced Numeracy to have moderate to high correlations (.65 to .80) with state test scores as the outcome measure in a variety of states. Independent researchers including Codding and Connell (2010, from University of Massachusetts and the University of Pennsylvania) found that a version of the Math Common Core had correlations with the MCAS (Massachusetts state test) that ranged from .606 for sixth grade to .618 for eighth grade. The reliability of the Common Core assessment was also investigated, and reliability estimates, using coefficient alpha, exceeded .90 for all grades. Since the focus of iSTEEP assessments is screening, there is a core group of skills that remains constant across state content standards and evolving national practices. As states solidify their assessments around Common Core, additional validity studies may be needed. In the meantime, districts should review iSTEEP assessments to ensure that they meet district and state content guidelines.

The Math Concepts and Applications assessment taps key math concepts in grades 6-12. It is designed as a screening assessment to assess math skills and prerequisite skills for middle and high school students. Students found to be missing these fundamental skills are deemed “in need of intervention.” Research has indicated that these assessments have good concurrent and predictive validity using state test scores as a criterion, with validity coefficients for middle school students in the mid to upper .70’s and for high school students in the lower .60’s to .70’s. This assessment was also evaluated by independent researchers Codding and Connell (2010), who found that validity with state test scores (MCAS, Massachusetts state test) was good to excellent. Validity coefficients ranged from .61 for sixth grade to .80 for eighth grade. Research has indicated the reliability of the Math Concepts and Applications assessment is consistently above .85 and more frequently above .90. Reliability was estimated using coefficient alpha. In summary, the math assessments for grades 6-12 meet generally accepted scientific standards for reliability and validity.


The iSTEEP assessments and decision rules have undergone extensive research, testing and validation. They have also been tested extensively in schools for many years. Nevertheless, we suggest that iSTEEP tools be used as part of a process that includes a variety of assessments, with the data reviewed and decisions made by a team of professionals who reach a consensus after considering all available data. Important decisions should never be made based upon any single score.

In the early literature on STEEP, parts of the process have been variously labeled Problem-Validation Screening, Screening to Enhance Equitable Placement, and Screening to Enhance Educational Performance. All of these names refer to the same process.