No field or profession can prescribe what is or isn't validity or efficacy evidence. Apart from accuracy, what counts as convincing evidence is a personal decision. This isn't news. Stephen Toulmin, Samuel Messick, and Michael Kane have all rejected the use of universal criteria to evaluate validity and efficacy evidence. Different fields have distinctive ways of asking questions, addressing a literature, criticizing ideas, and presenting arguments. The differences make sense from a socio-cultural perspective, as each of these stakeholders (you might call them customers if you are in marketing, or users if you are in product development) comes from a different culture of evidence and set of life experiences. But this seems chaotic if you are trying to collect validity or efficacy evidence and construct an argument with it.
Examples of Different Cultures of Evidence
New market data on educators' buying decisions provides an example of what is sensible from a socio-cultural perspective but chaotic from a validity/efficacy argument perspective. District and school leaders were asked which of their peers strongly influenced their buying decisions. Most said they were influenced by peers in their own school system. But about 1 in 4 said their buying decisions were influenced by feedback from peers in districts within 50 miles of their own. How does a company collect personal testimonials from nearby peers and build that evidence into validity/efficacy arguments? Does it make sense to try?
Though labeled market data, the findings on educators' buying decisions reminded us of a study we did several years ago. We asked psychometricians, teachers, parents, policymakers, and lawyers to "evaluate how relevant the evidence is to you." Our stakeholders could say "Not at all (1)," "Somewhat (2)," "Fairly (3)," "Very (4)," and "Not Sure." The 27 examples of evidence concerned test content, response process, internal structure, relationships to other variables, or consequences of testing. The scenarios, which varied in stakes, were (1) a college entrance examination (high-stakes for the student), (2) a formative assessment (low-stakes for both the student and the teacher), and (3) a teacher evaluation (high-stakes for the teacher). We found that how relevant our stakeholders felt the evidence was to them depended on their role, the scenario, and the kind of evidence.
- Psychometricians tended to rate all evidence as less relevant than other stakeholders. But most relevant for psychometricians was evidence looking at the predictive nature of the assessment with related long-term outcomes and showing how items perform for students of similar abilities who are of different racial/ethnic backgrounds (i.e., differential item functioning).
- Lawyers felt the most relevant evidence was evidence from students with disabilities, English learners, and different racial/ethnic backgrounds. Most relevant for lawyers were interviews with small groups of students with and without disabilities to determine whether students are using the same strategies in testing; similar interviews with small groups of English learners and non-English learners; and studies showing how items perform for students of similar abilities who are of different racial/ethnic backgrounds (i.e., differential item functioning).
- Policymakers felt the most relevant evidence was evidence that item writers follow an item writing guide, evidence of how scores from the assessment are used to evaluate programs, and studies looking at differential item functioning.
- Teachers felt the most relevant evidence was studies looking at differential item functioning and evidence that items are written by current and former educators, following an item writing guide.
- Parents were the only stakeholders who felt evidence of parent attitudes towards the assessment was highly relevant.
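To make the shape of these findings concrete, here is a minimal sketch of how relevance ratings like the study's could be tallied by stakeholder role and evidence category. All of the data below is invented for illustration; it is not the study's data, and the role and category labels are only examples.

```python
# Hypothetical sketch: averaging stakeholder relevance ratings by
# (role, evidence category). The ratings here are invented examples
# on the study's scale: Not at all (1) ... Very (4); "Not Sure" excluded.
from collections import defaultdict

ratings = [
    # (role, evidence_category, rating)
    ("psychometrician", "internal structure", 4),
    ("psychometrician", "consequences of testing", 2),
    ("lawyer", "response process", 4),
    ("lawyer", "internal structure", 4),
    ("parent", "consequences of testing", 4),
    ("parent", "internal structure", 1),
]

def mean_relevance(rows):
    """Average rating for each (role, evidence category) pair."""
    sums = defaultdict(lambda: [0, 0])  # key -> [running total, count]
    for role, category, rating in rows:
        cell = sums[(role, category)]
        cell[0] += rating
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

for (role, category), avg in sorted(mean_relevance(ratings).items()):
    print(f"{role:16s} {category:24s} {avg:.1f}")
```

A table of these averages, crossed with scenario, is essentially what lets you see that the same evidence lands differently with different stakeholder groups.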
We know that many of our participants probably shared membership in more than one stakeholder group. We would have loved to know if, for example, an individual’s pattern of responses might have differed if asked to respond “as a lawyer” or “as a parent.” Responding as a lawyer, the evidence of parent attitudes towards the assessment might have been weakly relevant, while for the same individual responding as a parent, parent attitudes might have been strongly relevant.
Evidence requirements under the Every Student Succeeds Act and the educational research catalogued in the What Works Clearinghouse (WWC) tell us about the quality of existing evidence on educational programs, products, practices, and policies. After all, the goal of the WWC is to help people make evidence-based decisions. But a statistically significant positive effect from a randomized controlled trial with data from more than 350 students at two sites seems less relevant to the concerns of an educational administrator than personal testimonials from nearby peers working in schools and communities like theirs. Statistical significance doesn't capture the concerns of a parent interested in what other parents of struggling children like theirs think of the product. The evidence in the WWC is just a narrow slice of the potential evidence educational administrators and parents find relevant. And maybe not the most convincing evidence.
Development of Validity/Efficacy Arguments
How do we develop validity/efficacy arguments that are relevant for different stakeholders? One option is to annotate graphs and outlines, borrowed from the legal literature as we describe in our blog, Effectively Developing a Validity Argument, to include the validity/efficacy evidence that different stakeholders find relevant and convincing. Figure 1 shows a fragment of a graph, used in that blog, of the microstructure of the validity argument for the interpretation and use of scores from the AP World History test. At a glance, you can see teachers felt highly relevant evidence was that all items were written by teachers with at least five years of experience in the grade band and that teachers were trained on and followed the item writing guide. But none of the five stakeholder groups found two pieces of evidence relevant: models of teacher pedagogical content knowledge and studies showing teachers with at least five years of classroom experience have well-developed pedagogical knowledge.
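This kind of annotation can be sketched as a simple data structure. The sketch below is hypothetical: the node contents are loosely modeled on the Figure 1 fragment, and the stakeholder tags are invented for illustration, not taken from the study.

```python
# Hypothetical sketch: nodes in a validity-argument graph, each annotated
# with the stakeholder groups who find that evidence relevant.
from dataclasses import dataclass, field

@dataclass
class EvidenceNode:
    claim: str      # the assertion this evidence supports
    evidence: str   # description of the evidence itself
    relevant_to: set = field(default_factory=set)  # stakeholder annotations

# Fragment loosely modeled on the Figure 1 microstructure (labels invented).
nodes = [
    EvidenceNode(
        claim="Item writers are qualified",
        evidence="All items written by teachers with 5+ years in the grade band",
        relevant_to={"teachers"},
    ),
    EvidenceNode(
        claim="Item writers are qualified",
        evidence="Models of teacher pedagogical content knowledge",
        relevant_to=set(),  # no stakeholder group rated this relevant
    ),
]

def evidence_for(stakeholder, graph):
    """Return the evidence a given stakeholder group finds relevant."""
    return [n.evidence for n in graph if stakeholder in n.relevant_to]
```

Filtering the graph by stakeholder is then a one-line query, which is the "at a glance" property the annotated graph is meant to provide.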
Figure 1. A fragment of the microstructure of the validity argument for the interpretation and use of scores from the AP World History test.
Graphs and outlines, on their own, are difficult for most audiences to understand. They have to be further developed into narratives, thesis statements, and themes, as we describe in Effectively Communicating a Complex Validity Argument and Themes and Theses: More Tools to Effectively Communicate a Complex Validity Argument, to effectively communicate the validity/efficacy argument to the customers or stakeholders. But the evidence comes first.
Bad News for Companies
You’re too late if you start thinking about planning and collecting evidence relevant to stakeholders after the assessment or product has been developed. The “market research” must be done first to know what evidence is important and for whom. For example, a national survey of K-12 administrators, principals, and teachers asked: When a company has failed to secure teachers’ buy-in for a product in your district or school, what are the most common reasons for that failure? The most common reason, given by 42%, was that teachers were not directly involved in planning product implementation. The importance of teacher involvement in product design and development has to be known early and evidence planned for and collected from the outset.
This seemingly chaotic view of validity and efficacy might be bad news for traditional learning and assessment companies. A socio-cultural approach to validity/efficacy evidence blurs the lines and requires cooperation across departments—marketing, product development, IT, and research. The evidence relevant to stakeholders must be coordinated across business models, product development roadmaps, workflows, and research designs. This kind of cooperation and coordination is often lacking in company culture but is essential to provide the types of evidence stakeholders want and need to trust the product.