With more organizations focusing on metrics, the MCC has received an increasing number of questions, ranging from how to use metrics and why some metrics are better than others to which type of metric is best to use, as well as questions about specific MCC Metrics. This column provides a forum for us to share these questions and answers with you.
Critical Success Factor (CSF): CSFs describe the desired outcome(s) of a process or relationship. For our TMF metrics, for example, the CSF is “The TMF contains artifacts that are of high quality and that are stored in the expected manner at all times.” This helps remind users of the process goal and the reason they might use the metric.
- Key Performance Question (KPQ). This is the specific question the metric is trying to answer. For example “Do the artifacts submitted for inclusion in the TMF meet the quality standard?”
- Why is the KPQ important? This shows why answering the KPQ is important for the Critical Success Factor of the process or relationship. It also describes the actions that might be taken based on the results of the metric.
- What the metric does not tell you. Metrics are very focused – they measure a specific aspect of a process or relationship. It is important to understand what they are not measuring. Knowing this, you can watch for misuse of a metric and for possible unintended consequences of using it. For example, a metric that focuses on quality might shift focus away from timeliness. By balancing a quality metric with a timeliness metric, we can avoid the negative impacts.
- Companion metrics. These are metrics we recommend you consider using alongside the metric. They look at other, related dimensions of the process such as timeliness or cycle time. They are grouped according to their level – whether site, country, study or portfolio level.
- Suggested performance target. This is to help users know what level of the metric MCC member organizations aim for. It helps put the results of a metric into context. (See Ask-the-Expert column “What is the purpose of the ‘performance targets’ listed in MCC metrics?”)
The MCC makes these additional details available to member organizations to assist with the implementation and proper use of the metrics.
As we mentioned in the previous answer, the underlying assumption of a T-test is that the data is normally distributed (in a “bell” shape). Unfortunately, this assumption is often not true. You can check using a graphical method such as a QQ plot [1]. In survey responses, for example, you may have a bimodal distribution (see CRO data in Figure 1). If there is a small but significant group who strongly disagree with a statement, then that might be an avenue to explore further: why do they strongly disagree when most agree or strongly agree? Using a T-test to compare an average value in this case would miss this important information.
Figure 1: Fictitious survey response data
Because of this concern, you should look at how the responses are distributed before using a T-test. The MCC Vendor Oversight Work Group has been looking at this issue and has decided to focus on the percentage that agree or strongly agree, the percentage that disagree or strongly disagree, as well as the average score.
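As a minimal sketch of that three-number summary, assume the responses are coded 1 (strongly disagree) to 5 (strongly agree). The coding, the function name `summarize_likert` and the sample data are illustrative assumptions, not MCC definitions.

```python
from statistics import mean

def summarize_likert(responses):
    """Summarize 1-5 Likert responses (1 = strongly disagree, 5 = strongly agree)."""
    n = len(responses)
    return {
        "pct_agree": 100 * sum(r >= 4 for r in responses) / n,     # agree or strongly agree
        "pct_disagree": 100 * sum(r <= 2 for r in responses) / n,  # disagree or strongly disagree
        "average": mean(responses),
    }

# Made-up responses: most agree, but a small group strongly disagrees.
sample = [5, 4, 4, 5, 4, 1, 1, 4, 5, 4]
summary = summarize_likert(sample)
print(summary)
```

Looking at the two percentages next to the average makes a small dissenting group visible even when the average looks acceptable.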
How do you determine if there is a significant difference when you compare the percentage that disagree or strongly disagree between sponsor and CRO responses?
You can use a test of proportions. You need to know the following information for both sponsor and CRO responses:
- Percentage who stated “strongly disagree” or “disagree”
- Number of respondents
In our example, only two out of 38 (5%) sponsor respondents and 14 out of 65 (22%) CRO respondents disagreed or strongly disagreed. You can use a simple online tool (or a statistical package) to determine if there is a statistical difference [2]. As with the T-test, you are looking for a p-value of less than 0.05. If you have that, then it is likely there is a real difference, and the difference is statistically significant at a 95% confidence level. Figure 2 shows a p-value of 0.0229, which is less than 0.05. Thus, there is a statistically significant difference between the sponsor and CRO respondents for this question.
Figure 2: Test of Proportions example
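If you prefer to script the comparison, one common form of the test of proportions is the pooled two-sample z-test, sketched here with the counts from the example. This is an assumption about the method: the online calculator may use a different variant or rounded percentages, so the p-value from this sketch need not match Figure 2 exactly. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # proportion if both groups were the same
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) # standard error under that assumption
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Sponsor: 2 of 38 disagreed; CRO: 14 of 65 disagreed (counts from the example).
z, p_value = two_proportion_z_test(2, 38, 14, 65)
significant = p_value < 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}, significant at 0.05: {significant}")
```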
The Test of Proportions is an example of a so-called non-parametric test; it does not assume that the underlying data is normally distributed. Other examples include tests comparing medians such as the Mann-Whitney Test.
The statistical tests help you understand what to focus on – what is likely to be noise and what is likely to be a real difference. Once you determine that there is likely to be a real difference, you need to decide whether the difference is enough for you to be concerned. If so, you should try to understand why the difference is there and what you can do to change it in the future. This is putting metrics (and statistics) to use.
1. Real Statistics Using Excel – QQ plot. Available at http://www.real-statistics.com/tests-normality-and-symmetry/graphical-tests-normality
2. Test of Proportions On-Line Calculator (free). Available at https://www.medcalc.org/calc/comparison_of_proportions.php
As mentioned in the answer to the CRO performance comparison question, there are a number of considerations to make sure you are comparing apples to apples. A comparison using a statistical test such as Student’s T-test (normally simply referred to as a “T-test”) determines whether it is likely that there is a significant difference between the means of the two groups of data. In the example, CRO A had a mean (and also a median) of nine calendar days and CRO B had a mean (and a median) of seven calendar days – but the spread of the data was much greater for CRO B. Figure 1 shows CRO A and CRO B data entry cycle time data as a histogram.
Figure 1: Histogram of Data Entry Cycle Time
The T-test has a key assumption, and if your data meet that assumption, it is a very powerful test for determining whether there is a statistically significant difference between two sets of independent data. The key assumption is that your data are from a normal distribution. You can check whether your data are likely to be normal (for example, with a QQ plot) [1]; if your data pass that check, you can apply the T-test.
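To illustrate what a QQ plot compares, the sketch below pairs each sorted observation with the value a fitted normal distribution would predict at the same position. If the data are roughly normal, the pairs fall close to a straight line when plotted. The data and the function name `qq_points` are made up for illustration.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Return (theoretical normal quantile, observed value) pairs for a QQ plot."""
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n are one common convention among several.
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]

cycle_times = [6, 7, 7, 8, 9, 9, 9, 10, 11, 12]  # made-up data entry cycle times
points = qq_points(cycle_times)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed}")
```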
There are different versions of the test depending on whether the spread of the data in the samples is considered to be the same or different. In our example, the spread is different. Having looked at a histogram of our data and a QQ plot, we might decide the data looks as though it is from a normal distribution. We can apply a T-test to compare the means, assuming the spread (the variances) are different in the two samples. This can be done in Excel or a statistical package such as Minitab.
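The unequal-variance version of the test is often called Welch's T-test. As a sketch under that assumption, the code below computes the test statistic and the approximate degrees of freedom from two made-up samples with means of nine and seven days. Turning the statistic into a P value needs the t distribution, which Excel or a statistical package provides and Python's standard library does not.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate (Welch-Satterthwaite) degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb  # squared standard errors
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

cro_a = [8, 9, 9, 10, 9, 8, 10, 9]   # made-up data, mean 9, narrow spread
cro_b = [2, 12, 7, 4, 11, 6, 9, 5]   # made-up data, mean 7, wide spread
t_stat, df = welch_t(cro_a, cro_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```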
When we run a T-test, we get a result such as the Excel output in Table 1.
Table 1: T-test Results for CRO A and CRO B Samples
The important item to look for in this table is the P value. It is less than 0.05. In fact, it is much less. The 0.05 threshold is the so-called alpha risk: the risk of the test telling us there is a difference when really there isn’t one. It corresponds to a 95% confidence level. So, we can be at least 95% confident that the two means are indeed different. In other words, based on this analysis, CRO B has a lower mean (average) than CRO A.
What happens when the data is not normally distributed? You have to use non-parametric tests instead of a T-test.
1. Real Statistics Using Excel – QQ plot. Available at http://www.real-statistics.com/tests-normality-and-symmetry/graphical-tests-normality
A: This raises a lot of interesting issues. Some questions you should consider are:
1. Are you comparing “apples to apples”? Are both CROs measuring the cycle time in the same way? MCC published an article and report on this in 2017 and if the CROs are measuring the cycle times themselves, they may be defining the cycle time differently.
2. As you are looking at CRO performance, it is likely that you are looking at data across multiple protocols. There are different ways to aggregate data – is the same method being used for CRO A and CRO B and is it fair? For example, is the data the median cycle time across protocols? Or the median cycle time across sites within a study?
3. The MCC uses median data for cycle times for reasons explained in a previous Ask The Expert question. Are both CROs using medians for the average? Using means? One using median and the other using mean?
4. Protocols vary in operational complexity – are you expecting the same performance regardless of complexity? The MCC Study Quality Trailblazer Work Group has recently developed a tool to help assess protocol operational complexity.
5. With your focus on centralized monitoring, not all data is of equal importance. It may make sense (if systems allow) to focus on the cycle time for getting critical data entered into EDC. Can the CROs provide cycle times for critical data?
6. There are often anomalies in data and these may have been removed by each CRO. But what data was removed from the calculations and why? Could the removal of data bias the result?
7. Speed of entry is important – but so is quality. Are you measuring other metrics that help provide a picture on performance such as query rate?
8. The spread of the data would also be interesting to know – a shorter average cycle time with a very wide spread might be problematic, for example. There might be a number of sites / protocols whose cycle times are far outside acceptable limits. Whereas a slightly longer average with a narrow spread might be preferable (see figure and table).

Figure: CRO B has a lower median than CRO A. But due to the high standard deviation (SD), CRO B has more sites/protocols with long cycle times (>12).

| Cycle time | CRO A | CRO B |
|---|---|---|
| ≤ 7 calendar days | 17% | 51% |
| 7 – 9 calendar days | 35% | 15% |
| 9 – 12 calendar days | 42% | 18% |
| ≥ 12 calendar days | 6% | 16% |
Table: The table shows the proportion of data within the different ranges of cycle times for CRO A and CRO B. As you can see, CRO B has more cycle times meeting the specification of seven days (51% compared with 17%) but it also has a higher proportion of cycle times greater than 12 days (16% compared with 6%).
Let’s assume you do have an “apples to apples” comparison and you have the data that led to the cycle time averages of nine for CRO A and seven for CRO B. You could now run a statistical test such as a T-test to compare the two (if the data is approximately normal). More likely, you would run a statistical test such as a Mood’s Median Test – this can be used when the data is not normal.
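For readers who want to see what Mood's Median Test does, here is a minimal two-sample sketch. It counts how many values in each group lie above the overall (grand) median and applies a chi-squared test to the resulting 2x2 table. The data are made up, no continuity correction is applied, and values equal to the grand median are counted as "not above". Statistical packages may make different choices, so treat the function as an illustration rather than a reference implementation.

```python
from math import erfc, sqrt
from statistics import median

def moods_median_test(sample_a, sample_b):
    """Mood's median test for two samples (no continuity correction)."""
    grand_median = median(list(sample_a) + list(sample_b))
    # 2x2 table: one row per sample, columns = (above grand median, at or below it).
    table = []
    for sample in (sample_a, sample_b):
        above = sum(x > grand_median for x in sample)
        table.append((above, len(sample) - above))
    total = len(sample_a) + len(sample_b)
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    # Note: assumes at least one value lies above the grand median (no empty column).
    chi2 = 0.0
    for row in table:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / total
            chi2 += (row[j] - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2))  # chi-squared survival function, 1 degree of freedom
    return grand_median, chi2, p_value

cro_a = [8.5, 9.0, 9.5, 10.0, 9.2, 8.8, 10.5, 9.1, 8.0, 9.4]   # made-up cycle times
cro_b = [3.0, 4.5, 6.0, 7.5, 5.0, 6.5, 12.0, 14.0, 4.0, 7.0]   # made-up cycle times
grand_median, chi2, p_value = moods_median_test(cro_a, cro_b)
print(f"grand median = {grand_median}, chi2 = {chi2:.2f}, p = {p_value:.4f}")
```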
A: The MCC Vendor Oversight Work Group has been examining this challenge. What’s the best way to roll up data across studies? It depends on the performance questions (Key Performance Questions) you are trying to answer. Let’s look at an example related to study milestones – suppose you want to know what proportion of study milestones have been met within expected timelines. Figure 1 shows the study milestones that met timeline targets across three studies.
Figure 1: Study Milestone Report
When you look at results across all three studies, you might ask, “What proportion of milestones have been met across studies?” To answer this question, you can add up all of the milestones that met timeliness expectations across studies and divide it by the total number of milestones that have been reached (met and not met). In this example, you met nine out of 16 milestones or 56.3%. We could, then, compare this with the same metric for studies from another CRO. [Table 1]
Another approach would be to use a target percentage and assess each study in a portfolio to determine if it reached the target – say 75%. In this case, you are seeking to answer a different question, “What proportion of studies met at least 75% of their milestones?” To answer this question, you would examine the proportion of studies that met the 75% target. In this example, only one of the three studies met the 75% threshold (study 2) so the result is 33.3%. [Table 1]
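The two roll-up approaches can be sketched side by side. The per-study counts below are hypothetical (the study-level figures are not reproduced in the text); they were chosen only so that the totals agree with the example: nine of 16 milestones met overall, and only study 2 reaching the 75% target.

```python
# Hypothetical per-study counts: (milestones met on time, milestones reached).
studies = {"Study 1": (3, 6), "Study 2": (4, 5), "Study 3": (2, 5)}

def pooled_proportion(studies):
    """Proportion of all reached milestones that were met on time."""
    met = sum(m for m, _ in studies.values())
    reached = sum(r for _, r in studies.values())
    return met / reached

def proportion_meeting_target(studies, target=0.75):
    """Proportion of studies whose own on-time proportion reaches the target."""
    hits = sum(m / r >= target for m, r in studies.values())
    return hits / len(studies)

pooled = pooled_proportion(studies)            # milestone-level view
by_study = proportion_meeting_target(studies)  # study-level view
print(f"{pooled:.1%} of milestones met; {by_study:.1%} of studies met the target")
```

The same underlying data give two quite different headline numbers, which is why the Key Performance Question should drive the choice of roll-up.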
Of course, the study level view is always going to be very valuable because the next question after looking at the rolled-up metric is going to be, “What is this metric telling me? Are studies performing similarly or are there some much better than others?” The devil is in the detail after all! In our example, there are some significant differences between the studies and you might want to try to understand why. Additionally, these metrics don’t tell you the following:
- How far was each milestone missed – 1 day or 3 months?
- Why were milestones missed?
- What is the impact of missing milestones?
- Were the milestones achievable in the first place?
For example, looking across the results, you see that the first milestone (Risk Management Plan) was missed for all three studies – why might this be? Is the time allowed for completing the milestone not realistic?
With a relatively small number of studies, there may be little value in rolling up the data. But with a large number of studies, the rolled-up view can be a valuable way to summarize results for busy executives. Table 1 shows the different approaches the MCC Vendor Oversight Work Group has been investigating. For additional information, please visit the MCC Vendor Oversight Work Group webpage.
Table 1: Metric Calculation Examples
A: The type of metric you use depends on the question you want the metric to help you answer. At the MCC, we define a Critical Success Factor (CSF) and one or more Key Performance Questions (KPQs) – questions that help you determine whether you are “on track” to achieve or did achieve the CSF – before deciding which metric(s) to use. The CSF describes the overall goal or desired outcomes of the process in question. For example, a CSF for Data Management/Biostatistics processes could be: “Clean and sufficient critical data are collected and analyzed to determine the safety and efficacy of the investigational product.” Next, we define KPQs, the questions related to the CSF that we want our metrics to help us answer. These questions determine whether we should define cost/efficiency, quality, timeliness and/or cycle time metrics. The two time-based metrics – timeliness and cycle time – answer different time-related questions. Timeliness measures whether an event occurred at the expected time (e.g. “on-time”) and cycle time measures how long it takes to complete a process. Figure 1 shows an example of a CSF and some KPQs we might want our metrics to help us answer.
Figure 1: Example CSF, KPQs and associated metric types
As you can see in the example, the KPQs are different for timeliness and cycle time. But why might you want to measure a process in these different ways? Figure 2 shows a comparison between timeliness and cycle time KPQs. If you are concerned about whether a study achieved the Database Lock milestone by the planned date, you would use the timeliness metric. For repeat occurrences such as data entry, cycle time and timeliness both may be useful. You can review the percentage of data entries completed by the timeliness target or calculate the actual data entry cycle times.
Figure 2: Comparison of timeliness and cycle time KPQs for study oversight
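To make the distinction concrete, the sketch below derives both metric types from the same records. The dates, the seven-day target and the variable names are assumptions chosen for illustration.

```python
from datetime import date
from statistics import median

# Hypothetical records: (visit date, data entry date).
entries = [
    (date(2024, 1, 2), date(2024, 1, 5)),
    (date(2024, 1, 3), date(2024, 1, 12)),
    (date(2024, 1, 8), date(2024, 1, 10)),
    (date(2024, 1, 9), date(2024, 1, 20)),
]
TARGET_DAYS = 7  # assumed timeliness target in calendar days

# Cycle time: how long did each data entry take?
cycle_times = [(entered - visit).days for visit, entered in entries]
median_cycle_time = median(cycle_times)

# Timeliness: what percentage of entries were completed within the target?
pct_on_time = 100 * sum(d <= TARGET_DAYS for d in cycle_times) / len(cycle_times)

print(f"median cycle time = {median_cycle_time} days, on time = {pct_on_time:.0f}%")
```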
If you are in the study-planning group, you may want to review the cycle time version of the metrics. Review of cycle times provides you with important information about how long it actually takes to complete the task so you can establish realistic assumptions in future studies.
Finally, process improvement teams often review both timeliness and cycle time metrics.
So, in answer to the question, the metric type will depend on the question you are trying to answer.
A: A histogram is another way of plotting data, like the box and whisker plot we discussed in last month’s column. [Figure 1] The box and whisker graph is used to show the shape of the distribution, the median value and variability.
Figure 1: Box and Whisker Plot of TMF Document Filing Cycle Times
The histogram graph provides you with a different view of the distribution. It groups data into like-sized buckets so you can see the distribution of the data and whether it is similar to a normal distribution curve.
By taking the raw cycle time data, we can group them into buckets of 20-day increments:
Figure 2: Cycle Times Grouped into 20 Calendar Day Buckets
Figure 2 shows that there are 13 data points less than 20 calendar days cycle time, 17 between 20 and 40 days and so on. When you plot the data using the groupings in Figure 2, you get the histogram chart shown in Figure 3.
Figure 3: Histogram of Cycle Time Values in 20 Calendar Day Buckets
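The bucketing step can be sketched in a few lines. The cycle times here are made up and do not reproduce the counts in Figure 2. The sketch also assumes that a value on a boundary (exactly 20 days, say) belongs to the higher bucket.

```python
from collections import Counter

WIDTH = 20  # bucket width in calendar days

def bucket_counts(values, width=WIDTH):
    """Count values per bucket; each bucket covers [start, start + width)."""
    counts = Counter((int(v) // width) * width for v in values)
    return dict(sorted(counts.items()))

cycle_times = [5, 12, 18, 22, 25, 31, 38, 39, 44, 57, 63, 95, 118]  # made-up data
histogram = bucket_counts(cycle_times)

# Simple text histogram: one '#' per data point in the bucket.
for start, count in histogram.items():
    print(f"{start:>3}-{start + WIDTH:<3} {'#' * count}")
```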
The histogram view of the data allows you to easily see the following:
- The approximate spread is 120 calendar days (0-120 days)
- The most common values are in the 20-40 grouping
What should you review with this histogram chart?
Graphing your data in a histogram chart might help you to identify possible outliers and whether the data has a normal distribution (i.e. whether it’s a “bell curve”). In this example, you can see some longer cycle times in the 60 to 120 day range that create a long tail going off to the right. [Figure 3] This long tail to the right makes the graph asymmetric or right skewed – in other words it doesn’t look like a bell curve. When the curve is skewed, the mean and median values are not the same, as they would be with a normal curve. This means that the median is probably a better indication of the average than the mean (see previous Ask the Expert Median vs. Mean column). In the Box and Whisker plot of the same data (Figure 1) the long tail shows as the longer “whisker” on the top of the box.
Where it can get particularly interesting is when you plot the data and find more than one peak. [Figure 4]
Figure 4: Histogram with Two Peaks
This indicates something underlying in the data such as two groups behaving differently. Perhaps one department is submitting documents with a cycle time of around 30 calendar days and another has a cycle time of around 90 calendar days. You would need to investigate the data further to see if you can understand why the histogram looks like this – a box and whisker showing the different departments might be a good start at uncovering what’s going on.
Note that the mean and median of the data for this histogram will be around 60 and yet 60 is not a good representation of the true average – because really there are two groups with their own means and medians at around 30 and 90 calendar days. If we only had the mean or median for the whole data, or only a Box and Whisker plot, we could be misled in this example. [Figure 5]
Figure 5: Figure 4 Data Displayed in a Box and Whisker Plot
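A small numeric sketch shows how the overall summary can mislead when there are two groups. The two departments and their cycle times are invented for illustration.

```python
from statistics import mean, median

dept_a = [26, 28, 30, 31, 33, 29, 32]   # made-up data, centred near 30 days
dept_b = [86, 89, 90, 92, 94, 88, 91]   # made-up data, centred near 90 days
combined = dept_a + dept_b

overall_mean = mean(combined)
overall_median = median(combined)
group_medians = {"A": median(dept_a), "B": median(dept_b)}

print(f"overall mean = {overall_mean:.1f}, overall median = {overall_median}")
print(f"group medians = {group_medians}")
```

Both overall figures land near 60 calendar days, yet no single document in this made-up data set took anywhere near 60 days.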
Excel 2007 and higher can generate histograms using the Analysis Toolpak.
A: A box and whisker plot is one of the many ways to display data. Developed by John W. Tukey, it works well when there is a lot of data and you want to take a look across different groups. For example, you might have data on the cycle times from document finalized to document published in the eTMF (a metric that the MCC TMF Work Group has recently defined) and want to see how the metric differs by responsible party.
The box and whisker plot has the main box that extends from the first to the third quartile (IQR) with a line showing the second quartile (the median).
Figure 1: Anatomy of a Box and Whisker Plot
The quartiles work similarly to the median we discussed previously in this column. The median is the value where 50% of data is above and 50% below. Similarly, the first quartile is the value of the data point that has 25% of data points below and 75% above. And the 3rd quartile has 75% of data below and 25% of data above. The upper and lower whiskers show the full extent of the data.
The height of the box or interquartile range (IQR), also called the mid-spread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles.1
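As an illustration, the quartiles and IQR can be computed with Python's standard library. The data are made up, and `method="inclusive"` is one of several quartile conventions; Excel and other tools may use a different one and give slightly different values.

```python
from statistics import quantiles

cycle_times = [12, 18, 21, 25, 30, 32, 35, 41, 47, 60, 95]  # made-up data

# quantiles(..., n=4) returns the three cut points: Q1, Q2 (median) and Q3.
q1, q2, q3 = quantiles(cycle_times, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```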
There are variations of the box and whisker that use rules to determine if particular values are outliers and they are typically shown as circles or asterisks. Also, the mean of the data is sometimes shown with a symbol such as an ‘x’. If you use Excel 2010 or later you can plot these charts easily as in the example below.
Apply the Concept
Figure 2: Box and Whisker Example
Let’s take a look at the box and whisker plots in Figure 2. Here we are showing cycle time data from document finalized to document published in the eTMF. A, B and C are different responsible parties.
How do the median values align?
The overall median is 32 calendar days. The median value of responsible party B is lower than either A or C, although the medians for B and C look similar. Responsible party A has the highest median (longest cycle time).
How do the IQRs compare?
The difference between Q3 and Q1 (the height of the blue boxes) varies among the four plots. C has the largest range of cycle times and B has the smallest range. Since C has a large variation, it might be worth exploring why the values of that responsible party are so variable compared to B. Are there some document types that are being published quickly whilst others are taking a considerable time?
How do the whiskers (max and min non-outlier values) compare?
Responsible party A has the highest minimum value and C has the highest maximum value. Again, responsible party C clearly has a wide variation of cycle times.
Putting it all together
Before going further, you might want to carry out statistical tests to compare the data to see if differences are significant and perhaps review the outliers. But without doing that, the box and whisker plot has given us a strong indication that there are some real differences between the performance of the responsible parties – A, B and C. When you consider both the median value and the IQR, you can see that B is the top performer of the group. It has both the lowest median and the smallest range or variation in values. You might want to try to understand what B is doing and see if those “best practices” can be applied by A and C.
A previously posted Ask the Expert column examined the difference between median and mean values. As you can see in the above box and whisker plots, the mean (the ‘x’) is typically greater than the median (the line). This is often the case with cycle time data, which is why the median is often a better summary statistic for this type of data.
A: Each MCC metric includes a written description of the metric, a formula and a performance target. The purpose of the performance target is to establish performance expectations or a level of performance that is acceptable for that particular time, cost or quality measurement. Without a target, it is difficult to interpret the results and determine whether additional action is required – the target provides the context in which to interpret the results.
MCC Work Group participants define performance targets as part of the metric development process. Many of the MCC metrics developed by the Clinical Operations Sub-Group have green-amber-red performance levels. Results that fall into the “green zone” are good results; results in the “amber zone” are in the “to be watched” grouping, as they fall outside of good results but don’t need immediate action steps; and results in the “red zone” are poor results that require action steps.
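As a sketch of how green-amber-red levels might be applied to results, assume a “lower is better” metric such as a cycle time. The thresholds of 7 and 12 calendar days and the function name are hypothetical, not MCC performance targets.

```python
def rag_status(value, green_max, amber_max):
    """Classify a 'lower is better' result into green, amber or red."""
    if value <= green_max:
        return "green"   # good result
    if value <= amber_max:
        return "amber"   # to be watched
    return "red"         # requires action steps

# Hypothetical thresholds for a cycle time metric, in calendar days.
results = (5, 9, 15)
statuses = [rag_status(d, green_max=7, amber_max=12) for d in results]
print(dict(zip(results, statuses)))
```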
Some MCC metrics do not have standardized performance targets because targets varied by therapeutic area.
A: When looking to summarize data people often use the mean (also termed the common average). This works well when the distribution of data is even – looks something like a normal curve. But often data is not normally distributed in this way. This is particularly true when measuring cycle times. These tend to have a low peak and then a long tail. The long tail impacts the mean such that it can be a long way off the peak of the distribution.
For example, if the cycle times for completing Monitoring Visit Reports are 5, 9, 10, 10, 10, 11, 11, 24, and 56 days, the Mean is 16.2. But when we look at the numbers, is 16.2 a good representation when 7 out of the 9 cycle times are less than 16.2? This is where the median works better. The middle value in this example is 10 and by definition, half are below and half above. [Figure 1]
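The arithmetic in this example can be checked with a few lines of Python, using the nine cycle times given above.

```python
from statistics import mean, median

cycle_times = [5, 9, 10, 10, 10, 11, 11, 24, 56]  # days, from the example

avg = mean(cycle_times)       # pulled upward by the long tail (24 and 56)
mid = median(cycle_times)     # the middle value of the sorted data
below_mean = sum(t < avg for t in cycle_times)

print(f"mean = {avg:.1f}, median = {mid}, values below the mean = {below_mean} of {len(cycle_times)}")
```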
There are non-parametric statistical tests that you can run comparing medians (as you can with a T-test for comparing means). They are not as powerful as the T-test but do not rely on the assumption of an underlying normal curve.
A word of caution about the term “average.” Most people use “average” and “mean” interchangeably. However, the definition is ambiguous – it can refer to the median or the mode of the data, too. Here’s a dictionary entry on the term.