With more organizations focusing on metrics, the MCC has received a growing number of questions: how to use metrics, why some metrics are better than others, which types of metrics are best to use, and how specific MCC Metrics work. This column provides a forum for us to share these questions and answers with you.
A: This raises a lot of interesting issues. Some questions you should consider are:
1. Are you comparing “apples to apples”? Are both CROs measuring the cycle time in the same way? MCC published an article and report on this in 2017. If the CROs are measuring the cycle times themselves, they may be defining the cycle time differently.
2. As you are looking at CRO performance, it is likely that you are looking at data across multiple protocols. There are different ways to aggregate data – is the same method being used for CRO A and CRO B and is it fair? For example, is the data the median cycle time across protocols? Or the median cycle time across sites within a study?
3. The MCC uses median data for cycle times for reasons explained in a previous Ask The Expert question. Are both CROs reporting the median as their average, are both reporting the mean, or is one using the median and the other the mean?
4. Protocols vary in operational complexity – are you expecting the same performance regardless of complexity? The MCC Study Quality Trailblazer Work Group has recently developed a tool to help assess protocol operational complexity.
5. With your focus on centralized monitoring, not all data is of equal importance. It may make sense (if systems allow) to focus on the cycle time for getting critical data entered into EDC. Can the CROs provide cycle times for critical data?
6. There are often anomalies in data and these may have been removed by each CRO. But what data was removed from the calculations and why? Could the removal of data bias the result?
7. Speed of entry is important – but so is quality. Are you measuring other metrics, such as query rate, that help provide a fuller picture of performance?
8. The spread of the data would also be interesting to know – a shorter average cycle time with a very wide spread might be problematic, for example. There might be a number of sites / protocols whose cycle times are far outside acceptable limits. Whereas a slightly longer average with a narrow spread might be preferable (see figure and table).
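Questions 2 and 3 above are easier to see with numbers. The following Python sketch uses made-up site-level cycle times (the protocol names and values are illustrative assumptions, not MCC data) to show how the same data can produce different "averages" depending on how it is rolled up.

```python
from statistics import mean, median

# Hypothetical site-level data entry cycle times (calendar days) per protocol.
site_cycle_times = {
    "protocol_1": [4, 5, 6, 7, 30],
    "protocol_2": [8, 9, 10],
    "protocol_3": [6, 7, 8, 9, 10, 11, 12],
}

# Method 1: pool every site together, then take the median.
all_sites = [days for sites in site_cycle_times.values() for days in sites]
pooled_median = median(all_sites)

# Method 2: take the median within each protocol, then the median of those.
protocol_medians = {p: median(sites) for p, sites in site_cycle_times.items()}
median_of_medians = median(protocol_medians.values())

# Method 3: pool every site together and take the mean.
pooled_mean = mean(all_sites)

print(f"pooled median:     {pooled_median}")
print(f"median of medians: {median_of_medians}")
print(f"pooled mean:       {pooled_mean:.2f}")
```

All three results summarize the same invented data, yet they differ, which is why it matters that CRO A and CRO B report the same calculation.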
Figure: CRO B has a lower median than CRO A. But due to the high standard deviation (SD), CRO B has more sites/protocols with long cycle times (>12).
| Cycle time | CRO A | CRO B |
|---|---|---|
| ≤ 7 calendar days | 17% | 51% |
| 7 – 9 calendar days | 35% | 15% |
| 9 – 12 calendar days | 42% | 18% |
| ≥ 12 calendar days | 6% | 16% |
Table: The table shows the proportion of data within the different ranges of cycle times for CRO A and CRO B. As you can see, CRO B has more cycle times meeting the specification of seven days (51% compared with 17%) but it also has a higher proportion of cycle times greater than 12 days (16% compared with 6%).
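A breakdown like the one in the table can be produced from raw cycle times with a few lines of Python. This is a minimal sketch under stated assumptions: the two data sets are invented for illustration, and because the range labels in the table share their boundary values, the code assumes each upper boundary belongs to the lower range.

```python
from statistics import median, stdev


def share_by_range(cycle_times):
    """Return the percentage of cycle times (calendar days) in each range.

    Assumption: a value exactly on a boundary counts in the lower range.
    """
    counts = {"<= 7": 0, "> 7 to 9": 0, "> 9 to 12": 0, "> 12": 0}
    for days in cycle_times:
        if days <= 7:
            counts["<= 7"] += 1
        elif days <= 9:
            counts["> 7 to 9"] += 1
        elif days <= 12:
            counts["> 9 to 12"] += 1
        else:
            counts["> 12"] += 1
    return {label: 100 * n / len(cycle_times) for label, n in counts.items()}


# Hypothetical cycle times, chosen so the medians are 9 and 7 days.
cro_a = [6, 8, 8, 9, 9, 9, 10, 10, 11, 13]
cro_b = [3, 4, 5, 6, 7, 7, 8, 14, 16, 20]

for name, data in (("CRO A", cro_a), ("CRO B", cro_b)):
    print(name, "median:", median(data), "SD:", round(stdev(data), 1))
    print("  ", share_by_range(data))
```

In this invented example, the CRO with the lower median also has the larger share of cycle times above 12 days, which is the pattern the figure and table describe.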
Let’s assume you do have an “apples to apples” comparison and you have the data behind the average cycle times of nine days for CRO A and seven days for CRO B. You could now run a statistical test such as a t-test to compare the two (if the data is approximately normal). More likely, you would run a test such as Mood’s Median Test, which can be used when the data is not normal.
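To make the idea concrete, here is a hand-rolled sketch of a two-sample Mood's median test in Python using only the standard library. It is an illustration under stated assumptions, not a validated statistical routine: the samples are invented, values equal to the grand median are counted as "not above" it, and no continuity correction is applied. Statistical packages offer ready-made versions (for example `scipy.stats.median_test`) whose defaults may differ from this sketch.

```python
import math
from statistics import median


def moods_median_test(sample_a, sample_b):
    """Two-sample Mood's median test without continuity correction.

    Returns (chi_square, p_value, grand_median).
    """
    grand_median = median(list(sample_a) + list(sample_b))

    # 2x2 table: for each sample, count values above the grand median
    # and values at or below it.
    table = []
    for sample in (sample_a, sample_b):
        above = sum(1 for x in sample if x > grand_median)
        table.append((above, len(sample) - above))

    total = len(sample_a) + len(sample_b)
    col_totals = (table[0][0] + table[1][0], table[0][1] + table[1][1])
    if 0 in col_totals:
        raise ValueError("all values fall on one side of the grand median")

    # Pearson chi-square statistic on the 2x2 table.
    chi_square = 0.0
    for row in table:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            chi_square += (observed - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value, grand_median


# Hypothetical cycle times (calendar days) with medians of 9 and 7.
cro_a = [6, 7, 8, 8, 9, 9, 10, 10, 11, 12]
cro_b = [3, 4, 5, 6, 7, 7, 8, 9, 14, 20]

chi_square, p_value, grand_median = moods_median_test(cro_a, cro_b)
print(f"grand median={grand_median}, chi-square={chi_square:.2f}, p={p_value:.3f}")
```

The result depends entirely on the invented samples, so it says nothing about real CROs. It does show that a two-day difference in medians is not automatically statistically significant, especially with few data points.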
A: The MCC Vendor Oversight Work Group has been examining this challenge. What’s the best way to roll up data across studies? It depends on the performance questions (Key Performance Questions) you are trying to answer. Let’s look at an example related to study milestones – suppose you want to know what proportion of study milestones have been met within expected timelines. Figure 1 shows the study milestones that met timeline targets across three studies.
Figure 1: Study Milestone Report
When you look at results across all three studies, you might ask, “What proportion of milestones have been met across studies?” To answer this question, you can add up all of the milestones that met timeliness expectations across studies and divide that sum by the total number of milestones that have been reached (met and not met). In this example, you met nine out of 16 milestones, or 56.3%. You could then compare this with the same metric for studies from another CRO. [Table 1]
Another approach would be to use a target percentage and assess each study in a portfolio to determine if it reached the target – say 75%. In this case, you are seeking to answer a different question, “What proportion of studies met at least 75% of their milestones?” To answer this question, you would examine the proportion of studies that met the 75% target. In this example, only one of the three studies met the 75% threshold (study 2) so the result is 33.3%. [Table 1]
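The two roll-up approaches can be sketched in Python as follows. The per-study counts are assumptions: they are hypothetical values chosen only to be consistent with the totals quoted in the text (nine of 16 milestones met, and only study 2 reaching the 75% target), because the underlying figure is not reproduced here.

```python
TARGET = 0.75  # target share of milestones met per study, from the text

# Hypothetical per-study counts consistent with the totals in the text.
milestones = {
    "study_1": {"met": 3, "reached": 6},
    "study_2": {"met": 4, "reached": 5},
    "study_3": {"met": 2, "reached": 5},
}

# Approach 1: proportion of all milestones met across studies.
total_met = sum(s["met"] for s in milestones.values())
total_reached = sum(s["reached"] for s in milestones.values())
pct_milestones_met = 100 * total_met / total_reached

# Approach 2: proportion of studies that met at least the target.
studies_on_target = [
    name for name, s in milestones.items() if s["met"] / s["reached"] >= TARGET
]
pct_studies_on_target = 100 * len(studies_on_target) / len(milestones)

print(f"{total_met} of {total_reached} milestones met: {pct_milestones_met}%")
print(f"studies at or above target: {studies_on_target} "
      f"({pct_studies_on_target:.2f}% of studies)")
```

The two percentages answer different questions, so they should not be compared with each other, only with the same calculation for another portfolio.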
Of course, the study level view is always going to be very valuable because the next question after looking at the rolled-up metric is going to be, “What is this metric telling me? Are studies performing similarly or are there some much better than others?” The devil is in the detail after all! In our example, there are some significant differences between the studies and you might want to try to understand why. Additionally, these metrics don’t tell you the following:
- How far was each milestone missed – 1 day or 3 months?
- Why were milestones missed?
- What is the impact of missing milestones?
- Were the milestones achievable in the first place?
For example, looking across the results, you see that the first milestone (Risk Management Plan) was missed for all three studies – why might this be? Is the time allowed for completing the milestone not realistic?
With a relatively small number of studies, there may be little value in rolling up the data. But with a large number of studies, the rolled-up view can be a valuable way to summarize results for busy executives. Table 1 shows the different approaches the MCC Vendor Oversight Work Group has been investigating. For additional information, please visit the MCC Vendor Oversight Work Group webpage.
Table 1: Metric Calculation Examples
A: The type of metric you use depends on the question you want the metric to help you answer. At the MCC, we define a Critical Success Factor (CSF) and one or more Key Performance Questions (KPQs) – questions that help you determine whether you are “on track” to achieve or did achieve the CSF – before deciding which metric(s) to use. The CSF describes the overall goal or desired outcomes of the process in question. For example, a CSF for Data Management/Biostatistics processes could be: “Clean and sufficient critical data are collected and analyzed to determine the safety and efficacy of the investigational product.” Next, we define KPQs, the questions related to the CSF that we want our metrics to help us answer. These questions determine whether we should define cost/efficiency, quality, timeliness and/or cycle time metrics. The two time-based metrics – timeliness and cycle time – answer different time-related questions. Timeliness measures whether an event occurred at the expected time (e.g. “on-time”) and cycle time measures how long it takes to complete a process. Figure 1 shows an example of a CSF and some KPQs we might want our metrics to help us answer.
Figure 1: Example CSF, KPQs and associated metric types
As you can see in the example, the KPQs are different for timeliness and cycle time. But why might you want to measure a process in these different ways? Figure 2 shows a comparison between timeliness and cycle time KPQs. If you are concerned about whether a study achieved the Database Lock milestone by the planned date, you would use the timeliness metric. For repeat occurrences such as data entry, cycle time and timeliness both may be useful. You can review the percentage of data entries completed by the timeliness target or calculate the actual data entry cycle times.
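For a repeated activity such as data entry, both views can be calculated from the same records. The sketch below is illustrative only: the dates are invented and the seven-day target is a hypothetical value, not an MCC performance target.

```python
from datetime import date
from statistics import median

TARGET_DAYS = 7  # hypothetical timeliness target in calendar days

# (subject visit date, date data was entered in EDC): invented records.
entries = [
    (date(2024, 3, 1), date(2024, 3, 4)),
    (date(2024, 3, 1), date(2024, 3, 6)),
    (date(2024, 3, 5), date(2024, 3, 12)),
    (date(2024, 3, 8), date(2024, 3, 18)),
    (date(2024, 3, 11), date(2024, 4, 2)),
]

# Cycle time: how long did each data entry actually take?
cycle_times = [(entered - visit).days for visit, entered in entries]
median_cycle_time = median(cycle_times)

# Timeliness: what share of entries were completed within the target?
on_time = sum(1 for days in cycle_times if days <= TARGET_DAYS)
pct_on_time = 100 * on_time / len(cycle_times)

print("cycle times (days):", cycle_times)
print("median cycle time:", median_cycle_time)
print(f"on time: {pct_on_time}%")
```

The timeliness figure tells you how often the target was met, while the cycle times show how long the process really takes, including how late the late entries were.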
Figure 2: Comparison of timeliness and cycle time KPQs for study oversight
If you are in the study-planning group, you may want to review the cycle time version of the metrics. Review of cycle times provides you with important information about how long it actually takes to complete the task so you can establish realistic assumptions in future studies.
Finally, process improvement teams often review both timeliness and cycle time metrics.
So, in answer to the question, the metric type will depend on the question you are trying to answer.
A: A histogram is another way of plotting data, like the box and whisker plot we discussed in last month’s column. [Figure 1] The box and whisker graph is used to show the shape of the distribution, the median value and variability.
Figure 1: Box and Whisker Plot of TMF Document Filing Cycle Times
The histogram graph provides you with a different view of the distribution. It groups data into like-sized buckets so you can see the distribution of the data and whether it is similar to a normal distribution curve.
By taking the raw cycle time data, we can group it into buckets of 20-day increments:
Figure 2: Cycle Times Grouped into 20 Calendar Day Buckets
Figure 2 shows that there are 13 data points with a cycle time of less than 20 calendar days, 17 between 20 and 40 days, and so on. When you plot the data using the groupings in Figure 2, you get the histogram chart shown in Figure 3.
Figure 3: Histogram of Cycle Time Values in 20 Calendar Day Buckets
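The grouping step behind a histogram is simple enough to sketch in Python. The cycle times below are invented and do not reproduce the data behind the figures. The code assumes that a value exactly on a boundary (for example 20) falls into the higher bucket.

```python
from collections import Counter

BUCKET_WIDTH = 20  # calendar days per bucket, as in the text


def bucket_counts(cycle_times, width=BUCKET_WIDTH):
    """Count values per bucket, keyed by the lower edge of each bucket."""
    counts = Counter((days // width) * width for days in cycle_times)
    return dict(sorted(counts.items()))


# Hypothetical TMF document filing cycle times in calendar days.
cycle_times = [5, 12, 18, 22, 25, 31, 33, 38, 39, 41, 47, 55, 63, 88, 115]

# Print a simple text histogram, one row per bucket.
for lower, count in bucket_counts(cycle_times).items():
    label = f"{lower}-{lower + BUCKET_WIDTH}"
    print(f"{label:>8} | {'#' * count} ({count})")
```

Buckets with no data are simply absent from the result in this sketch. A charting tool would normally show them as empty bars.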
The histogram view of the data allows you to easily see the following:
- The approximate spread is 120 calendar days (0-120 days)
- The most common values are in the 20-40 grouping
What should you review with this histogram chart?
Graphing your data in a histogram chart might help you to identify possible outliers and whether the data has a normal distribution (i.e. whether it’s a “bell curve”). In this example, you can see some longer cycle times in the 60 to 120 day range that create a long tail going off to the right. [Figure 3] This long tail to the right makes the graph asymmetric or right skewed – in other words it doesn’t look like a bell curve. When the curve is skewed, the mean and median values differ, whereas in a normal curve they are the same. This means that the median is probably a better indication of the average than the mean (see the previous Ask the Expert Median vs. Mean column). In the Box and Whisker plot of the same data (Figure 1) the long tail shows as the longer “whisker” on the top of the box.
Where it can get particularly interesting is when you plot the data and find more than one peak. [Figure 4]
Figure 4: Histogram with Two Peaks
This indicates something underlying in the data such as two groups behaving differently. Perhaps one department is submitting documents with a cycle time of around 30 calendar days and another has a cycle time of around 90 calendar days. You would need to investigate the data further to see if you can understand why the histogram looks like this – a box and whisker showing the different departments might be a good start at uncovering what’s going on.
Note that the mean and median of the data for this histogram will be around 60 and yet 60 is not a good representation of the true average – because really there are two groups with their own means and medians at around 30 and 90 calendar days. If we only had the mean or median for the whole data, or only a Box and Whisker plot, we could be misled in this example. [Figure 5]
Figure 5: Figure 4 Data Displayed in a Box and Whisker Plot
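A small numeric illustration of the two-peak problem, using invented cycle times for two hypothetical departments clustered around 30 and 90 calendar days:

```python
from statistics import mean, median

# Hypothetical cycle times (calendar days) for two departments.
dept_1 = [26, 28, 30, 30, 32, 34]
dept_2 = [86, 88, 90, 90, 92, 94]
combined = dept_1 + dept_2

overall_mean = mean(combined)
overall_median = median(combined)

print("overall mean / median:", overall_mean, overall_median)
print("dept 1 mean / median: ", mean(dept_1), median(dept_1))
print("dept 2 mean / median: ", mean(dept_2), median(dept_2))

# No individual value is anywhere near the overall "average".
closest_gap = min(abs(days - overall_mean) for days in combined)
print("smallest distance from the overall mean:", closest_gap)
```

Both overall summary values land in the gap between the two groups, where no document was actually filed, which is why splitting the data by group is the more informative view.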
Excel 2007 and higher can generate histograms using the Analysis ToolPak add-in.
A: A box and whisker plot is one of the many ways to display data. Developed by John W. Tukey, it works well when there is a lot of data and you want to take a look across different groups. For example, you might have data on the cycle times from document finalized to document published in the eTMF (a metric that the MCC TMF Work Group has recently defined) and want to see how the metric differs by responsible party.
The box and whisker plot has the main box that extends from the first to the third quartile (IQR) with a line showing the second quartile (the median).
Figure 1: Anatomy of a Box and Whisker Plot
The quartiles work similarly to the median we discussed previously in this column. The median is the value where 50% of data is above and 50% below. Similarly, the first quartile is the value of the data point that has 25% of data points below and 75% above. And the 3rd quartile has 75% of data below and 25% of data above. The upper and lower whiskers show the full extent of the data.
The height of the box is the interquartile range (IQR), also called the mid-spread, the middle 50% or, technically, the H-spread. It is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles. [1]
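The quartiles and IQR can be calculated with Python's standard library, as in the sketch below. The data is invented, and two choices are assumptions rather than anything prescribed by the MCC. First, quartiles are computed with the "inclusive" method, and other tools may use slightly different conventions and give slightly different values. Second, the outlier rule shown is the commonly used 1.5 × IQR rule associated with Tukey, one of the variations that flag outliers.

```python
from statistics import quantiles

# Hypothetical cycle times in calendar days, sorted for readability.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 200]

# Quartiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Assumed outlier rule: values beyond 1.5 x IQR from the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# With this rule, whiskers extend to the most extreme non-outlier values.
non_outliers = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(non_outliers), max(non_outliers)

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Without an outlier rule, the whiskers would instead run from the minimum to the maximum of the data.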
There are variations of the box and whisker that use rules to determine if particular values are outliers and they are typically shown as circles or asterisks. Also, the mean of the data is sometimes shown with a symbol such as an ‘x’. If you use Excel 2010 or later you can plot these charts easily as in the example below.
Apply the Concept
Figure 2: Box and Whisker Example
Let’s take a look at the box and whisker plots in Figure 2. Here we are showing cycle time data from document finalized to document published in the eTMF. A, B and C are different responsible parties.
How do the median values align?
The overall median is 32 calendar days. The median value of responsible party B is lower than that of either A or C, although the medians for B and C look similar. Responsible party A has the highest median (longest cycle time).
How do the IQRs compare?
The difference between Q3 and Q1 (the height of the blue boxes) varies among the four plots. C has the largest range of cycle times and B has the smallest range. Since C has a large variation, it might be worth exploring why the values for that responsible party are so variable compared to B. Are there some document types that are being published quickly whilst others are taking a considerable time?
How do the whiskers (max and min non-outlier values) compare?
Responsible party A has the highest minimum value and C has the highest maximum value. Again, responsible party C clearly has a wide variation of cycle times.
Putting it all together
Before going further, you might want to carry out statistical tests to compare the data to see if differences are significant and perhaps review the outliers. But without doing that, the box and whisker plot has given us a strong indication that there are some real differences between the performance of the responsible parties – A, B and C. When you consider both the median value and the IQR, you can see that B is the top performer of the group. It has both the lowest median and the smallest range or variation in values. You might want to try to understand what B is doing and see if those “best practices” can be applied by A and C.
A previously posted Ask the Expert column examined the difference between median and mean values. As you can see in the above box and whisker plots, the mean (the ‘x’) is typically greater than the median (the line). This is often the case with cycle time data, which is why the median is often a better summary statistic for this type of data.
A: Each MCC metric includes a written description of the metric, a formula and a performance target. The purpose of the performance target is to establish performance expectations or a level of performance that is acceptable for that particular time, cost or quality measurement. Without a target, it is difficult to interpret the results and determine whether additional action is required – the target provides the context in which to interpret the results.
MCC Work Group participants define performance targets as part of the metric development process. Many of the MCC metrics developed by the Clinical Operations Sub-Group have green-amber-red performance levels. Results that fall into the “green zone” are good results; results in the “amber zone” are in the “to be watched” grouping as they fall outside of good results but don’t need immediate action steps; and results in the “red zone” are poor results that require action steps.
Some MCC metrics do not have standardized performance targets because targets vary by therapeutic area.
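As an illustration of how green-amber-red performance levels turn a raw result into an interpretation, here is a minimal Python sketch. The thresholds (green up to 7 days, amber up to 10 days) and the function name are hypothetical and are not actual MCC performance targets. The sketch also assumes a metric where lower values are better, such as a cycle time.

```python
def rag_status(value, green_max, amber_max):
    """Classify a result for a metric where lower is better."""
    if value <= green_max:
        return "green"  # good result
    if value <= amber_max:
        return "amber"  # to be watched, no immediate action
    return "red"        # poor result, action required


# Hypothetical median cycle times (calendar days) for three studies.
results = {"study_1": 6, "study_2": 9, "study_3": 14}

for study, days in results.items():
    print(study, days, rag_status(days, green_max=7, amber_max=10))
```

For a metric where higher values are better, such as a percentage of milestones met, the comparisons would be reversed.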
A: When looking to summarize data people often use the mean (also termed the common average). This works well when the distribution of data is symmetric and looks something like a normal curve. But often data is not normally distributed in this way. This is particularly true when measuring cycle times. These tend to have a peak at low values and then a long tail. The long tail pulls the mean so that it can be a long way from the peak of the distribution. For example, if the cycle times for completing Monitoring Visit Reports are 5, 9, 10, 10, 10, 11, 11, 24, and 56 days, the mean is 16.2 days. But when we look at the numbers, is 16.2 a good representation when 7 out of the 9 cycle times are less than 16.2? This is where the median works better. The middle value in this example is 10 and, by definition, half of the values are at or below it and half are at or above it. [Figure 1]
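The arithmetic in this example can be checked with a few lines of Python, using the nine cycle times quoted above:

```python
from statistics import mean, median

# Monitoring Visit Report cycle times (days) from the example in the text.
cycle_times = [5, 9, 10, 10, 10, 11, 11, 24, 56]

average = mean(cycle_times)   # 146 / 9
middle = median(cycle_times)  # 5th of the 9 sorted values
below_mean = sum(1 for days in cycle_times if days < average)

print(f"mean:   {average:.1f}")
print(f"median: {middle}")
print(f"{below_mean} of {len(cycle_times)} values are below the mean")
```

The two longest cycle times (24 and 56 days) are enough to pull the mean well above where most of the values sit, while the median stays put.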
There are non-parametric statistical tests that you can run comparing medians (as you can with a T-test for comparing means). They are not as powerful as the T-test but do not rely on the assumption of an underlying normal curve.
A word of caution about the term “average.” Most people use “average” and “mean” interchangeably. However, the definition is ambiguous – it can refer to the median or the mode of the data, too. Here’s a dictionary entry on the term.