How do you analyse real-world data?

The importance of machine learning and other analysis techniques for oncology

Key insights

  • Flexible forms of market authorisation require the analysis of real-world data.
  • Advanced data analysis techniques such as machine learning are ideally suited to overcoming the limitations of this data.
  • The techniques studied are promising: they allow for more refined patient stratification and can predict the effectiveness of a drug, taking many different factors into account.
  • However, federated data systems are required if these advanced analytical techniques are to be used in practice. 

There is a high demand for innovative cancer drugs. Yet their development is a complex and lengthy affair, longer and more expensive than that of conventional cancer treatments, and with little chance of success. The drugs that do make it through need to be made available as soon as possible. The EMA (European Medicines Agency) has therefore created flexible forms of market authorisation, such as conditional authorisation and adaptive pathways. As speed must not come at the expense of safety, these drugs must be monitored even after they have been put into circulation. This is no easy task. Tine Geldof's doctorate demonstrates that advanced data analysis techniques such as machine learning may offer a solution.

Real data is a mess

Randomised controlled clinical trials during drug development make use of survival analyses, statistical techniques for the analysis of time-to-event data. “However, these conventional techniques are not suitable for real-world data from everyday clinical practice,” explains Tine. “Unlike clinical trials, here you are faced with data that is ambiguous or missing, a diverse patient population, various possible combined therapies, various alternative treatments to compare with and numerous other disruptive factors which you cannot monitor, making it very difficult to uncover causal effects.”

Advanced data analysis techniques can handle these kinds of complex data sets, however. The possibilities offered by resources such as machine learning algorithms are well known and they are being applied successfully in a wide range of sectors. In the pharmaceutical sector, their use is mainly limited for the time being to the discovery phase. Given their potential, this is very unfortunate.
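
To make the term "time-to-event data" used above more concrete, here is a minimal sketch of a Kaplan-Meier survival curve, one common survival-analysis estimator. The article does not say which estimators the trials or the doctorate relied on, so this choice is an assumption. The function name `kaplan_meier` and the follow-up times are hypothetical toy values, not patient data.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  follow-up duration per patient
    events: 1 if the event was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)      # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]     # remove both events and censored patients
    return curve

# Hypothetical toy data (months of follow-up), invented for illustration.
toy_times = [2, 3, 3, 5, 8, 8, 12]
toy_events = [1, 1, 0, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
```

Censored patients (event flag 0) count as at risk until they drop out, which is what sets time-to-event analysis apart from simply averaging survival times.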

Fear of the black box

When choosing a treatment, doctors – just like payers and the regulator – currently rely on survival analyses from clinical trials. In the case of innovative cancer drugs that come onto the market via flexible routes, they could use analyses of real-world data as the basis for their decision. “However, there are considerable barriers to the use of new analytical techniques,” Tine says. “Human lives are involved, after all, and a common objection is that people don’t like leaving these kinds of decisions to a computer. To overcome this resistance, it is therefore important for the analysis algorithms to be transparent and allow interpretation.”

This is why she chose decision trees, one of the simplest methods of machine learning, and Bayesian networks (see box) for her research. Could these techniques provide reliable information about the performance of a cancer drug?

The real-world data she used came from the Belgian Cancer Registry (anonymous patient data and tumour data) and the InterMutualistic Agency (IMA), which provides data on reimbursed treatments and drugs.

Decision trees are very visual, intuitive and easy to interpret. Tine applied the technique to data on patients with glioblastoma, the most common and aggressive form of brain tumour, and treatment with temozolomide to determine the circumstances under which the treatment is effective, i.e. the type of patient for whom it works. This study confirmed the importance of age, a variable that was also used in the clinical studies, but Tine's model also revealed other influencing variables such as the combination with conventional chemotherapy. Comparing the results with those obtained by logistic regression, a statistical classification technique widely used in medical literature, made it possible to demonstrate that the classification of the decision-tree model was usable. The perfect model is impossible to achieve, but it is clear when a model could do better and sometimes that indicates missing data. “That was the case here, too,” Tine recalls. “We know from the literature that patients with a particular genetic abnormality react better to temozolomide, but unfortunately I did not have that data.”
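
To illustrate why a decision tree is easy to interpret, the sketch below shows its core step: choosing the variable and threshold that best separate two groups. It is a toy, not the model from the doctorate. The feature names (`age`, `combined_chemo`), the function names and all rows are hypothetical and invented for illustration, and Gini impurity is assumed as the splitting criterion.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(rows, labels, feature_names):
    """Return (weighted impurity, feature name, threshold) of the best split."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds lie midway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= thr]
            right = [y for r, y in zip(rows, labels) if r[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, thr)
    return best

# Invented toy rows: (age, combined_chemo); label 1 = treatment judged effective.
toy_rows = [(45, 1), (50, 1), (52, 0), (61, 1), (66, 0), (70, 0), (74, 1), (78, 0)]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
split = best_split(toy_rows, toy_labels, ["age", "combined_chemo"])
```

A full tree repeats this step inside each resulting branch. The outcome reads as a plain rule ("age at or below the threshold"), which is the kind of transparency the text refers to.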

Metastatic colorectal cancer

The Bayesian network was used on a dataset of patients with metastatic colorectal cancer, a cancer for which (unlike glioblastoma) more innovative alternatives are available and whose sufferers tend to survive for much longer. The dataset was therefore more complex, with more missing data. Tine’s model examined the effects of the targeted drugs aflibercept, bevacizumab, cetuximab and panitumumab. “Few clinical trials are available for targeted treatments at this point,” she says. “In addition, they compare the effect of one or a limited number of drugs to the standard treatment. A Bayesian network can analyse the full range and all possible combinations. As a result, it provides better insights into the optimal treatment route.”

More information, better decisions

In her doctorate, Tine wanted to check whether advanced data analysis could derive more information from real-world data than clinical studies provide. “It turns out that my models can stratify the patient population much more finely. This makes it possible to check at almost an individual level whether and for which patient a particular drug would be useful. In practice, these models should also help to determine the most appropriate treatment sequence or combination of drugs. They also provide useful information about the effectiveness of a drug.”

Such models are not intended to replace physicians but can help them to make more informed decisions, based on information provided by payers and/or the regulator or even – why not? – a specialised app. Another important point, with truly sustainable healthcare in mind: Tine's research is also a step towards the performance-based reimbursement of drugs.

The caveat

Every model stands or falls by its data and this is where the problem lies. “Collecting data is a long and complicated process. Fortunately, I was able to link the Belgian Cancer Registry's data to that of the IMA, but not to other useful data on biomarkers, for example. The information available is fragmented and incomplete, spread across different sources with different keys, etc.”

Does that mean that one central database is needed? “No, centralising data takes time and rapid data processing is crucial to ensuring flexible market authorisation based on real-world data,” Tine stresses. “We need to evolve towards a federated data system which does not bring the data to the analysis, but instead brings the analysis to the data. The data remains at the source, but can easily be linked. Researchers do not get the data on their computer but gain access to data from various different sources at the same time, for example using a cloud application. This kind of system is indispensable if you want to stimulate innovative research and sustainable healthcare.” EHDEN, the European Health Data and Evidence Network, is a recent European-scale initiative of this kind.

“And,” she concludes, “in addition to the technical challenges, we must not lose sight of the legal aspects either: the data must of course be collected in a secure manner in accordance with the GDPR.”
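
The principle of "bringing the analysis to the data" described above can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: the class `DataSite`, the functions and the records are all hypothetical, and it is not the API of EHDEN or any real platform. A real system would also need access control, a shared data format and GDPR-compliant governance.

```python
class DataSite:
    """Hypothetical data holder: individual records never leave this object."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # list of (treated, responded) booleans

    def run(self, analysis):
        # The analysis travels to the data; only its aggregate result returns.
        return analysis(self._records)

def count_responses(records):
    """Local analysis: counts among treated patients, no patient-level output."""
    treated = [r for r in records if r[0]]
    return {"n": len(treated), "responders": sum(1 for r in treated if r[1])}

def federated_response_rate(sites):
    """Coordinator: combine per-site aggregates into one overall rate."""
    partials = [site.run(count_responses) for site in sites]
    n = sum(p["n"] for p in partials)
    responders = sum(p["responders"] for p in partials)
    return responders / n if n else None

# Invented toy records for two hypothetical sources.
site_a = DataSite("registry", [(True, True), (True, False), (False, False)])
site_b = DataSite("insurer", [(True, True), (True, True), (False, True), (True, False)])
rate = federated_response_rate([site_a, site_b])
```

The coordinator only ever sees counts per site, so the pooled result is available without any source handing over its underlying records.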

What are Bayesian networks?
Like a decision tree, a Bayesian network is a model for analysing and classifying data. It can be used to calculate the relationships between factors and their impact on the end result. Each node in the network is a variable, and each value of that variable has a certain probability. The model parameters are therefore not constants but stochastic variables: their value is uncertain and characterised by a probability distribution, which in turn is not fixed but starts from prior knowledge and is updated on the basis of new data (the basic principle of Bayesian statistics). One of the most significant advantages is that, for small datasets and/or non-repeatable events (e.g. data from just one patient), the technique can still predict results with sufficient accuracy, precisely because it uses prior knowledge. The results of a Bayesian network indicate the degree of certainty of a particular hypothesis or claim. Strictly speaking, then, a Bayesian network is not a machine learning algorithm in itself but rather the Bayesian interpretation of one, i.e. the application of Bayesian statistical inference to such algorithms.
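
As a small worked illustration of the box above, the sketch below builds a three-node network (treatment and a genetic marker both influence response) and answers two questions by direct calculation. Every probability and name in it is invented for illustration and does not come from the doctorate. Treatment is treated as an observed input, so it needs no prior.

```python
# All probabilities below are invented for illustration only.
P_MUTATION = {True: 0.3, False: 0.7}   # prior belief about a hypothetical marker

# P(response = True | treated, mutation)
P_RESPONSE = {
    (True, True): 0.8,
    (True, False): 0.4,
    (False, True): 0.2,
    (False, False): 0.1,
}

def p_response_given_treatment(treated):
    """Marginalise over the unobserved mutation node."""
    return sum(P_MUTATION[m] * P_RESPONSE[(treated, m)] for m in (True, False))

def p_mutation_given_response(treated):
    """Bayes' rule: update the belief in the mutation after seeing a response."""
    joint = P_MUTATION[True] * P_RESPONSE[(treated, True)]
    return joint / p_response_given_treatment(treated)

p_resp = p_response_given_treatment(True)   # 0.3*0.8 + 0.7*0.4 = 0.52
p_mut = p_mutation_given_response(True)     # 0.24 / 0.52, about 0.46
```

Observing a single response raises the belief that the marker is present from the prior of 0.3 to about 0.46, which is the kind of update from prior knowledge that the box describes.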

Source: ‘Advanced analytics in pharmaceutical innovation: The use of real-world evidence in oncology’ by Tine Geldof. Doctorate in Biomedical Sciences at KU Leuven in 2019. Supervisors: Professor Walter Van Dyck (Vlerick Business School) and Professor Isabelle Huys (KU Leuven). Co-supervisor: Professor Lieven Annemans (Ghent University and Vlerick Business School).

