My main research interest is exploratory data mining: how can we enable domain experts to explore and analyse their data, to discover structure and ultimately novel knowledge?
The approach I take is to define and identify patterns that matter, i.e., succinct descriptions that characterise relevant structure present in the data. Which patterns matter strongly depends on the data and task at hand, hence defining the problem is one of the key challenges of exploratory data mining. Moreover, I find it very interesting to do fundamental data mining for real-world applications; there is no better way to show the potential of exploratory data mining than by demonstrating that patterns matter.
- PDM – Pattern-based Data Modelling
- IPM – Interactive Pattern Mining
- PaMaS – Patterns that Matter for Science
- PaMaI – Patterns that Matter for Industry 4.0
- DAMIOSO – Data Mining on High Volume Simulation Output
- SAPPAO – A Systems Approach towards Data Mining and Prediction in Airlines Operations
Selected previous projects:
- PROMIMOOC – Process Mining for Multi-Objective Online Control
PDM – Pattern-based Data Modelling
Sep 2005 - now
with Antti Ukkonen, and many others
Patterns are ideally suited to characterise structure in data, but traditional pattern mining approaches tend to generate far too many patterns, many of them redundant. This issue can be addressed by constructing pattern-based models that accurately yet non-redundantly capture the relevant structure in the data. Information-theoretic principles, such as the Minimum Description Length (MDL) and Maximum Entropy principles, are very helpful to this end. Pattern-based modelling can be applied to many data types and tasks.
Recent papers on this topic include Subjective Interestingness of Subgraph Patterns, on community detection using the Maximum Entropy principle, and Association Discovery in Two-View Data, on two-view summarisation using the MDL principle.
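To make the MDL idea concrete, here is a minimal sketch of a two-part description length for itemset data. It is a simplified code-table encoding chosen for illustration, not the method of the papers mentioned above. The function names, the toy data, and the crude model cost are all assumptions.

```python
from collections import Counter
from math import log2


def cover(transaction, patterns):
    """Greedily cover a transaction with non-overlapping patterns, longest first."""
    remaining = set(transaction)
    used = []
    for p in sorted(patterns, key=lambda p: (-len(p), sorted(p))):
        if p <= remaining:
            used.append(p)
            remaining -= p
    return used


def total_length(data, patterns):
    """Two-part description length in bits: L(model) + L(data | model)."""
    items = {i for t in data for i in t}
    # Singletons are always available, so every transaction can be covered.
    table = {frozenset(p) for p in patterns} | {frozenset([i]) for i in items}
    usage = Counter(p for t in data for p in cover(t, table))
    total = sum(usage.values())
    # Shannon-optimal code length: frequently used patterns get short codes.
    code_len = {p: -log2(u / total) for p, u in usage.items()}
    data_bits = sum(usage[p] * code_len[p] for p in usage)
    # Crude model cost (an assumption): each used pattern lists its items
    # at log2(|items|) bits apiece and then stores its own code.
    model_bits = sum(len(p) * log2(len(items)) + code_len[p] for p in usage)
    return model_bits + data_bits


# Hypothetical toy data in which "a" and "b" always occur together.
data = [{"a", "b", "c"}] * 6 + [{"a", "b"}] * 2 + [{"c"}] * 2
baseline = total_length(data, [])
with_pattern = total_length(data, [{"a", "b"}])
print(f"singletons only: {baseline:.1f} bits, with {{a, b}}: {with_pattern:.1f} bits")
```

Under this encoding, adding the pattern {a, b} shortens the total description, which is the sense in which MDL selects patterns that capture structure without redundancy.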
IPM – Interactive Pattern Mining
Sep 2011 - now
with Vladimir Dzyuba, Siegfried Nijssen, and Luc De Raedt
It is often hard to define upfront which patterns are 'interesting'. One way to address this problem is to keep the human in the loop: by visualising data and patterns and asking the domain expert for feedback, it is possible to learn and model the user's preferences. As a result, interactive data mining has the potential to discover more interesting patterns with less effort.
We combined pattern mining with machine learning techniques to establish Interactive Learning of Pattern Rankings. Further, I wrote an overview of the state of the art and future directions titled Interactive Data Exploration using Pattern Mining.
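As an illustration of learning preferences from feedback, the sketch below fits a linear utility over pattern features from pairwise judgements ("this pattern is more interesting than that one") and then re-ranks the patterns. A simple perceptron update is used here for illustration and is not necessarily the learner used in the paper above. The feature choice (relative support, normalised length), the pattern names, and the feedback are hypothetical.

```python
def learn_weights(features, preferences, epochs=20, lr=0.1):
    """Fit a linear utility from pairwise feedback given as (better, worse) pairs."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in preferences:
            diff = [b - c for b, c in zip(features[better], features[worse])]
            # Update only when the current weights fail to rank the pair correctly.
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def rank(features, w):
    """Order patterns by decreasing learned utility."""
    def utility(p):
        return sum(wi * xi for wi, xi in zip(w, features[p]))
    return sorted(features, key=utility, reverse=True)


# Hypothetical patterns described by (relative support, normalised length).
features = {
    "{a}": (0.9, 0.25),
    "{a,b}": (0.6, 0.5),
    "{a,b,c}": (0.4, 0.75),
    "{d}": (0.2, 0.25),
}
# Hypothetical feedback: the expert prefers the first pattern of each pair.
preferences = [("{a,b,c}", "{a}"), ("{a,b}", "{d}")]

weights = learn_weights(features, preferences)
ranking = rank(features, weights)
print(ranking)
```

In an interactive setting, the ranking would be shown to the expert again and the loop repeated with the new feedback.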
PaMaS – Patterns that Matter for Science
Sep 2011 - now
with Thanh Le Van, Daniëlle Copmans, Peter de Witte, Michael Berthold, Esther van den Bos, Guida Veiga, Carolien Rieffe, and others
To demonstrate the potential of data mining, and of pattern-based approaches in particular, I work with many colleagues from other sciences. For example: 1) we have developed new pattern types and algorithms for the Simultaneous discovery of cancer subtypes and subtype features by molecular data integration (in bioinformatics); 2) we have established an improved analysis and workflow for high-throughput screening for drug discovery (in pharmaceutical biology); 3) we are applying subgroup discovery to find associations between social anxiety and physiological data recorded while participants give a presentation (in psychology); and 4) we are applying social network analysis methods to find associations between social competences and child behaviour on the playground (in psychology).
PaMaI – Patterns that Matter for Industry 4.0
Sep 2015 - now
with KLM, and others (see projects below)
Industry 4.0 is a broad term that encompasses current trends toward automation and the use of data in industry, with the ultimate goal of creating smart factories and/or improving operations. Data mining and machine learning techniques are very important ingredients to this end, and applying them in this setting poses many interesting challenges. For example, I have been working with KLM Royal Dutch Airlines to analyse flight operation data. Moreover, each of the following three projects also belongs (at least partially) in this category.
DAMIOSO – Data Mining on High Volume Simulation Output
Sep 2015 - now
with Sander van Rijn, Thomas Bäck, Michael Emmerich, Michael Lew, and
Lars Graening and Markus Olhofer at Honda Research Institute Europe
The DAMIOSO project, funded by NWO and Honda Research Institute Europe, focuses on developing algorithms and tools for data management, data mining and knowledge extraction from the massive volumes of data generated by modern simulation tools. Such tools are used in a wide range of industries (aerospace, automotive, shipping, and others); the aim is to deliver advanced design and process optimisation that supports engineers in their design processes.
SAPPAO – A Systems Approach towards Data Mining and Prediction in Airlines Operations
June 2016 - now
with Hugo Proença, Michael Emmerich, Thomas Bäck, and
partners at IIT Roorkee and GE Aviation (India)
By analysing historical flight data and data on the associated disruptive events on the flight network, the NWO-DeitY SAPPAO project aims to improve the accuracy and reliability of predictions of scheduled flight times, thereby potentially saving millions of euros through better utilisation of airplanes, decreased fuel consumption, decreased CO2 emissions, reduced ambient noise and better use of time for passengers and airports.
In particular, at LIACS we will focus on feature construction for improved flight predictability and reduced airline operating cost. The challenge in this prediction task is that it is not clear which features should be used to obtain the best estimates. There is a wide range of available data, including network data, time series data, and so on, which cannot be used straightforwardly in existing attribute-value-based machine learning and statistical techniques. This project will deal with these challenges.
PROMIMOOC – Process Mining for Multi-Objective Online Control
Sep 2015 - Aug 2016
with Bas van Stein, Hao Wang, Thomas Bäck, Wojtek Kowalczyk, Michael Emmerich,
Mark Raasveldt and Stefan Manegold at CWI, and
industrial partners BMW Group and Tata Steel
The PROMIMOOC project, funded by NWO, BMW Group and Tata Steel, aims at developing a generic platform for data collection and integration, data-driven modelling and model-based online process control, with which steel production can be adapted and optimised in real time.
Like any high-end industrial production process, steel coil production at Tata Steel and automotive stamping at BMW typically generate huge volumes of high-dimensional process control and product quality data, spread over several plants and process stages. The goals of this project include the integration of data mining, distributed in-memory databases and nonlinear optimisation, and the development of generic techniques for model-based multi-objective process control.