# Advances in interactive Knowledge Discovery and Data Mining in complex and big data sets (15w2181)

## Organizers

Massimo Ferri (University of Bologna)

Randy Goebel (University of Alberta)

Andreas Holzinger (Medical University Graz)

Vasile Palade (Coventry University)

## Objectives

This workshop aims to borrow and adapt theoretical innovations in probabilistic models and related machine learning methods from other areas. It will focus on probabilistic data mining methods, including graph-based data mining, topological data mining and other information-theoretic approaches (e.g., entropy-based data mining), as well as on the “human-in-the-loop” concept, supported by an interactive learning and optimization component and by visual analysis of heterogeneous and dynamic data sets. For example, in network-based approaches, significant challenges include statistical extensions of graph-theoretical approaches, the visualization of networks, the epistemological meaning of inferred networks, the structural and comparative analysis of networks, and network-based biomarkers, to mention only a few. Classical mathematical techniques are often poorly suited to the task of analyzing, comparing, classifying and retrieving complex data.

Topology (and in particular algebraic topology) is, by its very nature, the part of mathematics which formalizes the qualitative aspects of objects. Therefore, topological data processing and topological data mining integrate well with more classical mathematical tools. For example, persistent homology combines geometry and algebraic topology in the study of pairs (X,f), where X is an object (topological space) and f is a continuous function defined on X (typically with real values). One application is the extraction of topological features of an object from a cloud of sample points. Another class of applications uses f as a formalization of a classification criterion; in this case, different functions yield different criteria, which cooperate in a complex classifier.
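To make the pair (X,f) concrete, here is a minimal sketch of 0-dimensional persistent homology in the simplest setting, assuming X is an interval sampled at equally ordered points and f is given by its sample values. The function name `sublevel_persistence_0d` and the sample data are hypothetical, chosen only for illustration. Components of the sublevel sets are born at local minima of f and die when they merge with an older component.

```python
import math


def sublevel_persistence_0d(values):
    """Return the 0-dimensional persistence pairs (birth, death) of the
    sublevel sets of a function sampled along a path (a 1-D grid)."""
    n = len(values)
    parent = list(range(n))   # union-find forest over sample indices
    birth = list(values)      # birth value of the component rooted at each index
    added = [False] * n       # which samples are already in the sublevel set

    def find(i):
        # Find the root of i, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    # Sweep the threshold upwards: add samples in order of increasing f.
    for i in sorted(range(n), key=lambda j: values[j]):
        added[i] = True
        for j in (i - 1, i + 1):
            if 0 <= j < n and added[j]:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at this merge.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:
                    # Skip pairs with zero persistence.
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    if n:
        # The component of the global minimum never dies.
        pairs.append((min(values), math.inf))
    return sorted(pairs)


diagram = sublevel_persistence_0d([0.0, 2.0, 1.0, 3.0, 0.5])
print(diagram)
```

Each returned pair is one point of the persistence diagram. Points far from the diagonal (long-lived components) are usually read as features, and points close to it as noise.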

Several problems arise from such settings. One, in the application context, is the choice of suitable functions f. This is generally done heuristically, but it would be necessary to have parameterized spaces of such functions and, eventually, a self-driving, optimized choice of f for statistical learning. Another challenge is the construction of good distances. The ones presently available need exponential computation. A third problem concerns functions with multidimensional range: functions from X to $\mathbb{R}$ give rise to diagrams whose information is condensed in a discrete (mostly finite) set of points in the plane; but, if the range is $\mathbb{R}^{k}$, the same information is carried by (2k-2)-dimensional patches in $\mathbb{R}^{2k}$. A one-dimensional reduction is available, but it raises computational problems in applications.
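The text does not spell out the one-dimensional reduction. One reduction of this kind from the multidimensional persistence literature replaces a vector-valued function f = (f_1, ..., f_k) by the scalar function max_i (f_i - b_i) / l_i, for a direction l with positive components and an offset b. Whether this is the reduction meant here is an assumption. The sketch below only evaluates that scalar function on sampled values; the name `reduce_along_line` and the data are hypothetical, and the usual normalisations of l and b are not enforced.

```python
def reduce_along_line(samples, direction, offset):
    """Collapse vector-valued samples f(x) in R^k to scalars using
    max_i (f_i(x) - b_i) / l_i.

    samples   : list of k-tuples, the values f(x) at each sample point x
    direction : k-tuple l with strictly positive components (assumed)
    offset    : k-tuple b
    """
    if any(l_i <= 0 for l_i in direction):
        raise ValueError("direction must have strictly positive components")
    return [
        max((f_i - b_i) / l_i for f_i, l_i, b_i in zip(fx, direction, offset))
        for fx in samples
    ]


# Two sample points with values in R^2, reduced along one direction.
scalars = reduce_along_line([(1.0, 2.0), (3.0, 0.0)], (0.5, 0.5), (0.0, 0.0))
print(scalars)
```

Each choice of (l, b) yields an ordinary real-valued function, to which one-dimensional persistence applies. The computational cost noted in the text arises because many such pairs have to be examined.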

This workshop will also discuss approaches beyond data mining.