Software

On this page you can find software (i.e., code and/or binaries) for a number of projects in which I have been involved. Are you looking for an implementation that is not listed here? Contact me. Currently available are:

DCM – Description-driven Community Mining

Description-driven Community Mining (DCM) [2] is our solution to finding a diverse set of cohesive communities with concise descriptions in a social network. It can build well-described, cohesive communities starting from any given description or seed set of nodes, which makes it flexible and easy to apply.

  • DCM binaries and C# source code
    Download (for Windows only).

DSSD – Diverse Subgroup Set Discovery

The latest release of my DSSD [3] implementation can be fully configured to:

  1. perform depth-first search or beam search;
  2. use a traditional top-k beam or one of the diverse beam selection strategies;
  3. do sequential or weighted covering using any of the depth-first or beam search strategies;
  4. perform post-selection using any of the subgroup selection strategies;
  5. use one of the Subgroup Discovery quality measures: Weighted Relative Accuracy (standard, multi-class, or numeric), Chi-squared, mean test, (Weighted) KL quality;
  6. use one of the Exceptional Model Mining quality measures: (Weighted) Kullback-Leibler quality, (Weighted) Krimp Gain quality.
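To give a feel for what a quality measure computes, here is a minimal sketch of standard Weighted Relative Accuracy for a binary target, written in Python for brevity. It is not taken from the DSSD code base; the function name and its parameters are chosen purely for illustration.

```python
def wracc(n_subgroup, n_pos_subgroup, n_total, n_pos_total):
    """Weighted Relative Accuracy of a subgroup for a binary target.

    WRAcc = coverage * (positive rate in subgroup - positive rate overall).
    All names here are illustrative, not part of the DSSD implementation.
    """
    if n_subgroup == 0 or n_total == 0:
        return 0.0  # an empty subgroup (or dataset) carries no evidence
    coverage = n_subgroup / n_total
    rate_in_subgroup = n_pos_subgroup / n_subgroup
    rate_overall = n_pos_total / n_total
    return coverage * (rate_in_subgroup - rate_overall)


# A subgroup covering 20 of 100 records, 16 of them positive,
# in a dataset where 40 of 100 records are positive overall.
score = wracc(n_subgroup=20, n_pos_subgroup=16, n_total=100, n_pos_total=40)
```

The measure trades off subgroup size against how much the target distribution deviates from the overall distribution: a subgroup with the same positive rate as the whole dataset scores zero regardless of its size.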

Provided are both Windows binaries and the C++ source code:

  • DSSD binaries
    Download -- Should run on any Windows platform (includes binaries for both x86 and x64).
  • DSSD C++ source code
    Download -- Includes solution and project files for Visual Studio 2010. The code hardly depends on platform-specific features, so it should also compile on other platforms and with other compilers.

Fast-Skyline – Efficient Discovery of the Cost-Influence Skyline

Fast-Skyline is an algorithm for computing approximate “skylines” (also known as Pareto fronts or non-dominated sets) of subsets of size k with respect to two functions, one linear and one submodular. That is, the algorithm approximates the set of non-dominated subsets of size k.

Van Leeuwen & Ukkonen 2015 describes this algorithm in the context of influence maximisation, where the subsets are sets of vertices, the seed sets. We consider the special case where the seed sets have different costs, defined as the sum of vertex-specific costs. We say that a seed set dominates another seed set if it has higher influence and lower cost.
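The notion of dominance can be illustrated with a small sketch in Python. This is not Fast-Skyline itself: it only filters a list of already evaluated candidates, each given as a hypothetical (cost, influence) pair, down to the non-dominated ones. It assumes the usual Pareto convention that a candidate dominates another if it is no worse on both criteria and strictly better on at least one.

```python
def skyline(candidates):
    """Return the non-dominated (cost, influence) pairs, sorted by cost.

    Assumption: lower cost and higher influence are better, and a pair is
    dominated if another pair is at least as good on both criteria and
    strictly better on one. Duplicate pairs are reported once.
    """
    # Sort by increasing cost; for equal cost, put the highest influence first.
    ordered = sorted(set(candidates), key=lambda p: (p[0], -p[1]))
    front = []
    best_influence = float("-inf")
    for cost, influence in ordered:
        # Every pair seen so far has cost <= this cost, so this pair is
        # non-dominated only if it strictly improves on the best influence.
        if influence > best_influence:
            front.append((cost, influence))
            best_influence = influence
    return front


seed_sets = [(1, 2), (2, 5), (3, 4), (2, 3), (4, 9), (1, 1)]
print(skyline(seed_sets))  # [(1, 2), (2, 5), (4, 9)]
```

After sorting, a single sweep suffices, so the filter runs in O(n log n) time for n candidates. The hard part that the algorithm addresses is that the number of candidate subsets is far too large to enumerate and evaluate like this.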

Krimp – Itemsets that Compress

Our implementation of Krimp [4] is freely available for research purposes; we provide both the C++ source code and binaries for Windows (x86 and x64) and Linux. In addition to the pattern set selection algorithm, it contains the Krimp classifier [5] and the StreamKrimp algorithm [6]. For your convenience, the package includes some example UCI datasets taken from the LUCS-KDD data library. Please refer to the documentation in the package for installation/compilation details and usage hints.

  • Krimp binaries and C++ source code
    Download (version of 1 February 2013)

SSG Miner – Subjective Interestingness of Subgraph Patterns

This is our implementation of SSG Miner (short for Subjective Subgraph Miner), as described in the paper Subjective Interestingness of Subgraph Patterns.

  • SSG Miner binaries and C++ source code
    Download (Windows binaries included, should also compile on other platforms).
  • Supplementary information: appendices B and C
    Download (PDF).

Spectra – Fast Estimation of the Pattern Frequency Spectrum

FastEst and Spectra [1] are algorithms for estimating the number of frequent itemsets in a dataset. Exactly counting the number of frequent itemsets is a #P-complete problem. Our approach, based on the classical algorithm by Knuth to estimate the size of a search tree, is much faster yet remains accurate.
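The idea behind Knuth's estimator can be sketched in a few lines of Python. The sketch below is not the FastEst or Spectra implementation; it is a plain version of Knuth's random-probe estimator applied to a simple itemset search tree, in which each itemset is extended only with items that come later in a fixed order. All function names are illustrative.

```python
import random


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_extensions(itemset, last, items, transactions, minsup):
    """Children of a search-tree node: frequent one-item extensions that use
    only items ordered after position `last`, so each itemset occurs once."""
    children = []
    for i in range(last + 1, len(items)):
        candidate = itemset | {items[i]}
        if support(candidate, transactions) >= minsup:
            children.append((candidate, i))
    return children


def knuth_probe(items, transactions, minsup, rng):
    """One random root-to-leaf walk. At each node the running weight is
    multiplied by the number of children; the sum of the weights is an
    unbiased estimate of the number of non-root nodes in the tree."""
    estimate, weight = 0.0, 1.0
    itemset, last = frozenset(), -1
    while True:
        children = frequent_extensions(itemset, last, items, transactions, minsup)
        if not children:
            return estimate
        weight *= len(children)
        estimate += weight
        itemset, last = rng.choice(children)


def estimate_frequent_itemsets(transactions, minsup, probes=1000, seed=0):
    """Average of `probes` independent Knuth probes (non-empty itemsets only)."""
    items = sorted(set().union(*transactions))
    rng = random.Random(seed)
    total = sum(knuth_probe(items, transactions, minsup, rng) for _ in range(probes))
    return total / probes


def count_frequent_itemsets(transactions, minsup):
    """Exact count by exhaustive depth-first search, for comparison."""
    items = sorted(set().union(*transactions))
    stack, count = [(frozenset(), -1)], 0
    while stack:
        itemset, last = stack.pop()
        children = frequent_extensions(itemset, last, items, transactions, minsup)
        count += len(children)
        stack.extend(children)
    return count


data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(count_frequent_itemsets(data, minsup=2))
print(estimate_frequent_itemsets(data, minsup=2, probes=2000))
```

Each probe visits a single path instead of the whole tree, which is why the estimator is cheap. Averaging more probes reduces the variance of the estimate.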

The C++ implementation was used for the experiments reported on in our ECML PKDD 2014 paper. We also provide a JavaScript-based implementation that runs in your browser; a description and some performance benchmarks are in this paper.

Translator – Association Discovery in Two-view Data

The Translator algorithms find small and non-redundant sets of associations that describe how the two views of two-view datasets are related, where two-view datasets are datasets whose attributes are naturally split into two sets. The models, dubbed translation tables, contain both unidirectional and bidirectional rules that span both views and provide lossless translation from either of the views to the opposite view. A score based on the Minimum Description Length (MDL) principle is used for model selection.

The implementation provided here was used for the experiments reported on in our TKDE paper.

  • Translator binaries and C++ source code
    Download (supported: Windows, Linux).

References

[1] van Leeuwen, M. & Ukkonen, A. Fast Estimation of the Pattern Frequency Spectrum. In Proc. ECML PKDD'14, pages ?, 2014.
[2] Pool, S., Bonchi, F. & van Leeuwen, M. Description-driven Community Detection. Transactions on Intelligent Systems and Technology, 5(2):?, ACM, 2014.
[3] van Leeuwen, M. & Knobbe, A. Diverse Subgroup Set Discovery. Data Min. Knowl. Discov., 25(2):208-242, Springer Netherlands, 2012.
[4] Siebes, A., Vreeken, J. & van Leeuwen, M. Item Sets that Compress. In Proc. SDM'06, pages 393-404, 2006.
[5] van Leeuwen, M., Vreeken, J. & Siebes, A. Compression Picks the Item Sets that Matter. In Proc. ECML PKDD'06, pages 585-592, 2006.
[6] van Leeuwen, M. & Siebes, A. StreamKrimp: Detecting Change in Data Streams. In Proc. ECML PKDD'08, pages 672-687, 2008.