Complex analysis over heterogeneous data sources has become the basis of critical decision-making. For many analytics tasks, the actual content of the data now matters more than its size. Analysts are overwhelmed by the number of available sources as they search for the "right data" that maximizes analytics performance and gives their company a competitive edge. The effect that different data inputs have on analytics tasks is still a missing piece of the puzzle. The ARMADA project (Boosting Analytics Performance: A Data-driven Approach) plans to examine and extract a quantitative link between data and the analytics workflows that consume them. This work studies the novel notion of using dataset similarity to infer operator behavior and to build scalable, operator-agnostic performance models for popular real-life big-data tasks.
Data profiler is our open-source tool for modeling various analytics operators over multi-dataset inputs.
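As a rough illustration of using dataset similarity to infer operator behavior, the sketch below estimates an operator's output on an unseen dataset from the outputs measured on the most similar, already profiled datasets. It is not the ARMADA method or the Data profiler API: the feature vectors, the similarity function, and the names `similarity` and `predict_operator_output` are all assumptions made for this example.

```python
import math


def similarity(a, b):
    """Map the Euclidean distance between two feature vectors to a score in (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def predict_operator_output(target, profiled, k=2):
    """Estimate an operator's output on `target` from its k most similar profiled datasets.

    `profiled` is a list of (feature_vector, measured_output) pairs.
    The estimate is the similarity-weighted average of the k measured outputs.
    """
    ranked = sorted(profiled, key=lambda p: similarity(target, p[0]), reverse=True)[:k]
    weights = [similarity(target, features) for features, _ in ranked]
    return sum(w * out for w, (_, out) in zip(weights, ranked)) / sum(weights)


# Hypothetical profiles: (mean, standard deviation) of a column -> measured operator output.
profiled = [
    ((0.0, 1.0), 10.0),
    ((0.1, 1.1), 12.0),
    ((5.0, 3.0), 40.0),
]

# The unseen dataset resembles the first two profiles, so the estimate falls between their outputs.
estimate = predict_operator_output((0.0, 1.0), profiled, k=2)
```

Because the estimate depends only on dataset features and previously measured outputs, the same routine can be reused for any operator, which is the sense in which such a model would be operator-agnostic.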
Dimitrios Tsoumakos, PhD
Associate Professor,
Department of Informatics, Ionian University.
Adjunct Researcher,
Computing Systems Lab,
National Technical University of Athens