Ever-increasing Internet traffic poses challenges to network
operators and administrators, who have to analyze large network datasets in a
timely manner to make decisions regarding network routing, dimensioning,
accountability and security. Network datasets collected at large networks
such as Internet Service Providers (ISPs) or Internet Exchange Points (IXPs)
can be on the order of terabytes per hour. Unfortunately, most current
network analysis approaches are ad hoc and centralized, and thus not scalable.
In this paper, we present Datix, a fully decentralized, open-source analytics system for network traffic data that relies on smart partitioning storage schemes to support fast join algorithms and efficient execution of filtering queries. We outline the architecture and design of Datix and evaluate it using real traces from an operational IXP. Datix addresses an important problem at the intersection of data management and network monitoring while utilizing state-of-the-art distributed processing engines. In brief, Datix answers queries within minutes, compared to more than 24 hours of processing for existing Python-based code in single-node setups. Datix also achieves a speedup of nearly 70% compared to baseline query implementations of popular big data analytics engines such as Hive and Shark.