Raw NetFlow and IPFIX flow captures are insightful, yet their large volume
makes them challenging to analyze and query in a timely manner. In particular,
when these captures span long time periods or are collected at remote
locations, storing or transferring them for analysis becomes increasingly
expensive.
Enabling efficient execution of a wide range of queries over flow captures while reducing storage and transfer volume requires mergeable, succinct summaries that dynamically capture the most essential features of flows. However, the problem of building such structures remains open. In this work, we introduce Flowtree, a self-adjusting data structure of generalized flows that (1) reduces storage requirements by more than 95% while providing highly accurate answers for popular hierarchical flows, (2) minimizes the transfer cost of flow summaries, and (3) supports several operators with distributed execution and summarization across time and multiple sites. Our evaluation on different network traces confirms that Flowtree can accurately and promptly answer questions about flows using different feature sets.