Approximate Integration of Streaming Data
Abstract
We approximate analytic queries on streaming data with weighted
reservoir sampling. For a stream of tuples from a data warehouse, we show how to
approximate some OLAP queries. For a stream of graph edges from a social
network, we approximate the communities as the large connected components
of the edges in the reservoir. We show that, for a model of random graphs that
follow a power-law degree distribution, this community detection algorithm is a
good approximation. Given two streams of graph edges from two sources, we
define the Community Correlation as the fraction of nodes that belong to
communities in both streams. Although we do not store the edges of the streams, we can
approximate the Community Correlation and define the Integration of two
streams. We illustrate this approach with Twitter streams associated with TV
programs.
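
The pipeline summarized above (sample edges into a reservoir, take the large connected components as communities, compare the communities of two streams) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the paper: it uses plain uniform reservoir sampling (Algorithm R) in place of the weighted scheme, whose weights are not specified here; the function names, the `min_size` threshold, and the normalization of the correlation by the union of community nodes are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def reservoir_sample(stream, k, rng):
    """Uniform reservoir sampling (Algorithm R): keep at most k items.

    Assumption: uniform sampling stands in for the weighted variant.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def large_components(edges, min_size):
    """Connected components with at least min_size nodes, via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return [g for g in groups.values() if len(g) >= min_size]


def community_correlation(comms_a, comms_b):
    """Share of community nodes that appear in communities of both streams.

    Assumption: normalized by the union of community nodes of the two streams.
    """
    nodes_a = set().union(*comms_a)
    nodes_b = set().union(*comms_b)
    union = nodes_a | nodes_b
    return len(nodes_a & nodes_b) / len(union) if union else 0.0
```

A possible usage, with `stream_a` and `stream_b` standing for two hypothetical iterables of `(u, v)` edge pairs: sample each stream with `reservoir_sample(stream_a, k, random.Random(0))`, pass each reservoir to `large_components`, and compare the two results with `community_correlation`. Only the two reservoirs are kept in memory, never the full streams.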