Loading...

Using Additional Information in Streaming Algorithms

©2016 Textbook 128 Pages

Summary

Streaming problems are algorithmic problems that are mainly characterized by their massive input streams. Because of these data streams, the algorithms for these problems are forced to be space-efficient, as the input stream length generally exceeds the available storage.
The goal of this study is to analyze the impact of additional information (more specifically, a hypothesis of the solution) on the algorithmic space complexities of several streaming problems. To this end, different streaming problems are analyzed and compared. The two problems “most frequent item” and “number of distinct items”, with many configurations of different result accuracies and probabilities, are deeply studied. Both lower and upper bounds for the space and time complexity for deterministic and probabilistic environments are analyzed with respect to possible improvements due to additional information. The general solution search problem is compared to the decision problem where a solution hypothesis has to be satisfied.

Excerpt

Table Of Contents



Contents
List of Figures
List of Tables
1
Introduction
1.1
Overview and Structure of the Thesis
. . . . . . . . . . . . . . . . .
Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . .
2
Related Work
3
Definitions
3.1
Streaming Problems and Algorithms . . . . . . . . . . . . . . . . . .
3.2
Determinism and Randomization . . . . . . . . . . . . . . . . . . . .
3.3
Space and Time Complexity . . . . . . . . . . . . . . . . . . . . . . .
3.4
Approximation Ratio and Success Probability . . . . . . . . . . . . .
Approximation Ratio of a Streaming Counting Problem . . . . . . .
Success Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5
Solvability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.6
Basic Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
Lower Bounds on Space Complexity
4.1
Introduction to Communication Complexity . . . . . . . . . . . . . .
4.2
Lower Bound for General Streaming Problems . . . . . . . . . . . . .
Lower Bound for an Exact Deterministic Streaming Problem
. . . .
Lower Bound for an Exact Probabilistic Streaming Problem . . . . .
Lower Bound for an Approximative Deterministic Streaming Problem
Lower Bound for an Approximative Probabilistic Streaming Problem
4.3
Lower Bounds for Hypotheses Verifications
. . . . . . . . . . . . . .
5
Most Frequent Item Problem
5.1
General Streaming Problem Analysis . . . . . . . . . . . . . . . . . .
Exact Deterministic Most Frequent Item Problem . . . . . . . . . . .
Exact Probabilistic Most Frequent Item Problem . . . . . . . . . . .
Approximative Deterministic Most Frequent Item Problem
. . . . .
Approximative Probabilistic Most Frequent Item Problem . . . . . .
Summary of the General Most Frequent Item Problem . . . . . . . .
5.2
Hypothesis Verification Analysis
. . . . . . . . . . . . . . . . . . . .
5.3
Analysis Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . .
v

6
Number of Distinct Items Problem
6.1
General Streaming Problem Analysis . . . . . . . . . . . . . . . . . .
Exact Deterministic Number of Distinct Items Problem . . . . . . .
Exact Probabilistic Number of Distinct Items Problem . . . . . . . .
Approximative Deterministic Number of Distinct Items Problem . .
Approximative Probabilistic Number of Distinct Items Problem . . .
Summary of the General Number of Distinct Items Problem . . . . .
6.2
Hypothesis Verification Analysis
. . . . . . . . . . . . . . . . . . . .
Verification of All Possible Hypotheses . . . . . . . . . . . . . . . . .
Justification of One Correct Hypothesis . . . . . . . . . . . . . . . .
6.3
Analysis Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
Conclusion
7.1
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.2
Conclusion
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3
Future Work
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

List of Figures
1.1
Thesis Overview
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1
Illustration of the space complexity of the exact deterministic most
frequent item problem . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2
Illustration of the lower bound factor . . . . . . . . . . . . . . . . . .
5.3
Illustration of h(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.1
Illustration of the probability distribution of r
max
. . . . . . . . . . .
6.2
Illustration of the probability gap of r
max
. . . . . . . . . . . . . . .
List of Tables
5.1
Summary of general most frequent item problem . . . . . . . . . . .
6.1
Summary of general number of distinct items problem . . . . . . . .
vii


Chapter 1
Introduction
The title of this thesis, Using Additional Information in Streaming Algorithms,
presses for three simple questions: What is a streaming algorithm? What is meant
by the all-encompassing additional information? And what can additional informa-
tion be used for in streaming algorithms? We address these questions in the following.
What is a streaming algorithm?
A streaming algorithm processes massive data
streams to compute a certain function on these data. From a practical point of view,
a data stream means that these data are successively generated and not completely
known upfront. The streaming algorithm processes the generated data piecewise to
calculate intermediate and final results for a given streaming problem. Furthermore,
streaming algorithms may only read the streaming data once because, generally, they
cannot store the entire data stream, as the massive stream length often exceeds the
possible storage capacity. Facing these conditions, the basic question is whether a
streaming problem can be solved within practical time and space boundaries or not.
Generally, a poly-logarithmic space complexity in relation to the data stream size is
considered practical. The time needed to process a current data value is also required
to be low, as otherwise the new values generated have to be cached in the meantime,
i.e., stored, which might on the other side exceed the allowed storage size because
queues build up when incoming data stream exceeds processing capacity. The study
of streaming problems includes both the identification of streaming algorithms solving
the streaming problem with a certain space and time complexity and the analysis of
lower bounds on the required intermediate storage size to successively calculate a
certain function on the data stream. The following examples illustrate the kind of
thoughts and analysis necessary to construct an algorithm that solves a streaming
problem.
As an example, a simple streaming problem is the identification of the maximal
integer value within a massive data stream of integers. Let us assume that this data
stream contains 10
8
numbers, each with any integer value between one and 10
9
. Of
course, one can easily solve this problem with an algorithm that simply tracks the
highest already known value. At the end, after reading all 10
8
values, this algorithm
outputs the highest observed value between one and 10
9
. This algorithm would only
require small storage to encode this highest value.
A slightly more complex example would be the calculation of the k-th highest
1

2
Chapter 1. Introduction
value within this same data stream. Once again one could store the already known,
now k highest values and would therefore use k times more storage than for the simple
streaming problem above. For small k, e.g., k < 100, this is generally still practical.
A more difficult streaming problem is, e.g., the calculation of the most frequent value,
i.e., the value which occurs the most within the data stream. Here, one cannot easily
generate a streaming algorithm using sub-linear storage. Furthermore, we will later
present a proof that shows that this problem requires at least linear storage.
With the technique of communication complexity, we will be able to prove space
complexity lower bounds, which imply that any possible algorithm that solves a
certain streaming problem requires at least a certain space size. If the identified
algorithm requires the same amount of bits for its storage as the proven lower bound,
we have found an optimal algorithm in relation to the space complexity. If there is a
gap between the upper and lower bound, it might be possible to find an algorithm
with a lower space complexity.
These types of analyses - finding time and space efficient algorithms to solve a
streaming problem and proving necessary space complexities - are the core of the
study of streaming algorithms. This leads to the second question:
What is meant by the all-encompassing additional information?
Addi-
tional information may help solve a certain streaming problem. One possible usage
of this additional information is when it represents a solution hypothesis. This means
that the streaming algorithm receives a solution hypothesis upfront, e.g., that the
number 20 appears within the data stream the most with a frequency of 39 times.
Now the streaming problem is transformed from a general solution search problem
to the decision problem of verifying whether this hypothesis is true for the given
input stream. For this decision problem of hypothesis verification, one can similarly
analyze the possible space and time complexities.
For the hypothesis verification, a streaming algorithm has to verify any possible
hypothesis for a certain input stream. This type of problem is relevant, if we have a
powerful but untrusted source. Then, we may calculate a solution hypothesis with
this source solving the general streaming problem and verify the hypothesis with an
eventually more space and time efficient algorithm.
What can additional information be used for in streaming algorithms?
The concepts of streaming problems with and without additional information are
compared. Namely, the possible algorithms and proofs for both the general solution
search problem and the hypothesis verification problem determine whether the
additional information is helpful, i.e., the required time and space complexities are
lower. In some cases, the additional information is useless for solving the streaming
problem with lower algorithmic complexities.
More specifically, this thesis introduces some concepts to evaluate, whether the
hypothesis verification is significantly easier than the general solution search problem.
Here, easier means that the algorithm to verify the hypothesis is faster or requires
less storage than the algorithm for the general solution search problem requires in the
optimal case. Additionally, some streaming problems are stated where the hypothesis
verification is equally difficult as the general streaming problem, which means that
the additional information is not useful for the streaming algorithm at all.

1.1. Overview and Structure of the Thesis
3
1.1
Overview and Structure of the Thesis
The goal of this thesis is to analyze the impact of additional information (more
specifically, a hypothesis of the solution) on the algorithmic space complexities of
several streaming problems.
To this end, different streaming problems are analyzed and compared. The two
problems most frequent item and number of distinct items, with many configurations
of different result accuracies and probabilities, are deeply studied. Both lower and
upper bounds for the space and time complexity for deterministic and probabilistic
environments are analyzed with respect to possible improvements due to additional
information. The general solution search problem is compared to the decision problem
where a solution hypothesis has to be satisfied.
illustrates this approach, which is explained in more detail in the
following. Based on this illustration, the following different layers are described:
Streaming problems (gray boxes), decision vs. search problems (dark blue boxes),
deterministic vs. probabilistic settings (light blue boxes), exact solutions vs. ap-
proximations (yellow boxes) and lower vs. upper bounds of algorithmic complexities
(orange boxes).
Figure 1.1. Thesis Overview
Streaming Problems:
The following streaming problems are analyzed. The goal
of the most frequent item problem is to identify the value which is most frequent
within a data stream x = (x
1
, x
2
, . . . , x
n
). The streaming problem number of distinct
items identifies the amount of different values, i.e., the size of the subset of the data
stream without any duplications. As sketched in the last section of this thesis, the
introduced approach and technique to analyze upper and lower bounds that are used
to analyze the two streaming problems can be used for further streaming problems
as well.

4
Chapter 1. Introduction
Decision vs. Search Problems:
Our aim is to prove or disprove the possible
benefits of additional information (in the form of a solution hypothesis) for solving
one of the mentioned streaming problems. Therefore, the best possible algorithms
with respect to space and time complexity are identified for both the search problem,
where the optimal solution has to be found and for the decision problem, where one
has to verify or refuse a certain solution hypothesis. One can verify that the decision
problem is not more complex than the search problem because, for the decision
problem, one can always compute the solution as in the search problem and compare
it to the hypothesis. Therefore, the question is, whether the decision problem (with
additional information, i.e., a hypothetical solution) has a smaller complexity, and if
so, by how much.
Deterministic vs. Probabilistic Approaches:
Deterministic algorithms differ
from probabilistic ones in both the result accuracy and the required time and
space complexity. Therefore, the impact of additional information (i.e., a solution
hypothesis) is compared for both deterministic and probabilistic algorithms.
Exact Solutions vs. Approximations:
Streaming optimization problems such
as the most frequent item problem may allow an approximation of the solution.
Identifying an exact solution and allowing approximative results yield fundamentally
different implications on the time and space complexities of its streaming algorithms.
Lower vs. Upper Bounds:
For a certain streaming problem and its configuring
setting (decision or search problem, deterministic or probabilistic approach, and
exact solution or approximation), the best possible algorithms with respect to space
and time complexity are identified. Concrete algorithms deliver upper bounds. Fur-
thermore, formal proofs lead to space complexity lower bounds for certain streaming
problems.
Structure of the Thesis
After this introduction,
discusses an overview of the literature providing
similar topics.
introduces and describes the relevant definitions.
covers the approaches to prove space complexity lower bounds.
and
contain the detailed analysis of the most frequent item problem and the
distinct item problem. The summary of the results, the thesis conclusion and an
outlook on possible future work is given in

Chapter 2
Related Work
In this chapter, we introduce some relevant papers, articles and books for the topic of
this thesis. We first state some related work on streaming problems and algorithms.
To prove space complexity lower bounds, we use the technique of communication
complexity. Therefore, we state some resources about communication complexity
and its usage on other problems. At last, we list five papers on the two analyzed
streaming problems, the most frequent item problem and the number of distinct items
problem.
Streaming Problems and Algorithms:
There are different resources addressing
the area of streaming problems and data streams. Aggarwal [
] gives a brief
overview of the characteristics of a data stream and lists several interesting streaming
problems. Babcock et al. [
] study data streams from a more practical point
of view, including optimized database queries and time measures. A more formal and
theoretical analysis of some streaming problems is given by Muthukrishnan [
The doctoral thesis by Prakash about Efficient Delegation Algorithms for Outsourcing
Computations on Massive Data Stream [
] is a recommended literature because
it introduces the relevant parts of streaming problems, their characteristics and their
analysis including lower-bound proofs.
Communication Complexity:
For a significant analysis of streaming problems,
we have to prove space complexity lower bounds, besides designing efficient streaming
algorithms. For these proofs, we will use the technique of communication complexity,
which was originally introduced by Yao [
]. The principles of communication
complexity are introduced in detail later on, namely, in
. A more broad
and formal introduction is given by Hromkovic [
]. A more practical resource
on communication complexity is given by Roughgarden [
], which focusses on
communication complexity for algorithm designers. Halstenberg and Reischuk [
introduce different communication complexity models and Kremet et al. [
focus on one of these models, the one-round communication complexity, which is also
the basis for the lower-bound proofs of the two analyzed streaming problems. Babai
et al. [
] introduce different communication complexity classes. Yao [
Paturi and Simon [
], as well as Ablayev [
] wrote useful papers on proving
lower bounds in a probabilistic setting. For our lower bound proofs in a randomized
setting, we will use the theorems by Ablayev [
5

6
Chapter 2. Related Work
Based on the communication complexity theory, there are different papers il-
lustrating the usage of this technique.
As an example, Kalyanasundaram and
Schnitger [
] analyze the probabilistic communication complexity of set intersec-
tion. The distributional complexity of disjointness is analyzed by Razborov [
Munro and Paterson [
] focus on selection and sorting with limited storage
and the approximate counting of inversions in a data stream is studied by Ajatai
et al. [
]. Indyk and Woodruff [
] analyze the optimality of frequency
moment algorithms.
Most Frequent Item and Number of Distinct Items Problem:
The two
studied streaming problems of this thesis are analyzed in several resources. Carmode
and Hadjieleftheriou [
] as well as Manku and Motwani [
] give concrete
implementations of algorithms for the most frequent item problem. The paper
Space Complexity of Approximating the Frequency Moments by Alon et al. [
introduces an efficient algorithm for the number of distinct items problems and
analyzes the space complexity lower bounds of both streaming problems for a
deterministic and probabilistic setting. Karp et al. [
] as well as Trevisan and
Williams [
] state further lower bounds for the most frequent item problem for
some conditions of the input values.
The relevant results of these resources are gathered in detail in the corresponding
chapters of the streaming problems, namely in
for the most frequent item
problem and in
for the number of distinct items problem.

Chapter 3
Definitions
Before we can start analyzing concrete streaming algorithms with respect to space
and time complexity, we first have to formally define what a streaming algorithm is.
Yet before, a definition for a streaming problem has to be provided. Additionally,
in the following chapters we will often use special cases of streaming problems or
algorithms ­ these special cases are also introduced here. Therefore, this chapter
contains several basic definitions for streaming problems and algorithms.
First, the general definitions for streaming problems, algorithms and their spe-
cial cases are given (
). Then, for streaming algorithms, we
introduce how the time and space complexities are identified (
). The
problem characteristics success probability and approximation ratio are explained
). Definitions on solvability (
) and a few basic calculations for
further analysis (
) conclude this chapter.
3.1
Streaming Problems and Algorithms
In this section, streaming problems and algorithms are introduced and formally
defined. First, the general definition of a streaming problem is explained and then,
two special cases are defined: The streaming counting problem and the streaming
decision problem. All analyzed streaming problems can be reduced to one of these
two. The general definition of the streaming problem is introduced to show the
similarity of these two cases and to avoid duplications of definitions and theorems
that are valid for both cases. These formal definitions are required to allow the
mathematical analysis of space complexity lower bounds.
As outlined in the introduction (
), a streaming problem is mainly
characterized by the enormous size of the data stream as its input. As we will see in
the following definition, this input stream is defined as a sequence of n input values,
where n represents a very large number. The set of feasible output values defines
which outputs are allowed for every possible input stream. Finally, the problem
function defines which one of the feasible output values is the optimal or correct one
for a certain input stream. All of this together leads to the following definition.
7

8
Chapter 3. Definitions
Definition 3.1. A streaming problem S is a triple (X , Y, f ) containing a class
of input streams X , where each input stream x
j
X is a sequence of n input values,
i.e., x
j
= (x
j
1
, . . . , x
j
n
), with n N
+
. The second element of the triple is a class
of feasible output sets Y = {Y
x
1
, . . . , Y
x
d
} with d = |X |, where every input stream
x
j
X has a corresponding feasible output set Y
x
j
. Additionally, the streaming
problem contains a problem function f : X Y
x
, where, for every input stream
x X , one or more values from the feasible output set are indicated as optimal or
correct output value, namely f (x) Y
x
.
With this general definition of a streaming problem, one can either define a
classical optimization problem or a decision problem. For optimization problems, one
has to be defined how good non-optimal output values are, such that this metric can
be optimized. For a decision problem, the problem function can simply be interpreted
as determining the correct decisions.
As an illustration of this definition, the simple streaming problem from above
), in which the task was simply to identify the maximum integer value
within the input stream, could be formally defined as follows.
· Input streams: Each input stream x = (x
1
, . . . , x
n
) X contains a series of n
integers between 0 and a fixed m N
+
, namely, x
i
{0, . . . , m}.
· Feasible output sets: For every input stream x X , the corresponding output
set allows every integer between 0 and m, namely Y
x
= {0, . . . , m} Y.
· Problem function: For every certain input stream x X , it defines the optimal
or correct output value from the feasible output set Y
x
= {0, . . . , m}, which is
f (x) = max{x
i
| 1 i n}.
With this definition of X , Y, and f , we defined a streaming problem S = (X , Y, f ),
that searches for the maximum value in an input stream. The introduced formalism
seems to be too complicated for such a simple problem. But for more complicated
problems, this level of formalism is required to identify algorithms that solve it and,
especially, to prove some space complexity lower bounds.
Now, we will distinguish two special cases of the general streaming problem, the
streaming counting problem and the streaming decision problem.
First, we will introduce the streaming counting problem. In counting problems,
we deal with series of integers as input streams and integers as output values, and
the aim is always to evaluate some counting function. Some examples of streaming
counting problems are the sum of all input values, their median, or the most frequent
item.
A streaming counting problem has input streams with n integer values from the
set from 1 to m N
+
. The resulting counting value of an input stream is an integer
between 0 and a fixed value s(n, m), which depends on the input stream length and
its values. This result size is specific to each streaming problem.
We will describe a few examples of how s(n, m) could be defined to illustrate
its usage. To find a maximum number and its position within the input stream,
s(n, m) = max{n, m} is a reasonable result size, as higher values are not possible. To
identify the most frequent item within the input streams, s(n, m) = n seems suitable,

3.1. Streaming Problems and Algorithms
9
as an item cannot occur more often than n times. To sum up all numbers within
the input stream, s(n, m) = nm is a meaningful result size. Now, the feasible output
values of an input stream are any possible counting values between 0 and s(n, m).
The problem function is just the counting function on the input stream, which
is the core of a certain counting problem. The goal of a streaming algorithm is to
produce this optimal value or, in some cases, a good approximation of this optimal
value. We can capture this in the following definition.
Definition 3.2. A streaming counting problem S
#
= (X , Y, f, s) is a streaming
problem S = (X , Y, f ) together with a result-size function s(n, m) N
+
with the
following additional constraints: The input streams x = (x
1
, . . . , x
n
) X have input
values x
i
{1, . . . , m}, with m N
+
. For every feasible input stream x X , the
possible output values are y Y
x
with y {0, . . . , s(n, m)}. The problem function
defines, for every input stream x X , exactly one counting value y Y
x
as the
optimal counting result.
An example of a streaming counting problem is the frequency moment.
Definition 3.3. For every k N, we define the frequency moment F
k
of a
streaming counting problem S
#
= (X , Y, f, s) as
F
k
:=
m
j=1
(f
j
)
k
,
where f
j
counts the frequency of the item j in the input stream, namely,
f
j
:=
i | 1 i n and x
i
= j
.
F
0
counts the number of distinct items within the input stream, F
1
counts the
input stream length, namely F
1
(x) = n and F
is defined as the most frequent item,
namely, F
:= max
m
j=1
f
j
.
Note that, indeed, we have a relation between the general definition of the
frequency moment f
k
for f N and F
because
k
F
k
---
k
F
= max
1jm
f
j
.
Besides the streaming counting problem, the second special case is the streaming
decision problem. The only constraint for this streaming problem type is that the
problem function defines exactly one feasible output value as correct. A general
streaming problem might have several optimal output values, as, e.g., the streaming
problem answering the question: Which is the shortest path from one point to
another? Therefore, the streaming decision problem is defined as follows.
Definition 3.4. A streaming decision problem S
decision
= (X , Y, f ) is a stream-
ing problem with the constraint that the problem function defines, for every possible
input stream, exactly one feasible output value as the correct one.

10
Chapter 3. Definitions
A binary streaming decision problem allows only two different feasible output
values, namely Y
x
:= {0, 1}. For every input stream, either 0 or 1 is the correct
decision.
With the general and the two specialized definitions of a streaming problem, we
will now discuss streaming algorithms, which may solve a certain streaming problem.
Just like any classical algorithm, a streaming algorithm also produces a feasible
output value for every possible input stream. As known from the introduction,
streaming problems have massive data streams, which can only be read once, as
the size of the entire input generally exceeds the storage limit. Now, a streaming
algorithm can be understood as a series of n update-algorithm computations, in
which every update-algorithm computation processes one input value and modifies
the intermediate storage. This intermediate storage is a binary string of maximal size
d N
+
that has in total 2
d
different states. Of course, a concrete implementation of
a streaming algorithm does not have to store all d bits, e.g., when the first half of the
bit string is only containing zeros. Therefore, the set of all storage states contains
the empty bit string , the bit string "1" and all possible combinations of bit strings
of length at most d which have a "1" at the beginning. Formally, this cache set is
defined as:
C = {, 1} (1, x) | x {0, 1}
i
and 1 i d - 1 .
As required, the different storage states have a size of 2
d
, because,
|C| = |{, 1}| +
d-1
i=1
2
i
= 2 + (2
d
- 2) = 2
d
.
The initial cache state is the empty bit string . During the algorithm execution,
this cache can be used to store intermediate results. Finally, after considering all
n input values, an output algorithm transforms the last cache state into a feasible
output value of the streaming problem. This leads to the following definition of a
streaming algorithm.
Definition 3.5. A streaming algorithm A = (A
update
, A
output
) for a streaming
problem S = (X , Y, f ) contains an update algorithm A
update
such that A
update
(c
i-1
, x
i
) =
c
i
and an output algorithm A
output
such that A
output
(c
n
) = y Y.
For an input stream x X , the algorithm A computes a feasible output value
y = A(x) using the following approach: First, the cache c
0
C is empty, namely
c
0
= . Iteratively, all n inputs values are processed with the update algorithm
A
update
, namely, for all i (1, . . . , n), c
i
:= A
update
(c
i-1
, x
i
). Then, the output value
y Y
x
is generated as y := A
output
(c
n
).
This means, the streaming algorithm A computes a series of intermediate cache
storages c
i
C.
The first update algorithm computation processes an empty
cache state c
0
= and the first input value x
1
(x
1
, . . . , x
n
) = x, namely
A
update
(c
0
, x
1
) = c
1
. The next update computation processes the second input
value, namely A
update
(c
1
, x
2
) = c
2
. This is analogously repeated, until the last input
value is processed, namely A
update
(c
n-1
, x
n
) = c
n
.
We write A
k
update
(c
i
, x) = c
i+k
to indicate that the update algorithm is executed
k times on the starting cache state c
i
, using the input values (x
i+1
, . . . , x
i+k
). The

3.2. Determinism and Randomization
11
input values have to be a substream of the input stream, namely (x
i+1
, . . . , x
i+k
) is
a substream of (x
1
, . . . , x
n
) = x. One can therefore conclude that,
A(x) = A
output
(c
n
) = A
output
(A
n
update
(c
0
, x)) = A
output
(A
n
update
(, x)).
We say that a streaming algorithm A solves a certain streaming problem S if A
is a streaming algorithm for S as defined above. This implies that for every input
stream x X of the streaming problem, the streaming algorithm computes a feasible
output value, namely A(x) Y
x
for all x X .
We assume, that the streaming algorithm knows the input stream size n.
The definition of streaming problem and streaming algorithm are leaned on the
idea by Komm [
], Prakash [
], and Muthukrishnan [
]. The definition
of the frequency moment is based on the results by Alon et al. [
3.2
Determinism and Randomization
Streaming problems can be solved by streaming algorithms that are either deter-
ministic or probabilistic. For any input stream, a deterministic streaming algorithm
produces the same intermediate cache states and final results in every computation.
Therefore,
defines a deterministic streaming algorithm A
det
. The
next definition will demonstrate the essential difference between deterministic and
probabilistic streaming algorithms.
Probabilistic algorithms use random decisions in their computations. Random-
ization is modelled by a random bit string. This random bit string contains a
finite number of random bits, which are used for the algorithm's computation. The
number of random bits depends on the input stream size n. Therefore, we define
the length of the random bit string of at most d(n) N
+
. For the computation of
a probabilistic algorithm A
prob
, both update and output algorithm may use some
random bits b B =
d(n)
l=0
{0, 1}
l
. The number of consumed random bits at an
update step or in the output algorithm can vary, based on the input stream and on
previous random decisions. But the sum of all consumed random bits is limited by
d(n). The consequence of the use of these random bits is that the algorithm does
not necessarily always produce the same cache states or the same final results. We
define a probabilistic streaming algorithm as follows.
Definition 3.6. A streaming algorithm A
prob
for a streaming problem S = (X , Y, f )
is called probabilistic if the algorithm uses, within the update or output computations,
a random bit string b of length l d(n) N
+
, namely b {0, 1}
l
. The term A
prob
(x)
defines a probability distribution over all feasible output values y Y
x
, namely, for
all y Y
x
, we have 0 Pr[y = A
prob
(x)] 1 and
yY
x
Pr[y = A
prob
(x)] = 1.
Note, that Pr[. . . ] is the used notation to indicate the probability of a certain
event according to the given probability distribution.
As described, probabilistic algorithms do not necessarily always produce the
same results for a fixed input stream x X . We define the deterministic streaming
algorithm A
b
that uses an already generated random bit string b and produces the
output A
b
(x) = A(x, b) = y Y
x
. One can see that A
prob
(x) computes the same

12
Chapter 3. Definitions
probability distribution as the set of deterministic algorithms A(x, b) where the
bit string b B is chosen uniformly at random. Using a random bit string of size
l d(n), the probability of producing a certain output value y Y
x
is defined by the
probability distribution and can be calculated by identifying the number of different
random bit strings which lead to the corresponding output value. Namely,
Pr[y = A
prob
(x)] =
|{b | b B and y = A(x, b)}|
2
l
.
For a probabilistic streaming algorithm A
prob
for a decision problem S
decision
,
the two special cases one-sided and two-sided Monte Carlo algorithm are defined.
Monte Carlo algorithms require the streaming decision problem to have binary
decision values, i.e., there are only two possible decision values Y
x
= {0, 1}. In the
following, these two types of algorithms are introduced.
Monte Carlo Algorithms
We call a probabilistic algorithm A
prob
for a binary streaming decision problem
S
decision
, in which there are only two decision possibilities, namely Y
x
= {0, 1} for all
input streams x X , to be a Monte Carlo algorithm if we have a certain required
success probability. This probabilistic streaming algorithm has to identify the correct
decision with a probability of at least
1
2
. We further distinguish between the one-sided
Monte Carlo algorithm and the two-sided Monte Carlo algorithm.
A one-sided Monte Carlo algorithm allows false positives, but never false negatives.
This means that if the correct decision is one (f (x) = 1), then the algorithm always
has to indicate this result. However, if the correct decision is zero (f (x) = 0), then
the algorithm is required to produce this result at least in every second computation
on average. This leads to the following definition.
Definition 3.7. Let A
prob
be a probabilistic streaming algorithm for a binary stream-
ing decision problem S
decision
with Y
x
= {0, 1}. We call A
prob
a one-sided Monte
Carlo algorithm if, for all x X with f (x) = 1, Pr[f (x) = A
prob
(x)] = 1 and for
all x X with f (x) = 0, Pr[f (x) = A
prob
(x)]
1
2
.
Two-sided Monte Carlo algorithms are similar to one-sided algorithms, but now
both false positives and false negatives are allowed. However, the probability of
identifying the correct decision has to be at least
2
3
. Therefore, we have the following
definition.
Definition 3.8. Let A
prob
be a probabilistic streaming algorithm for a binary stream-
ing decision problem S
decision
with Y
x
= {0, 1}. We call A
prob
a two-sided Monte
Carlo algorithm if, for all x X , Pr[f (x) = A
prob
(x)]
2
3
.
The definition of a probabilistic streaming algorithm is leaned on the results by
Hromkovic [
], Definition 2.5.5.1 on randomized protocols. The definition on 1-
and 2-sided Monte Carlo algorithms are similar to the Definition 2.5.5.4 in [

3.3. Space and Time Complexity
13
3.3
Space and Time Complexity
A streaming algorithm A for a streaming problem S can be analyzed with respect to
its space and time complexity. The space complexity states the maximal number of
required bits in the storage over the entire computation. As stated in the definition of
a streaming algorithm (
), the algorithm is a composition of an update
algorithm, which is executed n times for every input value and an output algorithm.
Each update algorithm computation produces an intermediate result c
i
C, which is
stored in the cache. The space complexity is the maximal number of bits required to
store the intermediate results and the output value in the cache. We use the notation
of |c
i
| and |y| to represent the required number of bits to store the content of the
cache state c
i
and the output value y. Note, that |c
i
| and |y| are not the absolute
values, as both do not have to be numbers. Formally, we define the space complexity
of a streaming algorithm A and a certain input stream x X as follows,
space(A, x) = max |c
1
|, |c
2
|, . . . , |c
n-1
|, |y|
= max |A
update
(, x
1
)|, |A
2
update
(, x)|, |A
3
update
(, x)|,
. . . , |A
n
update
(, x)|, |A(x)| .
The overall space complexity of a streaming algorithm is simply the worst case
space complexity on all possible input streams with the corresponding stream length
n, i.e.,
space(A) = max space(A, x) | x X .
With the space complexity analysis of a streaming algorithm, one has an upper
bound of the space complexity of the streaming problem S. The space complexity
of the streaming problem is the space complexity of the best streaming algorithm
solving the problem, i.e.,
space(S) = min space(A) | A solves S .
We conclude with this definition of the space complexity approximation.
Definition 3.9. For any streaming algorithm A that solves S, we define the space
complexity as follows.
space(S) space(A) = max space(A, x) | x X S .
We have seen how the space complexity of a streaming algorithm A can be
evaluated. On the other hand, the time complexity indicates the required computation
time. Time complexity analysis is always done using the big-O notation. Basic
mathematical operations such as comparison, addition, or multiplication on values
with a fixed size are made within O(1) time steps. These basic mathematical
operations on numbers relative to n, i.e., i {1, . . . , n}, require a time complexity
of O(log(n)). Similarly, computations on the input values of a streaming counting
problem with input values x
i
{1, . . . , m} have a time complexity of O(log(m)).

14
Chapter 3. Definitions
Read or write requests to the storage of (current) size k require O(log(k)) time steps
for a random-access storage.
With these time measures, the time complexities of the update and output-
algorithms can be identified. Formally, time(A
update
) and time(A
output
) are the sum
of the individual basic time measures. The update time complexities indicate the
required time for processing a single input value. For the time complexity analysis
of a streaming algorithm, we are interested in two different measures. First, the
worst-case time complexity for a single update algorithm basically indicates how fast
the individual input values of an input stream can be evaluated or, in other words,
how fast they might arrive.
Definition 3.10.
We define the update time complexity of a streaming algorithm
as the worst-case time complexity of a single update algorithm execution, which is,
update-time(A) = max{time(A
update
(c
i-1
, x
i
)) |
1 i n, c
i-1
C and x
i
x X }.
Second, the overall time complexity is the sum of all update time complexities
and the output time complexity. For a fixed input stream x X , this is:
time(A, x) =
n
i=1
time(A
update
(c
i-1
, x
i
))
+ time(A
output
(c
n
)).
We state the second definition on the time complexity.
Definition 3.11. The overall time complexity is the worst-case time complexity
for any possible input streams x X :
time(A) = max{time(A, x) | x X }.
For probabilistic algorithms, the worst-case time complexities state the maximal
required time for any possible random decision b B.
3.4
Approximation Ratio and Success Probability
In this section, the approximation ratio for streaming counting problems is introduced,
which defines the approximation of an output value, compared to the optimal output
value. Afterwards, the success probability for probabilistic algorithms is defined
that renders the probability that the algorithm produces the optimal output value
or one within an acceptable approximation ratio. With approximation ratio and
success probability, the deterministic, exact streaming problem can be modified to
get a deeper analysis and a more differentiated understanding of the usefulness of
additional information.
Approximation Ratio of a Streaming Counting Problem
A streaming algorithm A (a deterministic or probabilistic one) for a streaming
counting problem S
#
(see
) may produce results with a tolerated

3.4. Approximation Ratio and Success Probability
15
approximation of the optimal output value. We will first introduce how one can
identify the approximation ratio of a given streaming algorithm for a streaming
maximization problem.
Basically, the approximation ratio stands for the fraction between the optimal
result and the output value produced by the streaming algorithm. The approximation
ratio may have any value R, 1, where an approximation ratio of = 1
implies an optimal result. In more general terms, the lower the approximation ratio,
the more accurate the result.
Suppose we are given a deterministic streaming algorithm A
det
that solves a
streaming counting problem S
#
= (X , Y, f, s). Then, for a certain input stream
x X and the output value A
det
(x) = y Y
x
, we define the approximation value
for this specific input stream as the fraction between the produced output value and
the optimal value, namely,
approx(A
det
, x) = max
f (x)
A
det
(x)
,
A
det
(x)
f (x)
.
The max-clause is used for both the approximation value below the optimal
output value (first term) and the approximation values above the optimal output
value (second term). We observe that if and only if the streaming algorithm pro-
duces the optimal value, the approximation value is 1. Furthermore, we define the
approximation ratio over all input streams as follows.
Definition 3.12. Let A
det
be a deterministic streaming algorithm for a streaming
counting problem S
#
= (X , Y, f, s). The approximation ratio is the maximal
approximation value among all possible input streams, which is,
approx-rate(A
det
, X ) = max{approx(A
det
, x) | x X }
= max
f (x)
A
det
(x)
,
A
det
(x)
f (x)
x X .
We observe that this definition of the approximation ratio may always allow
approximative output values both above and below the optimal output value. Some-
times, only a 1-sided approximation with feasible output values below the optimal
result is allowed. Then, we define the approximation value only with output values
below the optimal result as valid. This is formally defined as follows.
Definition 3.13. Let A
det
be a deterministic streaming algorithm for a streaming
counting problem S
#
= (X , Y, f, s). The 1-sided approximation ratio is the
maximal 1-sided approximation value among all possible input streams, which is,
1-sided-approx-rate(A
det
, X ) = max{1-sided-approx(A
det
, x) | x X } with
1-sided-approx(A
det
, x) =
f (x)
A
det
(x)
,
if A
det
(x) f (x)
,
else.
(3.1)
This definition of the 1-sided approximation ratio assumes that the streaming
counting problem is a maximization problem. With the second case of
, we
ensure feasible output values below the optima one, as the approximation ratio is
infinite and, as a consequence, useless.

16
Chapter 3. Definitions
Corollary 3.1. For a certain deterministic streaming algorithm A
det
that solves a
streaming counting problem S
#
with an approximation ratio approx-rate(A
det
, X ) = 1,
A
det
always produces an optimal output result for every possible input stream x X .
For a probabilistic streaming algorithm A
prob
, we can determine a success proba-
bility for any fixed approximation ratio 1, which is introduced in the following.
For a single computation, which uses a certain random bit string b B, we can
similarly define the approximation value,
approx(A
prob
, x, b) = max
f (x)
A
prob.
(x, b)
,
A
prob
(x, b)
f (x)
.
Based on this approximation value of a single computation, we will see in the
following how one can evaluate the lowest approximation ratio for a fixed success
probability such that the success probability is still given. Some streaming algorithms
have the possibility to produce an arbitrarily good approximation ratio = 1 + for
any small 0.
Success Probability
Besides the approximation ratio, we introduce the success probability. For a proba-
bilistic streaming algorithm A
prob
that solves a certain streaming counting problem
S
#
, we define the success ratio as the probability that A
prob
produces a result within
the tolerated approximation ratio. We define, for a fixed input stream x X , the set
of all output values within the approximation ratio 1 as ~
Y
x
, which is,
~
Y
x
=
y | y Y
x
and
f (x)
y f (x) · .
Given a probabilistic streaming algorithm A
prob
for a streaming counting problem
S
#
and a fixed approximation ratio 1, the success probability to produce
an output value within the tolerated approximation is the sum of all individual
probabilities, that A
prob
outputs any ~
y
x
~
Y
x
. Therefore,
success(A
prob
, x, ) = pr
f (x)
A
prob
(x) f (x) ·
=
~
y
x
~
Y
x
pr A
prob
(x) = ~
y
x
.
Similarly to the approximation ratio, we can define the success probability of a
probabilistic streaming algorithm for a streaming counting problem as the lowest
success ratio among all possible input streams.
Definition 3.14. For a probabilistic streaming algorithm A
prob
solving a streaming
counting problem S
#
= (X , Y, f, s) and an approximation ratio 1, the success
probability for the streaming counting problem is defined as,
success-probability(A
prob
, X , ) = min{success(A
prob
, x, ) | x X }.

3.5. Solvability
17
If and only if the success probability is 1, then the probabilistic streaming algo-
rithm always produces a result within the allowed approximation ratio. We use the
notation S
,p
to indicate a streaming problem with an approximation ratio of 1
and a success probability of p 1.
Similarly to the case of a streaming counting problem, we can define the success
probability for a probabilistic streaming algorithm A
prob
on a streaming decision
problem S
decision
. As we do not have approximation ratios because streaming decision
problem just distinguish between correct and wrong decisions, the success probability
just measures the minimum probability to produce the correct result over all possible
input streams.
Definition 3.15.
For a probabilistic streaming algorithm A
prob
solving a streaming
decision problem S
decision
= (X , Y, f ), the success probability for the streaming
decision problem is defined as,
success-probability(A
prob
, X ) = min Pr[A
prob
(x) = f (x)] | x X .
As stated before, studies of approximation ratios of 1 for streaming decision
problems are meaningless. Here, we are only interested in the success probabilities of
exact solutions using the same formulas as above but simply with = 1. Generally,
a probabilistic streaming algorithm A
prob
for a streaming decision problem S
decision
is required to have a fixed success probability of at least p
1
2
. Otherwise it is
possible that a wrong decision may be amplified more often than the correct one. An
exception is the 1-sided Monte Carlo algorithm, as there are only two possible values
and the wrong decision cannot be amplified, as false negatives are not allowed.
In a deterministic environment, the study of different success probabilities other
than 1 is obviously not meaningful, as deterministic streaming algorithms do not use
any random bits.
3.5
Solvability
In this section, some measures are introduced by which a certain streaming algorithm
can be identified as practical or useful. As explained in the introduction, the algorithm
cannot store the entire input stream and compute the entire function in the last final
output algorithm. It was also stated that the required space complexity should be
sub-linear. But what does this mean in detail?
We will recall the definition from the well-known complexity class POLYLOG and
use this measure to classify the required space complexity of a streaming algorithm
as space-efficient (if it is poly-logarithmic) or not space-efficient (if it is not poly-
logarithmic). If a space-efficient algorithm for a certain streaming problem has a
poly-logarithmic time complexity for each update algorithm computation, we then
call the corresponding streaming problem efficiently solvable. If we can prove that
any possible streaming algorithm that solves a certain streaming problem has either
a not space-efficient space complexity or any update algorithm computation has
a time complexity outside the poly-logarithmic class, we then call this streaming
problem not efficiently solvable.

18
Chapter 3. Definitions
Later, this classification of streaming problems into efficiently solvable and not
efficiently solvable will be very helpful to distinguish different streaming problems
and the possible advantage of additional information like hypothesis verification.
Now we will introduce these categories more formally.
First, we recall: A streaming algorithm A is said to solve a streaming problem
S
,p
with an approximation ratio 1 and a success probability p 1 if, for every
possible input stream x X , it writes an output value within the approximation
ratio with at least the according success probability. Streaming decision problems
always have an approximation ratio of = 1 to ensure correct decisions.
As a basis: The poly-logarithmic complexity class POLYLOG(n) includes all al-
gorithms A, where there exists a d N
+
, such that time-complexity(A) O(log
d
(n))
and space-complexity(A) O(log
d
(n)). The word polylogarithmic is standardized by
NIST and , for example, listed in the dictionary of algorithms and data structures
by Sant [
Definition 3.16. A streaming algorithm A for a streaming problem S = (X , Y, f )
with input streams x X of length n = |x| is said to be space-efficient, if
space(A, X ) POLYLOG(n). On the other hand, if space(A, X ) POLYLOG(n),
the streaming algorithm A for S is called not space-efficient.
A streaming problem S is said to be efficiently solvable if there exists a space-
efficient streaming algorithm A that solves the streaming problem S and each update
algorithm computation has time(A
update
) POLYLOG(n) and the output algorithm
has time(A
output
) POLYLOG(n).
A streaming problem S is said to be not efficiently solvable if, for any possible
streaming algorithm A that solves S, either space(A, X ) POLYLOG(n) or there
exists an update algorithm computation such that time(A
update
) POLYLOG(n).
Of course, to prove that a streaming problem is not efficiently solvable by
verifying that the time complexity of a certain update algorithm is outside the
poly-logarithmic class is very difficult, as hardly any time-complexity lower bound
proofs exist. Therefore, the later presented proofs of streaming problems being not
efficiently solvable are made by disproving space-efficiency.

3.6. Basic Calculations
19
3.6
Basic Calculations
For several analyses, the binomial coefficient
n
k
has to be approximated. We will
provide the used approximation at this point, so that this proof and calculation
does not have to be repeated. The main content of this approximation is described
in [
] in Exercise 2.151, a) and b).
Lemma 3.1 (Dobrushkin [
+
with
0 k n:
n
k
k
n
k
n
n
k
k
· (n - k)
n-k
.
Lemma 3.2 (Dobrushkin [
+
,
2
n
2n
n
n/2
2
n
1.5n + 1
.
Lemma 3.3. For any n N
+
,
n
n/2 - 1
2
n
·
2n
2n + 4
.
Proof. We prove this lemma by starting from the result of
, which is
2
n
2n
n
n/2
. Using a simple relation between
n
n/2-1
and
n
n/2
will lead to the above
stated approximation:
n
n/2 - 1
=
n!
(n/2 - 1)! · (n/2 + 1)!
definition of binomial coefficient
=
n! · (n/2)
(n/2)! · (n/2)! · (n/2 + 1)
basic transformation
=
n
n/2
·
n/2
n/2 + 1
definition of binomial coefficent
Therefore, we can replace the left side of the claim using the equality,
n
n/2 - 1
=
n
n/2
·
n/2
n/2 + 1
transformation from above
=
n
n/2
·
n
n + 2
basic transformation
2
n
· n
2n · (n + 2)
=
2
n
·
2n
2n + 4
.
basic transformation
Next, we want to approximate
n
n/4
. The proof for this approximation is a bit
more complicated, as the stated approximation for the general case
n
k
leads to a very bad approximation, namely,
n
n
4
n
n
4
n
4
= 4
n
4
= 2
0.5n
2
0.915n
n
n
4

20
Chapter 3. Definitions
With the approach of the following, hand-made proof, we can achieved the last
approximation, namely, 2
0.915n
and one can similarly verify the referenced exercise
from the literature used in
and
Lemma 3.4. For any n 100, n = 4 · l, l N
+
,
n
n/4
0.916
n
· 2
n·(
1
2
+log
2
(4/3))
0.916
n
· 2
0.915n
.
Proof. First, we have to recall the Stirling formula for the approximation of the
faculty,
2n ·
n
e
n
n!
2n ·
n
e
n
· e
1
12n
.
With the Stirling formula we get the following approximation.
n
n/4
=
n!
n
4
! · (
3n
4
)!
(3.2)
2n · (
n
e
)
n
2
n
4
· (
(
n
4
)
e
)
n
4
· e
4
12·n
·
2
3n
4
· (
3n
4
e
)
3n
4
· e
4
12·3n
(3.3)
=
2n · (
n
e
)
n
n
2
· (
n
4e
)
n
4
· e
1
3n
·
3n
2
· (
3n
4e
)
3n
4
· e
1
9n
(3.4)
=
2n · (
n
e
)
n
3
2
n · (
n
4e
)
n
· 3
3n
4
· e
4
9n
=
2n·n
n
e
n
3n·n
n
·3
3n
4
·e
4
9n
2·(4e)
n
(3.5)
=
2n · n
n
· 2 · (4e)
n
e
n
·
3n · n
n
· 3
3n
4
· e
4
9n
=
2n · 2 · 4
n
3n · 3
3n
4
· e
4
9n
(3.6)
After replacing the binomial coefficient with the corresponding faculties
, we
approximate these faculties using the Stirling approximation from above
and
are some basic mathematical transformations to simplify the term. The
end of
leads to the possibility of transforming the double fraction to a simple
fraction in
, in which in the last term equal parts are cancelled.
Some parts of this remaining term can now be simply approximated. These are
2 2.506, and
3 · 5.442. Furthermore, for any n 100, one can easily
verify that e
4
9n
1.005. Therefore, we can use this approximations to make the term
simpler:
n
n/4
2n · 2 · 4
n
3n · 3
3n
4
· e
4
9n
approx. as above
2 · 2.506 ·
n · 4
n
5.442 · n · 3
3n
4
· 1.005
using new single approx.
0.916 ·
n · 4
n
n · 3
0.75n
basic transformations
= 0.916 ·
n · 4
0.25n
n
·
4
0.75n
3
0.75n
split 4
n
into 4
0.25n
· 4
0.75n

3.6. Basic Calculations
21
= 0.916 ·
2
0.5n
n
·
4
3
0.75n
basic transformations
Now, one can transform
4
3
= 2
log
2
(
4
3
)
2
0.415
. With this approximation we can
prove the statement of the lemma:
n
n/2
0.916 ·
2
0.5n
n
·
4
3
0.75n
approx. as above
=
0.916
n
· 2
n·(
1
2
+log
2
(4/3))
replace
4
3
as explained
0.916
n
· 2
0.915n
final, basic transformation

Details

Pages
Type of Edition
Erstausgabe
Year
2016
ISBN (PDF)
9783960675945
ISBN (Softcover)
9783960670940
File size
8.6 MB
Language
English
Publication date
2016 (December)
Keywords
Streaming Algorithm Frequency Moment Space Complexity Lower Bound Communication Complexity Hypothesis Verification Streaming Problem Additional Information

Author

Raffael Buff, born in 1990, holds a BSc and a MSc in Computer Science from the Swiss Federal Institute of Technology/ETH Zürich. His studies focused on topics such as Machine Learning, Big Data, Approximation Algorithms, Software Engineering and Distributed Systems. While studying, the author served as a project manager at ETH juniors – a student run consulting agency at ETH Zürich – from 2010 to 2012 and was head of the department IT at ETH juniors from 2011 to 2012. He holds the certificates ITIL Foundation and HERMES Advanced. On a sidenote, he also was Swiss Champion in Online Sudoku in 2006. Currently, the author works as a Business Consultant for an ICT company in Zürich, Switzerland.
Previous

Title: Using Additional Information in Streaming Algorithms
book preview page numper 1
book preview page numper 2
book preview page numper 3
book preview page numper 4
book preview page numper 5
book preview page numper 6
book preview page numper 7
book preview page numper 8
book preview page numper 9
book preview page numper 10
book preview page numper 11
book preview page numper 12
book preview page numper 13
book preview page numper 14
book preview page numper 15
book preview page numper 16
book preview page numper 17
book preview page numper 18
book preview page numper 19
book preview page numper 20
book preview page numper 21
book preview page numper 22
book preview page numper 23
book preview page numper 24
book preview page numper 25
book preview page numper 26
128 pages
Cookie-Einstellungen