Using Additional Information in Streaming Algorithms

Buff, Raffael

Using Additional Information in Streaming Algorithms

Summary

Streaming problems are algorithmic problems that are mainly characterized by their massive input streams. Because of these data streams, the algorithms for these problems are forced to be space-efficient, as the input stream length generally exceeds the available storage.
The goal of this study is to analyze the impact of additional information (more specifically, a hypothesis of the solution) on the algorithmic space complexities of several streaming problems. To this end, different streaming problems are analyzed and compared. The two problems “most frequent item” and “number of distinct items”, with many configurations of different result accuracies and probabilities, are deeply studied. Both lower and upper bounds for the space and time complexity for deterministic and probabilistic environments are analyzed with respect to possible improvements due to additional information. The general solution search problem is compared to the decision problem where a solution hypothesis has to be satisfied.

Excerpt

Contents

List of Figures

vii

List of Tables

vii

Introduction

1.1

Overview and Structure of the Thesis

. . . . . . . . . . . . . . . . .

Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . .

Related Work

Definitions

3.1

Streaming Problems and Algorithms . . . . . . . . . . . . . . . . . .

3.2

Determinism and Randomization . . . . . . . . . . . . . . . . . . . .

3.3

Space and Time Complexity . . . . . . . . . . . . . . . . . . . . . . .

3.4

Approximation Ratio and Success Probability . . . . . . . . . . . . .

Approximation Ratio of a Streaming Counting Problem . . . . . . .

Success Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.5

Solvability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.6

Basic Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Lower Bounds on Space Complexity

4.1

Introduction to Communication Complexity . . . . . . . . . . . . . .

4.2

Lower Bound for General Streaming Problems . . . . . . . . . . . . .

Lower Bound for an Exact Deterministic Streaming Problem

. . . .

Lower Bound for an Exact Probabilistic Streaming Problem . . . . .

Lower Bound for an Approximative Deterministic Streaming Problem

Lower Bound for an Approximative Probabilistic Streaming Problem

4.3

Lower Bounds for Hypotheses Verifications

. . . . . . . . . . . . . .

Most Frequent Item Problem

5.1

General Streaming Problem Analysis . . . . . . . . . . . . . . . . . .

Exact Deterministic Most Frequent Item Problem . . . . . . . . . . .

Exact Probabilistic Most Frequent Item Problem . . . . . . . . . . .

Approximative Deterministic Most Frequent Item Problem

. . . . .

Approximative Probabilistic Most Frequent Item Problem . . . . . .

Summary of the General Most Frequent Item Problem . . . . . . . .

5.2

Hypothesis Verification Analysis

. . . . . . . . . . . . . . . . . . . .

5.3

Analysis Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . .

Number of Distinct Items Problem

6.1

General Streaming Problem Analysis . . . . . . . . . . . . . . . . . .

Exact Deterministic Number of Distinct Items Problem . . . . . . .

Exact Probabilistic Number of Distinct Items Problem . . . . . . . .

Approximative Deterministic Number of Distinct Items Problem . .

Approximative Probabilistic Number of Distinct Items Problem . . .

Summary of the General Number of Distinct Items Problem . . . . .

100

6.2

Hypothesis Verification Analysis

. . . . . . . . . . . . . . . . . . . .

102

Verification of All Possible Hypotheses . . . . . . . . . . . . . . . . .

102

Justification of One Correct Hypothesis . . . . . . . . . . . . . . . .

106

6.3

Analysis Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . .

107

Conclusion

109

7.1

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

109

7.2

Conclusion

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

111

7.3

Future Work

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

111

List of Figures

1.1

Thesis Overview

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5.1

Illustration of the space complexity of the exact deterministic most

frequent item problem . . . . . . . . . . . . . . . . . . . . . . . . . .

5.2

Illustration of the lower bound factor . . . . . . . . . . . . . . . . . .

5.3

Illustration of h(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.1

Illustration of the probability distribution of r

max

. . . . . . . . . . .

6.2

Illustration of the probability gap of r

max

. . . . . . . . . . . . . . .

List of Tables

5.1

Summary of general most frequent item problem . . . . . . . . . . .

6.1

Summary of general number of distinct items problem . . . . . . . .

101

vii

Chapter 1

Introduction

The title of this thesis, Using Additional Information in Streaming Algorithms,

presses for three simple questions: What is a streaming algorithm? What is meant

by the all-encompassing additional information? And what can additional informa-

tion be used for in streaming algorithms? We address these questions in the following.

What is a streaming algorithm?

A streaming algorithm processes massive data

streams to compute a certain function on these data. From a practical point of view,

a data stream means that these data are successively generated and not completely

known upfront. The streaming algorithm processes the generated data piecewise to

calculate intermediate and final results for a given streaming problem. Furthermore,

streaming algorithms may only read the streaming data once because, generally, they

cannot store the entire data stream, as the massive stream length often exceeds the

possible storage capacity. Facing these conditions, the basic question is whether a

streaming problem can be solved within practical time and space boundaries or not.

Generally, a poly-logarithmic space complexity in relation to the data stream size is

considered practical. The time needed to process a current data value is also required

to be low, as otherwise the new values generated have to be cached in the meantime,

i.e., stored, which might on the other side exceed the allowed storage size because

queues build up when incoming data stream exceeds processing capacity. The study

of streaming problems includes both the identification of streaming algorithms solving

the streaming problem with a certain space and time complexity and the analysis of

lower bounds on the required intermediate storage size to successively calculate a

certain function on the data stream. The following examples illustrate the kind of

thoughts and analysis necessary to construct an algorithm that solves a streaming

problem.

As an example, a simple streaming problem is the identification of the maximal

integer value within a massive data stream of integers. Let us assume that this data

stream contains 10

numbers, each with any integer value between one and 10

. Of

course, one can easily solve this problem with an algorithm that simply tracks the

highest already known value. At the end, after reading all 10

values, this algorithm

outputs the highest observed value between one and 10

. This algorithm would only

require small storage to encode this highest value.

A slightly more complex example would be the calculation of the k-th highest

Chapter 1. Introduction

value within this same data stream. Once again one could store the already known,

now k highest values and would therefore use k times more storage than for the simple

streaming problem above. For small k, e.g., k < 100, this is generally still practical.

A more difficult streaming problem is, e.g., the calculation of the most frequent value,

i.e., the value which occurs the most within the data stream. Here, one cannot easily

generate a streaming algorithm using sub-linear storage. Furthermore, we will later

present a proof that shows that this problem requires at least linear storage.

With the technique of communication complexity, we will be able to prove space

complexity lower bounds, which imply that any possible algorithm that solves a

certain streaming problem requires at least a certain space size. If the identified

algorithm requires the same amount of bits for its storage as the proven lower bound,

we have found an optimal algorithm in relation to the space complexity. If there is a

gap between the upper and lower bound, it might be possible to find an algorithm

with a lower space complexity.

These types of analyses - finding time and space efficient algorithms to solve a

streaming problem and proving necessary space complexities - are the core of the

study of streaming algorithms. This leads to the second question:

What is meant by the all-encompassing additional information?

Addi-

tional information may help solve a certain streaming problem. One possible usage

of this additional information is when it represents a solution hypothesis. This means

that the streaming algorithm receives a solution hypothesis upfront, e.g., that the

number 20 appears within the data stream the most with a frequency of 39 times.

Now the streaming problem is transformed from a general solution search problem

to the decision problem of verifying whether this hypothesis is true for the given

input stream. For this decision problem of hypothesis verification, one can similarly

analyze the possible space and time complexities.

For the hypothesis verification, a streaming algorithm has to verify any possible

hypothesis for a certain input stream. This type of problem is relevant, if we have a

powerful but untrusted source. Then, we may calculate a solution hypothesis with

this source solving the general streaming problem and verify the hypothesis with an

eventually more space and time efficient algorithm.

What can additional information be used for in streaming algorithms?

The concepts of streaming problems with and without additional information are

compared. Namely, the possible algorithms and proofs for both the general solution

search problem and the hypothesis verification problem determine whether the

additional information is helpful, i.e., the required time and space complexities are

lower. In some cases, the additional information is useless for solving the streaming

problem with lower algorithmic complexities.

More specifically, this thesis introduces some concepts to evaluate, whether the

hypothesis verification is significantly easier than the general solution search problem.

Here, easier means that the algorithm to verify the hypothesis is faster or requires

less storage than the algorithm for the general solution search problem requires in the

optimal case. Additionally, some streaming problems are stated where the hypothesis

verification is equally difficult as the general streaming problem, which means that

the additional information is not useful for the streaming algorithm at all.

1.1. Overview and Structure of the Thesis

1.1

Overview and Structure of the Thesis

The goal of this thesis is to analyze the impact of additional information (more

specifically, a hypothesis of the solution) on the algorithmic space complexities of

several streaming problems.

To this end, different streaming problems are analyzed and compared. The two

problems most frequent item and number of distinct items, with many configurations

of different result accuracies and probabilities, are deeply studied. Both lower and

upper bounds for the space and time complexity for deterministic and probabilistic

environments are analyzed with respect to possible improvements due to additional

information. The general solution search problem is compared to the decision problem

where a solution hypothesis has to be satisfied.

Figure 1.1

illustrates this approach, which is explained in more detail in the

following. Based on this illustration, the following different layers are described:

Streaming problems (gray boxes), decision vs. search problems (dark blue boxes),

deterministic vs. probabilistic settings (light blue boxes), exact solutions vs. ap-

proximations (yellow boxes) and lower vs. upper bounds of algorithmic complexities

(orange boxes).

Figure 1.1. Thesis Overview

Streaming Problems:

The following streaming problems are analyzed. The goal

of the most frequent item problem is to identify the value which is most frequent

within a data stream x = (x

, x

, . . . , x

). The streaming problem number of distinct

items identifies the amount of different values, i.e., the size of the subset of the data

stream without any duplications. As sketched in the last section of this thesis, the

introduced approach and technique to analyze upper and lower bounds that are used

to analyze the two streaming problems can be used for further streaming problems

as well.

Chapter 1. Introduction

Decision vs. Search Problems:

Our aim is to prove or disprove the possible

benefits of additional information (in the form of a solution hypothesis) for solving

one of the mentioned streaming problems. Therefore, the best possible algorithms

with respect to space and time complexity are identified for both the search problem,

where the optimal solution has to be found and for the decision problem, where one

has to verify or refuse a certain solution hypothesis. One can verify that the decision

problem is not more complex than the search problem because, for the decision

problem, one can always compute the solution as in the search problem and compare

it to the hypothesis. Therefore, the question is, whether the decision problem (with

additional information, i.e., a hypothetical solution) has a smaller complexity, and if

so, by how much.

Deterministic vs. Probabilistic Approaches:

Deterministic algorithms differ

from probabilistic ones in both the result accuracy and the required time and

space complexity. Therefore, the impact of additional information (i.e., a solution

hypothesis) is compared for both deterministic and probabilistic algorithms.

Exact Solutions vs. Approximations:

Streaming optimization problems such

as the most frequent item problem may allow an approximation of the solution.

Identifying an exact solution and allowing approximative results yield fundamentally

different implications on the time and space complexities of its streaming algorithms.

Lower vs. Upper Bounds:

For a certain streaming problem and its configuring

setting (decision or search problem, deterministic or probabilistic approach, and

exact solution or approximation), the best possible algorithms with respect to space

and time complexity are identified. Concrete algorithms deliver upper bounds. Fur-

thermore, formal proofs lead to space complexity lower bounds for certain streaming

problems.

Structure of the Thesis

After this introduction,

Chapter 2

discusses an overview of the literature providing

similar topics.

Chapter 3

introduces and describes the relevant definitions.

Chapter 4

covers the approaches to prove space complexity lower bounds.

Chapter 5

and

Chapter 6

contain the detailed analysis of the most frequent item problem and the

distinct item problem. The summary of the results, the thesis conclusion and an

outlook on possible future work is given in

Chapter 7

Chapter 2

Related Work

In this chapter, we introduce some relevant papers, articles and books for the topic of

this thesis. We first state some related work on streaming problems and algorithms.

To prove space complexity lower bounds, we use the technique of communication

complexity. Therefore, we state some resources about communication complexity

and its usage on other problems. At last, we list five papers on the two analyzed

streaming problems, the most frequent item problem and the number of distinct items

problem.

Streaming Problems and Algorithms:

There are different resources addressing

the area of streaming problems and data streams. Aggarwal [

Agg07

] gives a brief

overview of the characteristics of a data stream and lists several interesting streaming

problems. Babcock et al. [

BBD

] study data streams from a more practical point

of view, including optimized database queries and time measures. A more formal and

theoretical analysis of some streaming problems is given by Muthukrishnan [

Mut05

The doctoral thesis by Prakash about Efficient Delegation Algorithms for Outsourcing

Computations on Massive Data Stream [

Pra15

] is a recommended literature because

it introduces the relevant parts of streaming problems, their characteristics and their

analysis including lower-bound proofs.

Communication Complexity:

For a significant analysis of streaming problems,

we have to prove space complexity lower bounds, besides designing efficient streaming

algorithms. For these proofs, we will use the technique of communication complexity,

which was originally introduced by Yao [

Yao79

]. The principles of communication

complexity are introduced in detail later on, namely, in

Chapter 4

. A more broad

and formal introduction is given by Hromkovic [

Hro97

]. A more practical resource

on communication complexity is given by Roughgarden [

Rou15

], which focusses on

communication complexity for algorithm designers. Halstenberg and Reischuk [

HR88

]

introduce different communication complexity models and Kremet et al. [

KNR01

]

focus on one of these models, the one-round communication complexity, which is also

the basis for the lower-bound proofs of the two analyzed streaming problems. Babai

et al. [

BFS86

] introduce different communication complexity classes. Yao [

Yao83

Paturi and Simon [

PS84

], as well as Ablayev [

Abl96

] wrote useful papers on proving

lower bounds in a probabilistic setting. For our lower bound proofs in a randomized

setting, we will use the theorems by Ablayev [

Abl96

Chapter 2. Related Work

Based on the communication complexity theory, there are different papers il-

lustrating the usage of this technique.

As an example, Kalyanasundaram and

Schnitger [

KS92

] analyze the probabilistic communication complexity of set intersec-

tion. The distributional complexity of disjointness is analyzed by Razborov [

Raz92

Munro and Paterson [

MP80

] focus on selection and sorting with limited storage

and the approximate counting of inversions in a data stream is studied by Ajatai

et al. [

AJKS02

]. Indyk and Woodruff [

IW05

] analyze the optimality of frequency

moment algorithms.

Most Frequent Item and Number of Distinct Items Problem:

The two

studied streaming problems of this thesis are analyzed in several resources. Carmode

and Hadjieleftheriou [

CH09

] as well as Manku and Motwani [

MM12

] give concrete

implementations of algorithms for the most frequent item problem. The paper

Space Complexity of Approximating the Frequency Moments by Alon et al. [

AMS99

]

introduces an efficient algorithm for the number of distinct items problems and

analyzes the space complexity lower bounds of both streaming problems for a

deterministic and probabilistic setting. Karp et al. [

KSP03

] as well as Trevisan and

Williams [

TW12

] state further lower bounds for the most frequent item problem for

some conditions of the input values.

The relevant results of these resources are gathered in detail in the corresponding

chapters of the streaming problems, namely in

Chapter 5

for the most frequent item

problem and in

Chapter 6

for the number of distinct items problem.

Chapter 3

Definitions

Before we can start analyzing concrete streaming algorithms with respect to space

and time complexity, we first have to formally define what a streaming algorithm is.

Yet before, a definition for a streaming problem has to be provided. Additionally,

in the following chapters we will often use special cases of streaming problems or

algorithms these special cases are also introduced here. Therefore, this chapter

contains several basic definitions for streaming problems and algorithms.

First, the general definitions for streaming problems, algorithms and their spe-

cial cases are given (

Section 3.1

Section 3.2

). Then, for streaming algorithms, we

introduce how the time and space complexities are identified (

Section 3.3

). The

problem characteristics success probability and approximation ratio are explained

(

Section 3.4

). Definitions on solvability (

Section 3.5

) and a few basic calculations for

further analysis (

Section 3.6

) conclude this chapter.

3.1

Streaming Problems and Algorithms

In this section, streaming problems and algorithms are introduced and formally

defined. First, the general definition of a streaming problem is explained and then,

two special cases are defined: The streaming counting problem and the streaming

decision problem. All analyzed streaming problems can be reduced to one of these

two. The general definition of the streaming problem is introduced to show the

similarity of these two cases and to avoid duplications of definitions and theorems

that are valid for both cases. These formal definitions are required to allow the

mathematical analysis of space complexity lower bounds.

As outlined in the introduction (

Chapter 1

), a streaming problem is mainly

characterized by the enormous size of the data stream as its input. As we will see in

the following definition, this input stream is defined as a sequence of n input values,

where n represents a very large number. The set of feasible output values defines

which outputs are allowed for every possible input stream. Finally, the problem

function defines which one of the feasible output values is the optimal or correct one

for a certain input stream. All of this together leads to the following definition.

Chapter 3. Definitions

Definition 3.1. A streaming problem S is a triple (X , Y, f ) containing a class

of input streams X , where each input stream x

X is a sequence of n input values,

i.e., x

= (x

, . . . , x

), with n N

. The second element of the triple is a class

of feasible output sets Y = {Y

, . . . , Y

} with d = |X |, where every input stream

X has a corresponding feasible output set Y

. Additionally, the streaming

problem contains a problem function f : X Y

, where, for every input stream

x X , one or more values from the feasible output set are indicated as optimal or

correct output value, namely f (x) Y

With this general definition of a streaming problem, one can either define a

classical optimization problem or a decision problem. For optimization problems, one

has to be defined how good non-optimal output values are, such that this metric can

be optimized. For a decision problem, the problem function can simply be interpreted

as determining the correct decisions.

As an illustration of this definition, the simple streaming problem from above

(

Chapter 1

), in which the task was simply to identify the maximum integer value

within the input stream, could be formally defined as follows.

· Input streams: Each input stream x = (x

, . . . , x

) X contains a series of n

integers between 0 and a fixed m N

, namely, x

{0, . . . , m}.

· Feasible output sets: For every input stream x X , the corresponding output

set allows every integer between 0 and m, namely Y

= {0, . . . , m} Y.

· Problem function: For every certain input stream x X , it defines the optimal

or correct output value from the feasible output set Y

= {0, . . . , m}, which is

f (x) = max{x

| 1 i n}.

With this definition of X , Y, and f , we defined a streaming problem S = (X , Y, f ),

that searches for the maximum value in an input stream. The introduced formalism

seems to be too complicated for such a simple problem. But for more complicated

problems, this level of formalism is required to identify algorithms that solve it and,

especially, to prove some space complexity lower bounds.

Now, we will distinguish two special cases of the general streaming problem, the

streaming counting problem and the streaming decision problem.

First, we will introduce the streaming counting problem. In counting problems,

we deal with series of integers as input streams and integers as output values, and

the aim is always to evaluate some counting function. Some examples of streaming

counting problems are the sum of all input values, their median, or the most frequent

item.

A streaming counting problem has input streams with n integer values from the

set from 1 to m N

. The resulting counting value of an input stream is an integer

between 0 and a fixed value s(n, m), which depends on the input stream length and

its values. This result size is specific to each streaming problem.

We will describe a few examples of how s(n, m) could be defined to illustrate

its usage. To find a maximum number and its position within the input stream,

s(n, m) = max{n, m} is a reasonable result size, as higher values are not possible. To

identify the most frequent item within the input streams, s(n, m) = n seems suitable,

3.1. Streaming Problems and Algorithms

as an item cannot occur more often than n times. To sum up all numbers within

the input stream, s(n, m) = nm is a meaningful result size. Now, the feasible output

values of an input stream are any possible counting values between 0 and s(n, m).

The problem function is just the counting function on the input stream, which

is the core of a certain counting problem. The goal of a streaming algorithm is to

produce this optimal value or, in some cases, a good approximation of this optimal

value. We can capture this in the following definition.

Definition 3.2. A streaming counting problem S

= (X , Y, f, s) is a streaming

problem S = (X , Y, f ) together with a result-size function s(n, m) N

with the

following additional constraints: The input streams x = (x

, . . . , x

) X have input

values x

{1, . . . , m}, with m N

. For every feasible input stream x X , the

possible output values are y Y

with y {0, . . . , s(n, m)}. The problem function

defines, for every input stream x X , exactly one counting value y Y

as the

optimal counting result.

An example of a streaming counting problem is the frequency moment.

Definition 3.3. For every k N, we define the frequency moment F

of a

streaming counting problem S

= (X , Y, f, s) as

j=1

)

where f

counts the frequency of the item j in the input stream, namely,

i | 1 i n and x

= j

counts the number of distinct items within the input stream, F

counts the

input stream length, namely F

(x) = n and F

is defined as the most frequent item,

namely, F

:= max

j=1

Note that, indeed, we have a relation between the general definition of the

frequency moment f

for f N and F

because

---

= max

1jm

Besides the streaming counting problem, the second special case is the streaming

decision problem. The only constraint for this streaming problem type is that the

problem function defines exactly one feasible output value as correct. A general

streaming problem might have several optimal output values, as, e.g., the streaming

problem answering the question: Which is the shortest path from one point to

another? Therefore, the streaming decision problem is defined as follows.

Definition 3.4. A streaming decision problem S

decision

= (X , Y, f ) is a stream-

ing problem with the constraint that the problem function defines, for every possible

input stream, exactly one feasible output value as the correct one.

Chapter 3. Definitions

A binary streaming decision problem allows only two different feasible output

values, namely Y

:= {0, 1}. For every input stream, either 0 or 1 is the correct

decision.

With the general and the two specialized definitions of a streaming problem, we

will now discuss streaming algorithms, which may solve a certain streaming problem.

Just like any classical algorithm, a streaming algorithm also produces a feasible

output value for every possible input stream. As known from the introduction,

streaming problems have massive data streams, which can only be read once, as

the size of the entire input generally exceeds the storage limit. Now, a streaming

algorithm can be understood as a series of n update-algorithm computations, in

which every update-algorithm computation processes one input value and modifies

the intermediate storage. This intermediate storage is a binary string of maximal size

d N

that has in total 2

different states. Of course, a concrete implementation of

a streaming algorithm does not have to store all d bits, e.g., when the first half of the

bit string is only containing zeros. Therefore, the set of all storage states contains

the empty bit string , the bit string "1" and all possible combinations of bit strings

of length at most d which have a "1" at the beginning. Formally, this cache set is

defined as:

C = {, 1} (1, x) | x {0, 1}

and 1 i d - 1 .

As required, the different storage states have a size of 2

, because,

|C| = |{, 1}| +

d-1

i=1

= 2 + (2

- 2) = 2

The initial cache state is the empty bit string . During the algorithm execution,

this cache can be used to store intermediate results. Finally, after considering all

n input values, an output algorithm transforms the last cache state into a feasible

output value of the streaming problem. This leads to the following definition of a

streaming algorithm.

Definition 3.5. A streaming algorithm A = (A

update

, A

output

) for a streaming

problem S = (X , Y, f ) contains an update algorithm A

update

such that A

update

i-1

, x

) =

and an output algorithm A

output

such that A

output

) = y Y.

For an input stream x X , the algorithm A computes a feasible output value

y = A(x) using the following approach: First, the cache c

C is empty, namely

= . Iteratively, all n inputs values are processed with the update algorithm

update

, namely, for all i (1, . . . , n), c

:= A

update

i-1

, x

). Then, the output value

y Y

is generated as y := A

output

This means, the streaming algorithm A computes a series of intermediate cache

storages c

The first update algorithm computation processes an empty

cache state c

= and the first input value x

, . . . , x

) = x, namely

update

, x

) = c

. The next update computation processes the second input

value, namely A

update

, x

) = c

. This is analogously repeated, until the last input

value is processed, namely A

update

n-1

, x

) = c

We write A

update

, x) = c

i+k

to indicate that the update algorithm is executed

k times on the starting cache state c

, using the input values (x

i+1

, . . . , x

i+k

). The

3.2. Determinism and Randomization

input values have to be a substream of the input stream, namely (x

i+1

, . . . , x

i+k

) is

a substream of (x

, . . . , x

) = x. One can therefore conclude that,

A(x) = A

output

) = A

output

update

, x)) = A

output

update

(, x)).

We say that a streaming algorithm A solves a certain streaming problem S if A

is a streaming algorithm for S as defined above. This implies that for every input

stream x X of the streaming problem, the streaming algorithm computes a feasible

output value, namely A(x) Y

for all x X .

We assume, that the streaming algorithm knows the input stream size n.

The definition of streaming problem and streaming algorithm are leaned on the

idea by Komm [

Kom15

], Prakash [

Pra15

], and Muthukrishnan [

Mut05

]. The definition

of the frequency moment is based on the results by Alon et al. [

AMS99

3.2

Determinism and Randomization

Streaming problems can be solved by streaming algorithms that are either deter-

ministic or probabilistic. For any input stream, a deterministic streaming algorithm

produces the same intermediate cache states and final results in every computation.

Therefore,

Definition 3.5

defines a deterministic streaming algorithm A

det

. The

next definition will demonstrate the essential difference between deterministic and

probabilistic streaming algorithms.

Probabilistic algorithms use random decisions in their computations. Random-

ization is modelled by a random bit string. This random bit string contains a

finite number of random bits, which are used for the algorithm's computation. The

number of random bits depends on the input stream size n. Therefore, we define

the length of the random bit string of at most d(n) N

. For the computation of

a probabilistic algorithm A

prob

, both update and output algorithm may use some

random bits b B =

d(n)

l=0

{0, 1}

. The number of consumed random bits at an

update step or in the output algorithm can vary, based on the input stream and on

previous random decisions. But the sum of all consumed random bits is limited by

d(n). The consequence of the use of these random bits is that the algorithm does

not necessarily always produce the same cache states or the same final results. We

define a probabilistic streaming algorithm as follows.

Definition 3.6. A streaming algorithm A

prob

for a streaming problem S = (X , Y, f )

is called probabilistic if the algorithm uses, within the update or output computations,

a random bit string b of length l d(n) N

, namely b {0, 1}

. The term A

prob

(x)

defines a probability distribution over all feasible output values y Y

, namely, for

all y Y

, we have 0 Pr[y = A

prob

(x)] 1 and

Pr[y = A

prob

(x)] = 1.

Note, that Pr[. . . ] is the used notation to indicate the probability of a certain

event according to the given probability distribution.

As described, probabilistic algorithms do not necessarily always produce the

same results for a fixed input stream x X . We define the deterministic streaming

algorithm A

that uses an already generated random bit string b and produces the

output A

(x) = A(x, b) = y Y

. One can see that A

prob

(x) computes the same

Chapter 3. Definitions

probability distribution as the set of deterministic algorithms A(x, b) where the

bit string b B is chosen uniformly at random. Using a random bit string of size

l d(n), the probability of producing a certain output value y Y

is defined by the

probability distribution and can be calculated by identifying the number of different

random bit strings which lead to the corresponding output value. Namely,

Pr[y = A

prob

(x)] =

|{b | b B and y = A(x, b)}|

For a probabilistic streaming algorithm A

prob

for a decision problem S

decision

the two special cases one-sided and two-sided Monte Carlo algorithm are defined.

Monte Carlo algorithms require the streaming decision problem to have binary

decision values, i.e., there are only two possible decision values Y

= {0, 1}. In the

following, these two types of algorithms are introduced.

Monte Carlo Algorithms

We call a probabilistic algorithm A

prob

for a binary streaming decision problem

decision

, in which there are only two decision possibilities, namely Y

= {0, 1} for all

input streams x X , to be a Monte Carlo algorithm if we have a certain required

success probability. This probabilistic streaming algorithm has to identify the correct

decision with a probability of at least

. We further distinguish between the one-sided

Monte Carlo algorithm and the two-sided Monte Carlo algorithm.

A one-sided Monte Carlo algorithm allows false positives, but never false negatives.

This means that if the correct decision is one (f (x) = 1), then the algorithm always

has to indicate this result. However, if the correct decision is zero (f (x) = 0), then

the algorithm is required to produce this result at least in every second computation

on average. This leads to the following definition.

Definition 3.7. Let A

prob

be a probabilistic streaming algorithm for a binary stream-

ing decision problem S

decision

with Y

= {0, 1}. We call A

prob

a one-sided Monte

Carlo algorithm if, for all x X with f (x) = 1, Pr[f (x) = A

prob

(x)] = 1 and for

all x X with f (x) = 0, Pr[f (x) = A

prob

(x)]

Two-sided Monte Carlo algorithms are similar to one-sided algorithms, but now

both false positives and false negatives are allowed. However, the probability of

identifying the correct decision has to be at least

. Therefore, we have the following

definition.

Definition 3.8. Let A

prob

be a probabilistic streaming algorithm for a binary stream-

ing decision problem S

decision

with Y

= {0, 1}. We call A

prob

a two-sided Monte

Carlo algorithm if, for all x X , Pr[f (x) = A

prob

(x)]

The definition of a probabilistic streaming algorithm is leaned on the results by

Hromkovic [

Hro97

], Definition 2.5.5.1 on randomized protocols. The definition on 1-

and 2-sided Monte Carlo algorithms are similar to the Definition 2.5.5.4 in [

Hro97

]

3.3. Space and Time Complexity

3.3

Space and Time Complexity

A streaming algorithm A for a streaming problem S can be analyzed with respect to

its space and time complexity. The space complexity states the maximal number of

required bits in the storage over the entire computation. As stated in the definition of

a streaming algorithm (

Definition 3.5

), the algorithm is a composition of an update

algorithm, which is executed n times for every input value and an output algorithm.

Each update algorithm computation produces an intermediate result c

C, which is

stored in the cache. The space complexity is the maximal number of bits required to

store the intermediate results and the output value in the cache. We use the notation

of |c

| and |y| to represent the required number of bits to store the content of the

cache state c

and the output value y. Note, that |c

| and |y| are not the absolute

values, as both do not have to be numbers. Formally, we define the space complexity

of a streaming algorithm A and a certain input stream x X as follows,

space(A, x) = max |c

|, |c

|, . . . , |c

n-1

|, |y|

= max |A

update

(, x

)|, |A

update

(, x)|, |A

update

(, x)|,

. . . , |A

update

(, x)|, |A(x)| .

The overall space complexity of a streaming algorithm is simply the worst case

space complexity on all possible input streams with the corresponding stream length

n, i.e.,

space(A) = max space(A, x) | x X .

With the space complexity analysis of a streaming algorithm, one has an upper

bound of the space complexity of the streaming problem S. The space complexity

of the streaming problem is the space complexity of the best streaming algorithm

solving the problem, i.e.,

space(S) = min space(A) | A solves S .

We conclude with this definition of the space complexity approximation.

Definition 3.9. For any streaming algorithm A that solves S, we define the space

complexity as follows.

space(S) space(A) = max space(A, x) | x X S .

We have seen how the space complexity of a streaming algorithm A can be

evaluated. On the other hand, the time complexity indicates the required computation

time. Time complexity analysis is always done using the big-O notation. Basic

mathematical operations such as comparison, addition, or multiplication on values

with a fixed size are made within O(1) time steps. These basic mathematical

operations on numbers relative to n, i.e., i {1, . . . , n}, require a time complexity

of O(log(n)). Similarly, computations on the input values of a streaming counting

problem with input values x

{1, . . . , m} have a time complexity of O(log(m)).

Chapter 3. Definitions

Read or write requests to the storage of (current) size k require O(log(k)) time steps

for a random-access storage.

With these time measures, the time complexities of the update and output-

algorithms can be identified. Formally, time(A

update

) and time(A

output

) are the sum

of the individual basic time measures. The update time complexities indicate the

required time for processing a single input value. For the time complexity analysis

of a streaming algorithm, we are interested in two different measures. First, the

worst-case time complexity for a single update algorithm basically indicates how fast

the individual input values of an input stream can be evaluated or, in other words,

how fast they might arrive.

Definition 3.10.

We define the update time complexity of a streaming algorithm

as the worst-case time complexity of a single update algorithm execution, which is,

update-time(A) = max{time(A

update

i-1

, x

)) |

1 i n, c

i-1

C and x

x X }.

Second, the overall time complexity is the sum of all update time complexities

and the output time complexity. For a fixed input stream x X , this is:

time(A, x) =

i=1

time(A

update

i-1

, x

))

+ time(A

output

)).

We state the second definition on the time complexity.

Definition 3.11. The overall time complexity is the worst-case time complexity

for any possible input streams x X :

time(A) = max{time(A, x) | x X }.

For probabilistic algorithms, the worst-case time complexities state the maximal

required time for any possible random decision b B.

3.4

Approximation Ratio and Success Probability

In this section, the approximation ratio for streaming counting problems is introduced,

which defines the approximation of an output value, compared to the optimal output

value. Afterwards, the success probability for probabilistic algorithms is defined

that renders the probability that the algorithm produces the optimal output value

or one within an acceptable approximation ratio. With approximation ratio and

success probability, the deterministic, exact streaming problem can be modified to

get a deeper analysis and a more differentiated understanding of the usefulness of

additional information.

Approximation Ratio of a Streaming Counting Problem

A streaming algorithm A (a deterministic or probabilistic one) for a streaming

counting problem S

(see

Definition 3.2

) may produce results with a tolerated

3.4. Approximation Ratio and Success Probability

approximation of the optimal output value. We will first introduce how one can

identify the approximation ratio of a given streaming algorithm for a streaming

maximization problem.

Basically, the approximation ratio stands for the fraction between the optimal

result and the output value produced by the streaming algorithm. The approximation

ratio may have any value R, 1, where an approximation ratio of = 1

implies an optimal result. In more general terms, the lower the approximation ratio,

the more accurate the result.

Suppose we are given a deterministic streaming algorithm A

det

that solves a

streaming counting problem S

= (X , Y, f, s). Then, for a certain input stream

x X and the output value A

det

(x) = y Y

, we define the approximation value

for this specific input stream as the fraction between the produced output value and

the optimal value, namely,

approx(A

det

, x) = max

f (x)

det

(x)

det

(x)

f (x)

The max-clause is used for both the approximation value below the optimal

output value (first term) and the approximation values above the optimal output

value (second term). We observe that if and only if the streaming algorithm pro-

duces the optimal value, the approximation value is 1. Furthermore, we define the

approximation ratio over all input streams as follows.

Definition 3.12. Let A

det

be a deterministic streaming algorithm for a streaming

counting problem S

= (X , Y, f, s). The approximation ratio is the maximal

approximation value among all possible input streams, which is,

approx-rate(A

det

, X ) = max{approx(A

det

, x) | x X }

= max

f (x)

det

(x)

det

(x)

f (x)

x X .

We observe that this definition of the approximation ratio may always allow

approximative output values both above and below the optimal output value. Some-

times, only a 1-sided approximation with feasible output values below the optimal

result is allowed. Then, we define the approximation value only with output values

below the optimal result as valid. This is formally defined as follows.

Definition 3.13. Let A

det

be a deterministic streaming algorithm for a streaming

counting problem S

= (X , Y, f, s). The 1-sided approximation ratio is the

maximal 1-sided approximation value among all possible input streams, which is,

1-sided-approx-rate(A

det

, X ) = max{1-sided-approx(A

det

, x) | x X } with

1-sided-approx(A

det

, x) =

f (x)

det

(x)

if A

det

(x) f (x)

else.

(3.1)

This definition of the 1-sided approximation ratio assumes that the streaming

counting problem is a maximization problem. With the second case of

(3.1)

, we

ensure feasible output values below the optima one, as the approximation ratio is

infinite and, as a consequence, useless.

Chapter 3. Definitions

Corollary 3.1. For a certain deterministic streaming algorithm A

det

that solves a

streaming counting problem S

with an approximation ratio approx-rate(A

det

, X ) = 1,

det

always produces an optimal output result for every possible input stream x X .

For a probabilistic streaming algorithm A

prob

, we can determine a success proba-

bility for any fixed approximation ratio 1, which is introduced in the following.

For a single computation, which uses a certain random bit string b B, we can

similarly define the approximation value,

approx(A

prob

, x, b) = max

f (x)

prob.

(x, b)

prob

(x, b)

f (x)

Based on this approximation value of a single computation, we will see in the

following how one can evaluate the lowest approximation ratio for a fixed success

probability such that the success probability is still given. Some streaming algorithms

have the possibility to produce an arbitrarily good approximation ratio = 1 + for

any small 0.

Success Probability

Besides the approximation ratio, we introduce the success probability. For a proba-

bilistic streaming algorithm A

prob

that solves a certain streaming counting problem

, we define the success ratio as the probability that A

prob

produces a result within

the tolerated approximation ratio. We define, for a fixed input stream x X , the set

of all output values within the approximation ratio 1 as ~

, which is,

y | y Y

and

f (x)

y f (x) · .

Given a probabilistic streaming algorithm A

prob

for a streaming counting problem

and a fixed approximation ratio 1, the success probability to produce

an output value within the tolerated approximation is the sum of all individual

probabilities, that A

prob

outputs any ~

. Therefore,

success(A

prob

, x, ) = pr

f (x)

prob

(x) f (x) ·

pr A

prob

(x) = ~

Similarly to the approximation ratio, we can define the success probability of a

probabilistic streaming algorithm for a streaming counting problem as the lowest

success ratio among all possible input streams.

Definition 3.14. For a probabilistic streaming algorithm A

prob

solving a streaming

counting problem S

= (X , Y, f, s) and an approximation ratio 1, the success

probability for the streaming counting problem is defined as,

success-probability(A

prob

, X , ) = min{success(A

prob

, x, ) | x X }.

3.5. Solvability

If and only if the success probability is 1, then the probabilistic streaming algo-

rithm always produces a result within the allowed approximation ratio. We use the

notation S

to indicate a streaming problem with an approximation ratio of 1

and a success probability of p 1.

Similarly to the case of a streaming counting problem, we can define the success

probability for a probabilistic streaming algorithm A

prob

on a streaming decision

problem S

decision

. As we do not have approximation ratios because streaming decision

problem just distinguish between correct and wrong decisions, the success probability

just measures the minimum probability to produce the correct result over all possible

input streams.

Definition 3.15.

For a probabilistic streaming algorithm A

prob

solving a streaming

decision problem S

decision

= (X , Y, f ), the success probability for the streaming

decision problem is defined as,

success-probability(A

prob

, X ) = min Pr[A

prob

(x) = f (x)] | x X .

As stated before, studies of approximation ratios of 1 for streaming decision

problems are meaningless. Here, we are only interested in the success probabilities of

exact solutions using the same formulas as above but simply with = 1. Generally,

a probabilistic streaming algorithm A

prob

for a streaming decision problem S

decision

is required to have a fixed success probability of at least p

. Otherwise it is

possible that a wrong decision may be amplified more often than the correct one. An

exception is the 1-sided Monte Carlo algorithm, as there are only two possible values

and the wrong decision cannot be amplified, as false negatives are not allowed.

In a deterministic environment, the study of different success probabilities other

than 1 is obviously not meaningful, as deterministic streaming algorithms do not use

any random bits.

3.5

Solvability

In this section, some measures are introduced by which a certain streaming algorithm

can be identified as practical or useful. As explained in the introduction, the algorithm

cannot store the entire input stream and compute the entire function in the last final

output algorithm. It was also stated that the required space complexity should be

sub-linear. But what does this mean in detail?

We will recall the definition from the well-known complexity class POLYLOG and

use this measure to classify the required space complexity of a streaming algorithm

as space-efficient (if it is poly-logarithmic) or not space-efficient (if it is not poly-

logarithmic). If a space-efficient algorithm for a certain streaming problem has a

poly-logarithmic time complexity for each update algorithm computation, we then

call the corresponding streaming problem efficiently solvable. If we can prove that

any possible streaming algorithm that solves a certain streaming problem has either

a not space-efficient space complexity or any update algorithm computation has

a time complexity outside the poly-logarithmic class, we then call this streaming

problem not efficiently solvable.

Chapter 3. Definitions

Later, this classification of streaming problems into efficiently solvable and not

efficiently solvable will be very helpful to distinguish different streaming problems

and the possible advantage of additional information like hypothesis verification.

Now we will introduce these categories more formally.

First, we recall: A streaming algorithm A is said to solve a streaming problem

with an approximation ratio 1 and a success probability p 1 if, for every

possible input stream x X , it writes an output value within the approximation

ratio with at least the according success probability. Streaming decision problems

always have an approximation ratio of = 1 to ensure correct decisions.

As a basis: The poly-logarithmic complexity class POLYLOG(n) includes all al-

gorithms A, where there exists a d N

, such that time-complexity(A) O(log

(n))

and space-complexity(A) O(log

(n)). The word polylogarithmic is standardized by

NIST and , for example, listed in the dictionary of algorithms and data structures

by Sant [

San04

Definition 3.16. A streaming algorithm A for a streaming problem S = (X , Y, f )

with input streams x X of length n = |x| is said to be space-efficient, if

space(A, X ) POLYLOG(n). On the other hand, if space(A, X ) POLYLOG(n),

the streaming algorithm A for S is called not space-efficient.

A streaming problem S is said to be efficiently solvable if there exists a space-

efficient streaming algorithm A that solves the streaming problem S and each update

algorithm computation has time(A

update

) POLYLOG(n) and the output algorithm

has time(A

output

) POLYLOG(n).

A streaming problem S is said to be not efficiently solvable if, for any possible

streaming algorithm A that solves S, either space(A, X ) POLYLOG(n) or there

exists an update algorithm computation such that time(A

update

) POLYLOG(n).

Of course, to prove that a streaming problem is not efficiently solvable by

verifying that the time complexity of a certain update algorithm is outside the

poly-logarithmic class is very difficult, as hardly any time-complexity lower bound

proofs exist. Therefore, the later presented proofs of streaming problems being not

efficiently solvable are made by disproving space-efficiency.

3.6. Basic Calculations

3.6

Basic Calculations

For several analyses, the binomial coefficient

has to be approximated. We will

provide the used approximation at this point, so that this proof and calculation

does not have to be repeated. The main content of this approximation is described

in [

Dob10

] in Exercise 2.151, a) and b).

Lemma 3.1 (Dobrushkin [

Dob10

], Exercise 2.151, b). For any n, k N

with

0 k n:

· (n - k)

n-k

Lemma 3.2 (Dobrushkin [

Dob10

], Exercise 2.151, a). For any n N

n/2

1.5n + 1

Lemma 3.3. For any n N

n/2 - 1

2n + 4

Proof. We prove this lemma by starting from the result of

Lemma 3.2

, which is

n/2

. Using a simple relation between

n/2-1

and

n/2

will lead to the above

stated approximation:

n/2 - 1

(n/2 - 1)! · (n/2 + 1)!

definition of binomial coefficient

n! · (n/2)

(n/2)! · (n/2)! · (n/2 + 1)

basic transformation

n/2

n/2 + 1

definition of binomial coefficent

Therefore, we can replace the left side of the claim using the equality,

n/2 - 1

n/2

n/2 + 1

transformation from above

n/2

n + 2

basic transformation

· n

2n · (n + 2)

Lemma 3.2

2n + 4

basic transformation

Next, we want to approximate

n/4

. The proof for this approximation is a bit

more complicated, as the stated approximation for the general case

(

Lemma 3.1

)

leads to a very bad approximation, namely,

= 4

= 2

0.5n

0.915n

Chapter 3. Definitions

With the approach of the following, hand-made proof, we can achieved the last

approximation, namely, 2

0.915n

and one can similarly verify the referenced exercise

from the literature used in

Lemma 3.1

and

Lemma 3.2

Lemma 3.4. For any n 100, n = 4 · l, l N

n/4

0.916

· 2

n·(

+log

(4/3))

0.916

· 2

0.915n

Proof. First, we have to recall the Stirling formula for the approximation of the

faculty,

2n ·

· e

12n

With the Stirling formula we get the following approximation.

n/4

! · (

(3.2)

2n · (

)

· (

(

)

· e

12·n

· (

)

· e

12·3n

(3.3)

2n · (

)

· (

)

· e

· (

)

· e

(3.4)

2n · (

)

n · (

)

· 3

· e

2n·n

3n·n

·3

·e

2·(4e)

(3.5)

2n · n

· 2 · (4e)

3n · n

· 3

· e

2n · 2 · 4

3n · 3

· e

(3.6)

After replacing the binomial coefficient with the corresponding faculties

(3.2)

, we

approximate these faculties using the Stirling approximation from above

(3.3)

(3.4)

and

(3.5)

are some basic mathematical transformations to simplify the term. The

end of

(3.5)

leads to the possibility of transforming the double fraction to a simple

fraction in

(3.6)

, in which in the last term equal parts are cancelled.

Some parts of this remaining term can now be simply approximated. These are

2 2.506, and

3 · 5.442. Furthermore, for any n 100, one can easily

verify that e

1.005. Therefore, we can use this approximations to make the term

simpler:

n/4

2n · 2 · 4

3n · 3

· e

approx. as above

2 · 2.506 ·

n · 4

5.442 · n · 3

· 1.005

using new single approx.

0.916 ·

n · 4

n · 3

0.75n

basic transformations

= 0.916 ·

n · 4

0.25n

0.75n

split 4

into 4

0.25n

· 4

0.75n

3.6. Basic Calculations

= 0.916 ·

0.5n

0.75n

basic transformations

Now, one can transform

= 2

log

(

)

0.415

. With this approximation we can

prove the statement of the lemma:

n/2

0.916 ·

0.5n

0.75n

approx. as above

0.916

· 2

n·(

+log

(4/3))

replace

as explained

0.916

· 2

0.915n

final, basic transformation

Details

Pages

Type of Edition

Erstausgabe

Publication Year

2016

ISBN (Softcover)

9783960670940

ISBN (PDF)

9783960675945

File size

8.6 MB

Language

English

Publication date

2016 (December)

Keywords

Streaming Algorithm Frequency Moment Space Complexity Lower Bound Communication Complexity Hypothesis Verification Streaming Problem Additional Information

Product Safety

Anchor Academic Publishing

Author

Raffael Buff (Author)

Raffael Buff, born in 1990, holds a BSc and a MSc in Computer Science from the Swiss Federal Institute of Technology/ETH Zürich. His studies focused on topics such as Machine Learning, Big Data, Approximation Algorithms, Software Engineering and Distributed Systems. While studying, the author served as a project manager at ETH juniors – a student run consulting agency at ETH Zürich – from 2010 to 2012 and was head of the department IT at ETH juniors from 2011 to 2012. He holds the certificates ITIL Foundation and HERMES Advanced. On a sidenote, he also was Swiss Champion in Online Sudoku in 2006. Currently, the author works as a Business Consultant for an ICT company in Zürich, Switzerland.

Using Additional Information in Streaming Algorithms

Summary

Excerpt

Table Of Contents

Details

Author

Raffael Buff (Author)