**Mathematical proofs help clarify why certain algorithms for processing large data streams are better than others**

Computers are getting smaller and faster every day, but the amount of information is growing even faster, outpacing the processing capacity of computing systems. Large datasets are often serial: they arrive as a continuous stream, such as data received from sensors or satellites, so it is difficult or impossible to go back and retrieve earlier information. In recent decades, researchers in computer science and mathematics have confronted the challenge of analyzing such massive datasets, in part by identifying and characterizing algorithms that can handle the huge amounts of data efficiently. Prof. Robert Krautgemmer from the Department of Computer Science and Applied Mathematics at the Weizmann Institute explains that problems in analyzing data streams can be divided into problems that are impossible to deal with - because doing so would require enormous computing resources - and reasonable problems which, even if they look quite similar to the "impossible" problems, are much easier to deal with.

Prof. Krautgemmer clarifies what this means with an example from the field of cyber and information security: to detect a distributed denial-of-service (DDoS) attack, in which a server is flooded with traffic, a network technician must know how many different computers contacted the server over the course of a day, and which computer did so the most times. Answering based on the digital identities (IP addresses) of the requesting computers would be very inefficient if the server had to scan its entire history. But an algorithm that uses random samples in a sophisticated way can estimate the number of computers in question with a margin of error of about 1%. Likewise, while it is difficult to identify exactly which IP address appears the most times, there is an algorithm that can identify addresses that appear with particularly high frequency ("heavy hitters"), if there are any.
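The article does not name the heavy-hitters algorithm it alludes to; one classic example with exactly this flavor is the Misra-Gries summary, sketched below on a toy stream of made-up IP addresses. It uses only k-1 counters, far less memory than the stream itself, and is guaranteed to retain every item that appears more than n/k times in a stream of length n.

```python
def misra_gries(stream, k):
    """Misra-Gries summary: keeps at most k-1 counters, yet every item
    appearing more than len(stream)/k times is guaranteed to survive
    in the output (possibly with an undercounted frequency)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Toy stream: one heavy hitter (60 of 100 requests) among 40 one-off IPs.
stream = ["10.0.0.7"] * 60 + [f"192.168.1.{i}" for i in range(40)]
print(misra_gries(stream, k=10))  # "10.0.0.7" is guaranteed to appear
```

Note that the summary can report false positives (infrequent items that happen to survive), but never misses a true heavy hitter - a second pass over the stream, if possible, can verify the candidates.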

"For me, the most amazing thing about our proofs is that the mathematical characterization that identifies whether an efficient algorithm is possible or not is not related to algorithms at all"

A good algorithm is one whose memory usage is significantly smaller than the size of the dataset itself, so that the calculation can be fast and efficient. In recent studies, Prof. Krautgemmer and his colleagues presented a mathematical analysis that sorts these problems into easy and hard. To shed light on the studies, Prof. Krautgemmer returns to the cyber attack: "An algorithm for locating the most frequent IP address in the data stream - or even just estimating how often it appears - cannot be efficient. But there is another question whose answer is easy, and in this case even more important: is it possible to estimate how many different addresses appear with a frequency greater than 0 - that is, appear at least once?" Such calculations involve a common family of functions, the ℓp norms, where the exponent p is some number. When p = 0, it is easy to estimate the value of the function; when p is infinite, such an estimate is very hard. And what about values of p between the two? It turns out that the estimate is hard for any value above 2, and easy for any value up to and including 2; that is, there is a sharp transition - similar to a phase transition in physics. Why is this so? The accepted explanation was that any value above 2 is similar in nature to infinity. But in the course of characterizing the more general set of all symmetric functions, Prof. Krautgemmer and his colleagues provided two mathematical proofs of the sharp separation, and showed that a built-in mathematical principle underlies this behavior.
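The quantities behind these ℓp questions can be computed exactly on a small example; the sketch below does so directly (the streaming algorithms the article discusses estimate these values with far less memory, which is possible only for p up to 2). Here F_p is the p-th frequency moment, the p-th power of the ℓp norm of the frequency vector: p = 0 counts the distinct items (each address appearing "with frequency greater than 0"), while large p is increasingly dominated by the single most frequent item.

```python
from collections import Counter

def frequency_moment(stream, p):
    """Exact p-th frequency moment F_p = sum over items i of (f_i)^p,
    where f_i is how many times item i occurs. For p = 0 this is just
    the number of distinct items (taking 0^0 = 0)."""
    freqs = Counter(stream).values()
    if p == 0:
        return sum(1 for f in freqs if f > 0)
    return sum(f ** p for f in freqs)

stream = list("abracadabra")  # frequencies: a:5, b:2, r:2, c:1, d:1
print(frequency_moment(stream, 0))  # 5  - distinct items
print(frequency_moment(stream, 1))  # 11 - the stream length
print(frequency_moment(stream, 2))  # 35 = 5^2 + 2^2 + 2^2 + 1 + 1
```

As p grows, the term 5^p for the most frequent item 'a' swamps the rest, which is the intuition for why large p behaves like the "most frequent item" problem and, like it, is hard to estimate in small space.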

"For me, the most amazing thing about these proofs," says Prof. Krautgemmer, "is that the mathematical characterization that identifies whether an efficient algorithm is possible or not has nothing to do with algorithms at all."

In another study, Prof. Krautgemmer examined a larger set of problems, asking how one can tell whether a new problem admits an efficient algorithm. One of the most basic methods in computer science, called reduction, mathematically converts a new problem into a known basic problem, such as the ℓp-norm problems in the example above; one can then simply run any known algorithm for the basic problem. But if this method fails - that is, the new problem cannot be converted into a basic problem - can an efficient algorithm be found by some other method? Prof. Krautgemmer and his colleagues recently proved that if a new problem from this family cannot be converted into a basic problem, then no efficient algorithm exists. Their proof drew on a classical branch of mathematics named after the Polish mathematician Stefan Banach, who developed the field in the early 20th century. "The conclusion is that if there is a good algorithm, however sophisticated it may be, then it is also possible to convert the new problem into a basic problem," he says. "Our contribution is that we have shown that something is algorithmically possible if, and only if, there is a simple mathematical conversion."
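A toy, hypothetical illustration of the reduction idea (not taken from the paper): suppose the "new" problem is counting how many distinct last octets appear among requesting IP addresses. Rather than inventing a fresh algorithm, each stream item is converted into an instance of the basic distinct-elements (F_0) problem, and any known solver for that basic problem is then run unchanged as a black box.

```python
def count_distinct(stream):
    # Stand-in for any distinct-elements (F_0) algorithm; in practice
    # this black box could be an efficient small-memory streaming sketch.
    return len(set(stream))

def distinct_last_octets(ip_stream):
    """'New' problem solved by reduction: convert each IP address into
    an item of the basic distinct-elements problem (its last octet),
    then run the known solver on the converted stream."""
    mapped = (ip.rsplit(".", 1)[-1] for ip in ip_stream)
    return count_distinct(mapped)

ips = ["10.0.0.7", "10.0.0.9", "192.168.1.7", "10.0.0.12"]
print(distinct_last_octets(ips))  # last octets {7, 9, 12} -> 3
```

The point of the theorem described above is the converse direction: for the family of problems studied, when no such item-by-item conversion exists, no efficient algorithm of any kind exists either.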

Prof. Krautgemmer adds: "We explain a computational phenomenon using something that seems unrelated - a branch of mathematics that contains no computational or algorithmic component. Such a characterization gets to the root of the problem - why, for example, values of p above 2 are hard to compute - and provides a comprehensive explanation of its causes."

Prof. Krautgemmer's work is theoretical, but it contributes to the work of computer scientists dealing with complex data streams - from communication networks to records of retail sales, satellite data and biological experiments.