# Mathematics of String Spaces and Algorithmic Applications (09w5124)

Arriving in Banff, Alberta Sunday, January 25 and departing Friday January 30, 2009

## Organizers

Alberto Apostolico (Georgia Tech)

Gad Landau (University of Haifa)

Ian Munro (University of Waterloo)

Shan Muthukrishnan (Rutgers University)

Cenk Sahinalp (Simon Fraser University)

## Objectives

Combinatorial pattern matching community, over the years, have developed many interesting algorithms that involve strings as objects. Examples include algorithms for string matching under several measures of similarity or distance, indexing, aligning, comparing strings etc. More recently there has been considerable interest in understanding strings as objects with relationships between them represented as distance

measures. We believe that a workshop devoted to the investigation of the mathematics of string spaces with specific emphasis on their relation to normed vector spaces such as $L_1$ and $L_2$ -for which there is a much better mathematical understanding- will be of considerable interest.

The main topics that will be covered by the proposed workshop will be: (1) embedding properties of string spaces, (2) approximating distances in string spaces, (3) indexibility of and efficient similarity search in string spaces, (4) sketching algorithms for strings and their applications in streaming and (5) applications of property testing methods for processing collections of strings. The workshop will also have discussions on potential applications of these mathematical results to more traditional algorithmic problems involving strings.

One of the main goals of the workshop is the identification of the key challenges and open problems related to mathematics of string spaces and their algorithmic applications. Here are a sample of interesting problems that will be discussed: (1) tight bounds on the volume of a ball of radius r in the string space (as a function of origin and alphabet size), (2) a poly-log d distortion embedding of edit distance to $L1$, (3) approximate or exact nearest neighbors data structures for strings under edit distance variants, (4) tightly approximating edit distance in near linear or sublinear time, (5) any lower bound on approximate randomized data structures for near or nearest neighbor search in metric spaces. We also hope to identify the most helpful tools available for such problems, developed specifically for string spaces or other domains.

Related Bibliography:

Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.

Eyal Kushilevitz, Rafail Ostrovsky, Yuval Rabani: Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces. STOC 1998.

Martin Farach-Colton, Piotr Indyk: Approximate Nearest Neighbor Algorithms for Hausdorff Metrics via Embeddings. FOCS 1999.

Allan Borodin, Rafail Ostrovsky, Yuval Rabani: Lower Bounds for High Dimensional Nearest Neighbor Search and Related Problems. STOC 1999.

Amit Chakrabarti, Bernard Chazelle, Benjamin Gum, Alexey Lvov: A Lower Bound on the Complexity of Approximate Nearest-Neighbor Searching on the

Hamming Cube. STOC 1999.

Graham Cormode, Mike Paterson, Suleyman Cenk Sahinalp, Uzi Vishkin: Communication complexity of document exchange. SODA 2000.

Piotr Indyk: Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. FOCS 2000.

Paolo Ferragina, Giovanni Manzini: Opportunistic Data Structures with Applications. FOCS 2000.

S. Muthukrishnan, Suleyman Cenk Sahinalp: Approximate nearest neighbors and sequence comparison with block operations. STOC 2000.

Graham Cormode, S. Muthukrishnan, Suleyman Cenk Sahinalp: Permutation Editing and Matching via Embeddings. ICALP 2001.

Graham Cormode, S. Muthukrishnan: The string edit distance matching problem with moves. SODA 2002.

Alexandr Andoni, Michel Deza, Anupam Gupta, Piotr Indyk, and Sofya Raskhodnikova: Lower Bounds for Embedding of Edit Distance into Normed Spaces. SODA 2003.

Tugkan Batu, Funda ErgÃun, Joe Kilian, Avner Magen, Sofya Raskhodnikova, Ronitt Rubinfeld, Rahul Sami: A sublinear algorithm for weakly approximating edit distance. STOC 2003.

Mihai Badoiu, Piotr Indyk: Fast approximate pattern matching with few indels via embeddings. SODA 2004.

Piotr Indyk: Approximate Nearest Neighbor under edit distance via product metrics. SODA 2004.

Suleyman Cenk Sahinalp, Andrey Utis: Hardness of String Similarity Search and Other Indexing Problems. ICALP 2004.

Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, Ravi Kumar: Approximating Edit Distance Efficiently. FOCS 2004.

Amit Chakrabarti, Oded Regev: An Optimal Randomised Cell Probe Lower Bound for Approximate Nearest Neighbour Searching. FOCS 2004.

Rafail Ostrovsky, Yuval Rabani: Low distortion embeddings for edit distance. STOC 2005.

Subhash Khot, Assaf Naor: Nonembeddability theorems via Fourier analysis. FOCS 2005.

Moses Charikar and Robert Krauthgamer: Embedding the Ulam metric into $L1$, Theory of Computing Volume 2, Article 11, 2006.

Robert Krauthgamer, Yuval Rabani: Improved lower bounds for embeddings into $L1$. SODA 2006.

Tugkan Batu, Funda Ergun, Suleyman Cenk Sahinalp: Oblivious string embeddings and edit distance approximations. SODA 2006.

Howard J. Karloff, Subhash Khot, Aranyak Mehta, Yuval Rabani: On earthmover distance, metric labeling, and 0-extension. STOC 2006.

Nir Ailon, Bernard Chazelle: Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. STOC 2006.

Alexandr Andoni, Piotr Indyk: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. FOCS 2006.

Mihai Patrascu, Mikkel Thorup: Higher Lower Bounds for Near-Neighbor and Further Rich Problems. FOCS 2006.

Assaf Naor, Gideon Schechtman: Planar Earthmover is not in $L_1$. FOCS 2006.

Alexandr Andoni, Robert Krauthgamer: The Computational Hardness of Estimting Edit Distance. FOCS 2007.

Alexandr Andoni, Piotr Indyk, Robert Krauthgamer: Earth Mover Distance over High-Dimensional Spaces. SODA 2008.

measures. We believe that a workshop devoted to the investigation of the mathematics of string spaces with specific emphasis on their relation to normed vector spaces such as $L_1$ and $L_2$ -for which there is a much better mathematical understanding- will be of considerable interest.

The main topics that will be covered by the proposed workshop will be: (1) embedding properties of string spaces, (2) approximating distances in string spaces, (3) indexibility of and efficient similarity search in string spaces, (4) sketching algorithms for strings and their applications in streaming and (5) applications of property testing methods for processing collections of strings. The workshop will also have discussions on potential applications of these mathematical results to more traditional algorithmic problems involving strings.

One of the main goals of the workshop is the identification of the key challenges and open problems related to mathematics of string spaces and their algorithmic applications. Here are a sample of interesting problems that will be discussed: (1) tight bounds on the volume of a ball of radius r in the string space (as a function of origin and alphabet size), (2) a poly-log d distortion embedding of edit distance to $L1$, (3) approximate or exact nearest neighbors data structures for strings under edit distance variants, (4) tightly approximating edit distance in near linear or sublinear time, (5) any lower bound on approximate randomized data structures for near or nearest neighbor search in metric spaces. We also hope to identify the most helpful tools available for such problems, developed specifically for string spaces or other domains.

Related Bibliography:

Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.

Eyal Kushilevitz, Rafail Ostrovsky, Yuval Rabani: Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces. STOC 1998.

Martin Farach-Colton, Piotr Indyk: Approximate Nearest Neighbor Algorithms for Hausdorff Metrics via Embeddings. FOCS 1999.

Allan Borodin, Rafail Ostrovsky, Yuval Rabani: Lower Bounds for High Dimensional Nearest Neighbor Search and Related Problems. STOC 1999.

Amit Chakrabarti, Bernard Chazelle, Benjamin Gum, Alexey Lvov: A Lower Bound on the Complexity of Approximate Nearest-Neighbor Searching on the

Hamming Cube. STOC 1999.

Graham Cormode, Mike Paterson, Suleyman Cenk Sahinalp, Uzi Vishkin: Communication complexity of document exchange. SODA 2000.

Piotr Indyk: Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. FOCS 2000.

Paolo Ferragina, Giovanni Manzini: Opportunistic Data Structures with Applications. FOCS 2000.

S. Muthukrishnan, Suleyman Cenk Sahinalp: Approximate nearest neighbors and sequence comparison with block operations. STOC 2000.

Graham Cormode, S. Muthukrishnan, Suleyman Cenk Sahinalp: Permutation Editing and Matching via Embeddings. ICALP 2001.

Graham Cormode, S. Muthukrishnan: The string edit distance matching problem with moves. SODA 2002.

Alexandr Andoni, Michel Deza, Anupam Gupta, Piotr Indyk, and Sofya Raskhodnikova: Lower Bounds for Embedding of Edit Distance into Normed Spaces. SODA 2003.

Tugkan Batu, Funda ErgÃun, Joe Kilian, Avner Magen, Sofya Raskhodnikova, Ronitt Rubinfeld, Rahul Sami: A sublinear algorithm for weakly approximating edit distance. STOC 2003.

Mihai Badoiu, Piotr Indyk: Fast approximate pattern matching with few indels via embeddings. SODA 2004.

Piotr Indyk: Approximate Nearest Neighbor under edit distance via product metrics. SODA 2004.

Suleyman Cenk Sahinalp, Andrey Utis: Hardness of String Similarity Search and Other Indexing Problems. ICALP 2004.

Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, Ravi Kumar: Approximating Edit Distance Efficiently. FOCS 2004.

Amit Chakrabarti, Oded Regev: An Optimal Randomised Cell Probe Lower Bound for Approximate Nearest Neighbour Searching. FOCS 2004.

Rafail Ostrovsky, Yuval Rabani: Low distortion embeddings for edit distance. STOC 2005.

Subhash Khot, Assaf Naor: Nonembeddability theorems via Fourier analysis. FOCS 2005.

Moses Charikar and Robert Krauthgamer: Embedding the Ulam metric into $L1$, Theory of Computing Volume 2, Article 11, 2006.

Robert Krauthgamer, Yuval Rabani: Improved lower bounds for embeddings into $L1$. SODA 2006.

Tugkan Batu, Funda Ergun, Suleyman Cenk Sahinalp: Oblivious string embeddings and edit distance approximations. SODA 2006.

Howard J. Karloff, Subhash Khot, Aranyak Mehta, Yuval Rabani: On earthmover distance, metric labeling, and 0-extension. STOC 2006.

Nir Ailon, Bernard Chazelle: Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. STOC 2006.

Alexandr Andoni, Piotr Indyk: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. FOCS 2006.

Mihai Patrascu, Mikkel Thorup: Higher Lower Bounds for Near-Neighbor and Further Rich Problems. FOCS 2006.

Assaf Naor, Gideon Schechtman: Planar Earthmover is not in $L_1$. FOCS 2006.

Alexandr Andoni, Robert Krauthgamer: The Computational Hardness of Estimting Edit Distance. FOCS 2007.

Alexandr Andoni, Piotr Indyk, Robert Krauthgamer: Earth Mover Distance over High-Dimensional Spaces. SODA 2008.