Computer scientists prove that a 40-year-old algorithm is optimal (newsoffice.mit.edu)
201 points by kercker on June 11, 2015 | hide | past | favorite | 63 comments


Am I reading this wrong? The title is wrong, right?

The paper says that if a strongly sub-quadratic solution exists, then the exponential time hypothesis (that SAT cannot be solved in subexponential time) is invalidated.

That's very interesting, but it's not a proof that no strongly sub-quadratic solution exists for SED.

Note that the exponential time hypothesis is strictly stronger than P!=NP. Even if SAT can't be solved in poly time, that doesn't mean it can't be solved in subexponential time. There are functions that lie between polynomial and exponential...

Of course, the paper was careful to explain it, but the media summary...

Edit: I was interested to learn about the notion of strongly subquadratic; there are O(n^2/log n) solutions to SED, but this paper casts doubt on solutions with time complexity O(n^(2-delta)) for any delta > 0. Another commenter mentioned such a subquadratic method.


I haven't seen too many results that rely on SETH, so I just did a bit of research. Ryan Williams (from Stanford) at least doesn't believe SETH is true -- here is a nice talk by him on SETH: http://www.imsc.res.in/~vraman/exact/ryan.pdf


Interesting to note there slide 5 -- "For many polynomial time problems, improving the best known algorithms, even slightly, implies ¬SETH or ¬ETH."

So apparently the edit-distance result discussed here is part of a history of similar results.

Edit: More detail on this in slides 11 and forward.


Not just the exponential time hypothesis; the strong exponential time hypothesis. ETH is just the assumption that CNF-SAT requires exponential time, i.e., time 2^(cn) for some c; SETH is the assumption that it can't be done with c<1.

Edit: Apparently n here refers to just the number of variables rather than the overall size of the problem instance. But that makes sense, because that's where the exponential dependence of the naïve algorithm is in the first place; adding more clauses just makes checking each possibility take slightly longer.


If you're going to be that specific, you should also note that ETH and P != NP could be equivalent. So "strictly stronger" is also a hypothesis.


"Prove" is an incredibly strong word. Too strong for this case.


The Wagner-Fischer algorithm (1974) they mention (AKA Needleman-Wunsch, Lowrance-Wagner, etc.) is quadratic in time and space. The time bound might never get any better - an improved version with linear space complexity was presented by Hirschberg in 1975 - but that doesn't mean execution times can't be improved. Today there are many variants of the same family of algorithms with ever-improving execution times. Some adapt the algorithm to massively parallel setups with hundreds of GPUs, others use speculative execution, some use approximations, and so forth.
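For reference, the quadratic DP everyone in this thread is discussing is small enough to sketch. This is the standard two-row variant (linear space for the distance itself; recovering the actual alignment in linear space is what Hirschberg's 1975 refinement adds):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Wagner-Fischer DP: O(len(a) * len(b)) time.

    Keeps only two rows of the DP matrix, so space is O(min side)
    rather than quadratic.
    """
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # delete a[i-1]
                cur[j - 1] + 1,               # insert b[j-1]
                prev[j - 1] + (ca != cb),     # substitute or match
            ))
        prev = cur
    return prev[-1]
```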


> that’s disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes.

I did some quick googling [1] and found:

> Our algorithm divides the problem into independent `quadrants' ...

> Our results show that our GPU implementation is up to 8x faster when operating on a large number of sequences.

It's still soul crushing. Why did our genome have to be that long :(

BTW, do you have numbers for setups with hundreds of GPUs?

I'm also left wondering about results using stochastic solutions, and how accuracy and problem size relate.

[1] http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=tru...


I'm confused by this. 1,000 years seems a bit steep to me.

Suppose you have two files, `bob.genome` and `mary.genome`. Let's say they are 1gb each [1].

I think I can diff two 1gb files in less than 1,000 years.

diff(1) shows "deletions, insertions, and substitutions".

Therefore, I don't believe it. Yet. What did I miss?

1. http://stackoverflow.com/questions/8954571/how-much-memory-w... (Rounded up because Fermi estimation [2].)

2. https://what-if.xkcd.com/84/


> diff(1) shows "deletions, insertions, and substitutions".

diff(1) doesn't give you a _minimal_ set of edits to apply to go from one file to the other, just _a_ set of edits.


Also, I think he's picturing two almost-equal files. In that case the average running time should be way lower, no? (I believe the quadratic time is worst case)


> two almost-equal files. In that case the average running time should be way lower, no?

Yes, indeed! What you're thinking of is "output-sensitive" algorithms. There are some output-sensitive algorithms for edit distance. The fastest one I found is in "Improved Algorithms for Approximate String Matching" by Dimitris and Georgios Papamichail. They note:

"We designed an output sensitive algorithm solving the edit distance problem between two strings of lengths n and m respectively in time O((s-|n-m|)min(m,n,s)+m+n) and linear space, where s is the edit distance between the two strings."

http://arxiv.org/abs/0807.4368
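A rough sketch of how output-sensitive edit distance works, in the spirit of Ukkonen's banded/doubling approach (not the Papamichail algorithm itself): only DP cells within t of the main diagonal are filled, and the threshold t is doubled until the true distance fits inside the band.

```python
def bounded_edit_distance(a, b, t):
    """Edit distance if it is <= t, else None.

    Only DP cells with |i - j| <= t are filled (a diagonal band), so
    one call costs O(t * min(len(a), len(b))) instead of O(n * m).
    """
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None                               # every alignment needs > t indels
    INF = t + 1
    prev = {j: j for j in range(min(m, t) + 1)}   # row i = 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - t), min(m, i + t) + 1):
            best = i if j == 0 else min(
                cur.get(j - 1, INF) + 1,                        # insert b[j-1]
                prev.get(j - 1, INF) + (a[i - 1] != b[j - 1]),  # sub / match
            )
            cur[j] = min(best, prev.get(j, INF) + 1)            # delete a[i-1]
        prev = cur
    d = prev.get(m, INF)
    return d if d <= t else None


def fast_edit_distance(a, b):
    """Doubling search over the threshold: total work O((s + 1) * min(n, m)),
    where s is the true edit distance -- fast for nearly equal strings."""
    t = 1
    while True:
        d = bounded_edit_distance(a, b, t)
        if d is not None:
            return d
        t *= 2
```

For two almost-equal genomes, s is tiny relative to n, which is exactly the case the parent comment is asking about.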


Cool. The naive approach yields O(exp(s)*n); this looks like a nice improvement.



Yes, but complexity theorists are mostly interested in worst case analysis I believe (which I think is rather unfortunate) -- so the quoted numbers are probably what you get from plugging in 100,000^2 * dt or so.


It also has a quadratic lower bound: it has to fill in the whole matrix of m*n cells, at least in the simple implementation. There may well be some optimisation for almost identical strings.


1,000 years? Really?

Isn't it more likely that somebody misquoted "slow, like a half an hour" as "slow, like a THOUSAND YEARS"?


  >>> 1000 / ((1024. ** 6) / 3e9 / 86400 / 365)
  82.0593593076069
The algorithm is quadratic in the input size. For a Gigabyte of data, that's 1024^6 operations. Dividing that by 3 * 10^9 operations/second (assuming a 3GHz CPU), 86400 (the number of seconds in a day), and 365 (the number of days in a year), we obtain the runtime (in years) assuming that comparing a single byte takes exactly one operation. Dividing 1000 by that number, we get ~82 operations to compare a single byte, and that doesn't look unreasonable.


They're quoting exponential (2^N), not quadratic (N^2) time.

If on some machine a quadratic-time algorithm took, say, a hundredth of a second to process 100 elements, an exponential-time algorithm would take about 40 quadrillion years (2^100 / 100^2 times longer).


That's a different section of the article; the thousand years clearly refers to the edit distance, which is quadratic.


From the description of "a grid ... flooding diagonally" I think it's not comparing two 1gb strings; it's comparing every prefix of one against every prefix of the other:

  for i := 1..bob'length loop
    for j := 1..mary'length loop
      editdistance(substring(bob,1,i), substring(mary,1,j));
    end loop
  end loop


>BTW, do you have numbers for setups with hundreds of GPUs?

I saw a talk on it a while ago, I can only remember they were using CUDAlign and Smith-Waterman (the basic idea is the same). Doing some googling too this seems to be a reasonably recent work with GPUs and CUDAlign (DOI 10.1109/CCGrid.2014.18).

>I'm also left wondering about results using stochastic solutions.

Another talk, I think they were running Smith-Waterman too. The speculative part was during the traversal of the matrix to get the edit distance. It's not the most time consuming part of the algorithm. I got in late for the talk and I didn't get to hear what they did about filling the matrix in the first place, but I imagine they might have done something similar. I'm not very familiar with Smith-Waterman so I can't go into details.


Also,

This is a result about the worst-case running time of any algorithm for finding the edit distance between two arbitrary sequences of up to a given length.

It's possible that an algorithm which usually takes much less time exists. And more to the point, an algorithm that knows something about the structure of genomes might wind up taking even less time. Matching broad areas of different genomes together first seems like an obvious speed-up in this case, and I'm sure there are a number of others.


Dick Lipton (GATech) has a blog post discussing this paper: https://rjlipton.wordpress.com/2015/06/01/puzzling-evidence/.


Short version: computer scientists have shown that if the Edit Distance problem can be solved in strongly sub-quadratic time, then SAT can be solved significantly faster than brute force, contradicting the strong exponential time hypothesis.


Exponential time hypothesis implies P!=NP, but the converse is not true. It is possible that SAT takes subexponential but superpolynomial time.


Ah, a very valid point!

This sort of thing is exactly what I love about Computer Science.


The article clearly states this, but the title doesn't, so N.B.: the proof is conditional on the exponential time hypothesis, i.e. that SAT requires exponential time. Of course we don't know whether it does.


What the article doesn't state clearly is that this assumes the strong exponential time hypothesis: it assumes that SAT cannot be solved in time 1.9999^n -- in other words that it's impossible to do better than the brute-force algorithm, which has complexity 2^n (up to polynomial factors). That's a very strong assumption.



Thanks! I don't understand why the MIT News article has no links to the sources.


It actually does. The link is on the right sidebar.


The authors of the paper note (though this MIT News summary omits) that there is a well-known algorithm for edit distance that runs in subquadratic time. In fact, it runs in O(n^2/log^2 n) on a word RAM. It uses a method sometimes called "Four Russians" or "shaving a log".

Of course, this does not invalidate the results; I mention it only to dispel the notion that a reader may get from this MIT News summary that there is no subquadratic algorithm likely to be found.

It's also interesting reading; just Google for "edit distance" and "Four Russians" and you'll find many summaries.
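Four Russians proper precomputes lookup tables for small blocks of the DP matrix; a relative in the same "pack the DP into machine words" family is Myers' bit-parallel algorithm (1999), which computes the same matrix using O(1) word operations per character when the first string fits in a machine word. It's easy to sketch in Python because big ints act as arbitrarily wide bit-vectors; the recurrences below are the standard published ones, but treat this as a sketch rather than production code:

```python
def myers_distance(a: str, b: str) -> int:
    """Myers' bit-parallel edit distance: one pass over b, with the
    current DP column encoded as +1/-1 delta bit-vectors (pv, mv)."""
    if not a:
        return len(b)
    m = len(a)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    peq = {}                                   # per-character match masks for a
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m                 # score starts at d(a, "")
    for ch in b:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)
        mh = pv & xh
        if ph & high:                          # bottom cell went up by 1
            score += 1
        if mh & high:                          # bottom cell went down by 1
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score
```

On a real 64-bit word this gives roughly a 64x constant-factor speedup over the plain DP, which is the same flavor of win as the log-shaving the parent describes.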


> But it also means that computer scientists can stop agonizing about whether they can do better.

This is incorrect... because a "better" algorithm might have 99.9% accuracy but be millions of times faster.


That's pedantic and also incorrect. The problem of whether an exact solution can be found efficiently can be put to rest. You might not choose to use this algorithm, but that's not really the point of algorithms research. You've just picked a strawman to criticise the wording on.


I don't think so. I think he is pointing out a really, really important thing that many overlook because of the way we have come to teach computer science.

When evaluating different algorithms, there are lots of criteria we could use. How often is it correct? How long does it take to run? How difficult is it for people to read? How often does it use the letter "r" which happens to be broken on my keyboard?

Computer science made ENORMOUS strides by picking out a specific criterion (running time on the computers of the time) and finding a way to make it mathematically rigorous (asymptotic performance analysis, Big-O notation, and all of the related mechanisms that we learn in computer science classes). This was TREMENDOUSLY valuable, and by turning the whole power of mathematical analysis loose on the problem it formed the modern field of algorithm analysis and completely transformed how we build computers.

But we need to remember that this is based on one particular simplification of how computers work. It assumes a von Neumann architecture where execution steps are the key criterion. We have extended this framework to consider things like parallel execution... that was fruitful also. More recently, we've been noticing that our actual machines are no longer dominated by the "steps" in the algorithm, but most often are dominated by memory usage, so we have turned the same formalism onto the use of memory -- but still (in my opinion) lack a rigorous approach for analyzing the combination of steps taken and memory usage.

Even more significant is the fact that there are OTHER things we could trade off. One of those is accuracy. Look at some of the research on probabilistic algorithms: you will find that there are some incredible gains to be made with losses in accuracy that are well within the bounds of what are acceptable for most uses. Yet this isn't covered in an introductory computer science course, so many programmers are not even aware of the option. With so much of traditional algorithm analysis already mapped out by the past several decades of researchers, much of the fertile ground in the near future will, I believe, lie in investigating these other sorts of tradeoffs.


No - there are many fruitful formalisations of parallel programming (the CLRS chapter is pretty good) and indeed, of non-von Neumann architectures (check out balancing networks, comparison networks, etc.).

This sort of work is not reliant on von Neumann - it quantifies the amount of work you need to do. Coincidentally, the algorithm listed parallelises very well; it has a good span.

Of course it is useful to avoid galactic algorithms, but the article is totally correct. It would be incorrect to read its conclusion as anything but what it actually says. As a general principle, sure, but it shouldn't be a criticism of this article.


There are two much better criticisms they could have made:

1. The proof is conditional on the unproven hypothesis that CNF-SAT requires 2^n time.

2. There could be an exact algorithm which uses randomness and has less than n^2 expected runtime. After all, it's still unknown whether ZPP=EXP afaik (although it almost certainly does not).


There is plenty of work being done on approximation algorithms, for good reason. If you're dealing with scientific data that's subject to measurement noise, it's valuable to research whether that 0.1% difference actually matters or not. That said, this is great work, it's just not a closed case.


Yes, of course - and there are some really cool proofs (easy to understand also). A personal favourite is Christofides' algorithm. BUT that's not relevant here.


It seems relevant to start talking about approximation algorithms as the next step after proving a "1000 year" runtime for exactly comparing genomes. My understanding is that comp bio folks typically use heuristics anyway.


99.9% accurate is only better for some values of "better". An airplane routing system that ensures 99.9% of the planes will arrive at their destination and 0.1% will collide with terrain or other airplanes and ensure all passengers die screaming in horror is not what I would call a good airplane router.

You may have a million-times faster solution that's approximate and that may be just fine. It's not, however, a solution that always produces the correct answer. There is a difference between good enough and perfect.


To be fair, the idea of correctness (in the optimization sense) is based on simplifying assumptions. For example, using the edit distance to compare genomes makes an assumption that insertions, deletions, and substitutions "cost" the same amount, but the reality is far more complicated due to the mechanics of how mutations actually happen. The edit distance is incredibly useful anyway, and a 99.9% answer is possibly just as useful.


...Except, of course, that even the exact solution won't always produce the correct answer, due to the wonders of computers not being perfect.

Maybe 99.9% accurate isn't good enough, but there is always a level of inaccuracy from the algorithm where your random silent computer hardware errors dominate. And that may be still faster than the exact algorithm.


Genomic analysis relies heavily on probabilistic approaches anyway, so it's less likely to matter there than with an airplane routing system.


You need to be precise. It may be the case that you can solve a different problem, using probabilistic techniques, much more quickly. This happens often. It could be the case that genes have characteristics that make them amenable to such a specialized solution, since they aren't just random strings (introns notwithstanding).

But that would indeed be a different problem. This problem, the precise edit distance problem, has been proved to be optimally-bounded by the existing solution, and yes, computer scientists can stop agonizing about this problem.

It is not advantageous to anybody to try to blur the distinctions, nor is it any sort of "gotcha!" or win to do so... it is only a loss of clarity.


Rather my point was that it's easy to give up on a problem if you have a limited definition of what it means for a solution to be "better". Usually we even have to resort to satisficing for our definition of "better".


Still incorrect.

It might be possible to solve the precise edit distance problem exactly, and much faster using a different technique, but only for specific instances of the problem.


Limiting the problem to specific instances is still a different problem. Otherwise, the problem is ill-defined and "certain instances" of all problems can be solved in O(1).

The problem is what it is. You don't get to change its domain or its range or the statistical distribution of its instances or the nature of its solution (probabalistic, etc.) or anything else, or you've defined a new problem. There's nothing wrong with creating new problems, but you have changed nothing about the original... problems are immutable.


His remark is important as a refutation of this sentence from the article:

>In a sense, that’s disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes. But it also means that computer scientists can stop agonizing about whether they can do better.

People are using the result to draw conclusions about real world problems that may well be easier than the general case.


In my opinion it is incorrect for a much better reason - we don't know that P != NP. I am not aware of any arguments for P != NP that are much better than "We tried hard to find fast algorithms but failed." or "Writing a great symphony is much harder than recognizing one." and therefore I think the possibility that P == NP is often not taken as seriously as it maybe should be.


Have you looked for intuitive arguments about why P ≠ NP? Scott Aaronson's blog is a good place to find such things [1]:

> If P=NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett.

Complexity theorists have also found more formal reasons why it would be hard to prove P ≠ NP. For example, you can't use "natural" proofs, "relativizing" proofs, or "algebrizing" proofs [2]. (Conversely, in a P=NP world you'd expect it to be easy to prove P=NP.)

1: http://www.scottaaronson.com/blog/?p=122

2: http://www.math.ias.edu/~avi/PUBLICATIONS/MYPAPERS/SCOTT/alg...


When I mentioned writing a symphony I was essentially referring to your quote from Scott Aaronson, and I am also aware that some known proof methods have been shown not to be strong enough to settle the question.

I don't really buy into most of the examples in the quote. For example, it is already a complex problem to quantify what makes a symphony great; once you've solved this, it of course becomes easy to recognize great symphonies. To write a great symphony you could then just exhaustively search the space of symphonies and rate them until you find a great one.

But does it really sound highly implausible to you that once you figured out what makes a symphony great, you could use that knowledge to prune the search space and discover great symphonies much faster than using exhaustive search?

And the fact that known techniques are not good enough to settle the question doesn't sound like a surprise to me either. The very fact that the problem is not yet settled means that the usual approaches don't work and we have to use known things in an innovative way or come up with something entirely new.

So the fact that some things don't work hardly comes as a surprise, but even this doesn't imply that P == NP is more likely right or more likely wrong; it just says that figuring it out is hard. You could maybe argue that it is evidence for the problem being independent from ZFC or whatnot if we and our tools constantly fail to settle the problem one way or the other.


The paper mentions an algorithm published in 1980 that is O( (n/log(n))^2 ), which is better than the Wagner-Fischer algorithm mentioned in the article, even if it's not strongly subquadratic.


O( (n/log(n))^2 ) > O(n^1.99999999999999) for sufficiently large n


Yes, the fact that it's >O(n^k) for all k in [1,2) is why it's not strongly subquadratic. It's still less than O(n^2).
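A back-of-envelope check of just how large n has to get before a hypothetical O(n^1.99) algorithm would actually beat O(n^2 / (log2 n)^2): comparing logarithms to avoid overflow, n^1.99 wins only once n^0.01 > (log2 n)^2.

```python
import math

def crossover_power_of_ten():
    """Smallest d such that n = 10**d satisfies n^0.01 > (log2 n)^2,
    i.e. the point where n^1.99 finally beats n^2 / (log2 n)^2.
    Works in log-space so nothing overflows."""
    for d in range(1, 10_000):
        ln_n = d * math.log(10)
        lhs = 0.01 * ln_n                        # ln(n^0.01)
        rhs = 2 * math.log(ln_n / math.log(2))   # ln((log2 n)^2)
        if lhs > rhs:
            return d
    return None
```

The crossover lands around n ≈ 10^670, which is why shaving the delta off the exponent matters in theory but not on any input that fits in the universe.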


The edit distance between two strings L and S, where |L| >= |S|, is at most |L|. For any number n and string W, a DFA can be built which accepts an input IFF its edit distance to W is < n. Such a DFA can be built and run in O(|W|) time.[1]

What's the complexity of a 'binary search' for the edit distance between L and S using this method? The DFA construction complexity grows very fast with respect to n, but we only need log(|L|) DFAs. And the search cut points can be skewed to amortize the growth with respect to n (i.e. first n << |L|/2).

[1] see page 63 of https://scholar.google.com/scholar?cluster=99460367496861516...


Well... I guess now it's up to the hardware engineers to make it faster... ;-)


This is so wrong. Just because it's NP-hard (optimization problem, right?) it doesn't mean that there aren't very-very-very good heuristic algorithms that can make the search significantly faster than exponential in the vast majority of the cases. Take a look at SAT solvers and see how deep the rabbit hole really goes.


Heuristic algorithms can never be "optimal". The article is precise and correct. I use heuristic algorithms for all my DNA alignment, and for all my probabilistic modeling, but I do that with the knowledge that I'm not necessarily finding an optimal solution, I'm only finding a "pretty good" solution, like the SAT solvers do.


It's θ(n²) time with Needleman-Wunsch / Wagner-Fischer. The paper now says that the problem requires n^(2-o(1)) time as long as SETH holds. Nothing NP-hard about sequence alignment, and it also isn't an optimization problem.

You are also mixing up two kinds of heuristics. SAT solvers use heuristics to guide the search, making it faster in the expected case (but possibly also slower), but they do not sacrifice correctness. Sequence alignment heuristics sacrifice correctness for speedups. Those are two very different concepts.


It's questionable whether SAT solvers really work well for "the vast majority of the cases". They work very well for the problems they are tasked with. DPLL-based SAT solvers don't work well with randomly generated SAT instances which survey-propagation-based solvers excel on. On SAT-instances that arise in the verification of CPUs the opposite is true. Both approaches break down where SAT instances have more than 10^7 variables, and that's almost all cases. ;)

Still, it never ceases to amaze me just how well DPLL-based solvers work on the instances that I feed them, and that despite all the theoretical limitations that we know they have.


I love 'proofs' which rely on an unproven component.



