Research statement:

I am interested in applying machine learning techniques to systems and other less traditional areas. I split my MS at CMU evenly between architecture and pattern recognition/signal processing, which gave me a natural foundation for applying the machine learning algorithms I had learned to the systems problems I was working on. I think this is a promising area right now: computers are much faster than they were even 10 years ago, and huge amounts of data are available online.

My first project was Laika, which reconstructed the data structures used by a program from its memory image, without debugging information. It worked by classifying each machine word (usually as a pointer, string, or integer) and then clustering the resulting sequences of classified words. We used this to build a virus checker for polymorphic worms by running Laika on the memory images of two programs at once. If the programs used similar data structures, many clusters would contain objects from both programs.
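To make the classify-then-cluster idea concrete, here is a toy sketch. It is not Laika's actual algorithm: the 32-bit little-endian word size, the heap bounds, the four-letter labelling rules, and the use of exact signature matching in place of real clustering are all assumptions made for illustration, and every function name is hypothetical.

```python
import struct
from collections import defaultdict

WORD = 4  # assumption: 32-bit little-endian machine words


def classify_word(word, heap_lo, heap_hi):
    """Label one word as zero (Z), address (A), string (S), or integer (I)."""
    value = struct.unpack("<I", word)[0]
    if value == 0:
        return "Z"
    if heap_lo <= value < heap_hi and value % WORD == 0:
        return "A"  # looks like an aligned pointer into the heap
    if all(32 <= b < 127 or b == 0 for b in word):
        return "S"  # printable ASCII, possibly NUL-padded
    return "I"


def signature(obj, heap_lo, heap_hi):
    """Turn an object's raw bytes into a string of per-word labels."""
    usable = len(obj) - len(obj) % WORD  # ignore a trailing partial word
    return "".join(
        classify_word(obj[i:i + WORD], heap_lo, heap_hi)
        for i in range(0, usable, WORD)
    )


def shared_clusters(objects, heap_lo, heap_hi):
    """Group (program, bytes) pairs by signature; keep groups that span programs.

    Exact signature equality stands in for a real clustering step here.
    """
    programs_by_sig = defaultdict(set)
    for program, obj in objects:
        programs_by_sig[signature(obj, heap_lo, heap_hi)].add(program)
    return {sig for sig, progs in programs_by_sig.items() if len(progs) > 1}


objects = [
    ("known", struct.pack("<I", 0x1004) + b"abcd" + struct.pack("<I", 7)),
    ("suspect", struct.pack("<I", 0x1FF0) + b"wxyz" + struct.pack("<I", 300)),
    ("suspect", struct.pack("<I", 5) + struct.pack("<I", 6)),
]
print(shared_clusters(objects, 0x1000, 0x2000))
```

In this made-up input, the first two objects have the same pointer/string/integer layout even though their contents differ, so they land in one group that spans both programs.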

My second project is Macho, which tries to write programs from a combination of natural language and examples, in the form of unit tests. This may seem unnecessarily complicated, but it is actually quite logical. The problem with natural language programming is that as the language becomes more abstract, the number of ``reasonable'' solutions increases, and the programming system must pick one using heuristics. Not only will it make mistakes but, worse, the programmer won't know what decisions were made or what the resulting program does. Testing the candidate programs (in our case, with only one test) not only removes buggy or incorrect solutions but also reduces the ambiguity. The natural language provides moderate information over the entire program space, while the examples provide precise information over a tiny fraction of it. The whole is more than the sum of the parts.
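A toy illustration of how a single example can cut down the ambiguity: suppose a vague description such as "the biggest of the numbers" admits several candidate programs. The candidates and helper below are hypothetical and say nothing about how Macho itself generates or ranks programs; the sketch only shows the filtering step.

```python
# Hypothetical candidate readings of "the biggest of the numbers".
candidates = {
    "largest_value": lambda xs: max(xs),
    "largest_magnitude": lambda xs: max(xs, key=abs),
    "total": lambda xs: sum(xs),
}


def consistent(candidates, examples):
    """Return the names of candidates that reproduce every (input, output) example."""
    return [
        name
        for name, program in candidates.items()
        if all(program(inp) == out for inp, out in examples)
    ]


# One example is enough to separate the three readings here.
examples = [([3, -7, 2], 3)]
print(consistent(candidates, examples))
```

With no examples, all three candidates survive; with the one example above, only the first reading remains.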

I think there is a huge amount of interesting work to be done in combining programming inputs. Some problems are easy to describe in natural language, others are easier to specify with an example, and still others would be better served by pictures, pseudocode, or equations. In addition, any really good system would be interactive and incremental: if it got ``close'', the user should be able to guide it to a good solution. Of course, I don't know how to do this. But I have some ideas that might work, and I'm excited to try.