Things I'm working on this summer
It's been a busy semester. My two main achievements were putting together a Python proof of concept for Dejector, a provably secure defense against SQL injection attacks, and a linear-case implementation of Meredith's Virtual Stewart, a support-vector-machine-based method of deriving user preference functions from random or keyword-searched database tuples. There's a lot I want to work on this summer, too, though, including:
- A real-world version of Dejector. The current version uses a toy implementation of SQL; I'm already working on one which uses the same dialect that PostgreSQL does. There are some interesting challenges inherent to this, and I'll talk about them in the next post.
- Adding nonlinear-SVM functionality to MVS. This is harder than it looks. Linear SVMs are convenient because once trained, they can be reduced to a single weight vector. If your feature space is f and your weight vector w, then each tuple in your test set has a score: f1 * w1 + f2 * w2 + ... + fn * wn. That translates nicely into an ORDER BY clause in an SQL statement. Nonlinear SVMs aren't that convenient; once you've trained a model, you've got some parameters available to you that you can use to copy the model elsewhere, but it's not as convenient. (That said, I have some thoughts on how to translate a polynomial-kernel SVM into a weight vector, since a polynomial kernel of order n captures the same meaning as "a nonlinear function on all n-ary combinations of features" -- for instance, if each instance vector is of order 2 and you want a polynomial learner of order 3, you'd have a weight vector like a1x13 + a2x12x2 + a3x1x22 + a4x23. But as you can imagine, this turns into a combinatoric explosion in large feature space and with large-order polynomials. For a 10-ary feature space and a fifth-order polynomial you're looking at |a| > 4000.) So I want to push this inside a database engine, like Postgres, and find a compact way to represent the model in SQL. I have some ideas; the hard part for this will be writing the glue.
- I also don't really like any of the existing SVM libraries. libsvm is written in C so ugly that it makes me want to rip out my eyes. SVMlight is much prettier, but it's encumbered by a stupid license that makes it basically impossible for me to extend, and it's still written in C, so every function is unique and there's a lot of code reuse; plus it doesn't use sequential minimal optimization. SMO-Java uses SMO (of course), but it's in Java, which isn't convenient for a database written in C. So I'd like to write a C++ library which handles kernel functions in a templatized fashion, and create extern C linkages so that I can incorporate it into pure-C stuff like Postgres.
- Now on to the fun stuff (as in "things I'm not doing for school"). The other day I was thinking about how it's easy to find tech articles thanks to /., and pop-culture articles thanks to BoingBoing, but finding new academic articles is kind of a pain. This led me to the idea of creating an RSS feed for Citeseer. arXiv.org already does RSS (thanks to Kragen, who feeds me all kinds of cool things, for pointing that out to me), but I've been ... less than thrilled ... with the quality of some of the articles I've seen there so far. Anyway, it turns out that Citeseer implements the Open Archive Initiative's Protocol for Metadata Harvesting, which in a nutshell means that it won't be hard at all to query Citeseer via OAI-PMH to get day-to-day updates and turn the returned XML into a feed. I'd like to do a proper Python implementation of OAI-PMH, just for practice, and will probably model it off of the XML-RPC stuff in Twisted.

<< Home