2005-01-18

Bye bye Blogspot...

From now on I'll be posting at porges.name. That is all.

2004-12-30

Bayesian scavenging.

Spent the afternoon coding the algorithms from SpamBayes into PHP. Hurrah for open-source!
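
For readers who have not met this kind of filter, here is a minimal sketch of the core idea, in Python rather than PHP. It is not SpamBayes's code and not the port mentioned above: it uses the simple combining formula popularised by Paul Graham's "A Plan for Spam", and the function name, the clamping bounds and the sample probabilities are all assumptions made for illustration.

```python
from math import prod


def combine(token_probabilities, floor=0.01, ceiling=0.99):
    """Combine per-token spam probabilities into one score between 0 and 1.

    Each input is the estimated probability that a comment containing
    that token is spam. Values are clamped (assumed bounds) so that a
    single 0.0 or 1.0 cannot force the result or cause a zero division.
    """
    clamped = [min(max(p, floor), ceiling) for p in token_probabilities]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)


# Hypothetical token probabilities for a single comment.
print(combine([0.99, 0.95, 0.20]))
```

A score near 1 suggests spam, a score near 0 suggests a legitimate comment, and values in between are the "unsure" cases.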

2004-12-29

Spam prevention process

A quick outline of the process I'll be using to prevent comment spam.

  1. Hidden field hashing as described in "Comment spam prevention". This also forces a preview, which may fool spambots, and lets the cheap checks (invalid HTML, too many links) run at preview time without performing all of the following steps.
  2. Check the IP address against Spamhaus, DSBL, and any other RBLs that the user specifies. If one matches, block the IP for a short period and reject the comment.
  3. Find URIs.
  4. Check URIs against a blacklist, such as MT-Blacklist, or a personal blacklist such as Simon Willison's blacklist. If one matches, block the IP for a short period and reject the comment.
  5. Check URIs against SURBL. If one matches, block the IP for a short period and reject the comment.
  6. Run the comment through a Bayesian filter. If it scores as a strong match to known spam comments, block the IP for a short period and reject the comment. If the score is in the uncertain range, move the comment to the moderation queue.
  7. Optionally, for the paranoid:
    1. Follow links in the post (following all redirects) and check all the links on the resultant page.
    2. Force all comments to join the moderation queue, unless the user accepts an email verification or is authenticated through other means (TypeKey, site-specific registration).
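
As a rough illustration of step 2: DNS-based blocklists are queried by reversing the octets of the IPv4 address and appending the list's zone. If the resulting name resolves, the address is listed. The Python sketch below only builds the query name. The function name and the zone "dnsbl.example" are placeholders, not a real blocklist.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blocklist zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    octets = str(address).split(".")
    return ".".join(reversed(octets)) + "." + zone


# "dnsbl.example" is a placeholder zone used only for illustration.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example"))
```

The actual lookup would then be an ordinary DNS query for that name, with "no such host" meaning the address is not listed.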

If the comment passes all these tests it is probably not spam. Any transformations can be performed and the comment stored in the database.
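
The ordering of steps 2 to 6 could be sketched roughly like this in Python. Everything here is an assumption made for illustration: the function names, the loose URI pattern, the two cutoff values, and the idea of passing the lookups in as callables. Hidden-field hashing, the temporary IP block and the "paranoid" steps are left out.

```python
import re

# Deliberately loose pattern: good enough to pull out candidates for checking.
URI_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def find_uris(text):
    """Step 3: return every http(s) URI found in the comment text."""
    return URI_PATTERN.findall(text)


def judge_comment(text, ip, ip_is_listed, uri_is_listed, spam_score,
                  spam_cutoff=0.9, unsure_cutoff=0.2):
    """Return "reject", "moderate" or "accept" for one comment.

    ip_is_listed, uri_is_listed and spam_score are caller-supplied
    callables standing in for the RBL, blacklist/SURBL and Bayesian
    checks. The cutoffs are hypothetical defaults.
    """
    if ip_is_listed(ip):                                    # step 2
        return "reject"
    if any(uri_is_listed(uri) for uri in find_uris(text)):  # steps 3 to 5
        return "reject"
    score = spam_score(text)                                # step 6
    if score >= spam_cutoff:
        return "reject"
    if score >= unsure_cutoff:
        return "moderate"
    return "accept"


# Example with stand-in checks.
verdict = judge_comment(
    "Nice post! See http://spam.example/pills",
    "192.0.2.10",
    ip_is_listed=lambda ip: False,
    uri_is_listed=lambda uri: "spam.example" in uri,
    spam_score=lambda text: 0.1,
)
print(verdict)
```

The cheap checks run first so that the more expensive ones are only reached by comments that have survived everything before them.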

2004-12-23

Apache's mod_rewrite and URI encoding...

For the rule "RewriteRule ^blog(.*)$ escrib/Escrib.php?arg=$1":

  • http://localhost/escrib/Escrib.php?arg=%252525 → arg is %2525 (correct: decoded once)
  • http://localhost/blog/%252525 → arg is %25 (incorrect: decoded twice)

Argh. And according to the mailing list, there is no way around this.
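
The two results are consistent with the value being percent-decoded once in the first case and twice in the second. Here is a small Python illustration of that arithmetic, using the standard library's `unquote` as a stand-in for whatever Apache and PHP do internally:

```python
from urllib.parse import unquote

raw = "%252525"
decoded_once = unquote(raw)            # what the direct request delivers
decoded_twice = unquote(decoded_once)  # what arrives via the rewrite rule

print(decoded_once)
print(decoded_twice)
```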

So, what happens when a user wants to use a percentage sign in their titles?

  1. Title: "44% of US Citizens favour more restrictions on Muslims."
  2. Permalink: "/blog/44%25-of-us-citizens-favour-more-restrictions-on-muslims"
  3. With mod_rewrite, the variable PHP receives is: "/44%-of-us-citizens-favour-more-restrictions-on-muslims"

Phew. But:

  1. Title: "Apache hates %25!"
  2. Permalink: "/blog/apache-hates-%2525"
  3. With mod_rewrite, the variable PHP receives is: "/apache-hates-%"

Suggestions on what I should do? I think this would break even further if I start trying out some UTF-8 tests. (Yes, still a long way until IRIs will be supported!) I'd hate to only allow ASCII in permalinks... although if I can't guarantee that it won't break I may have to do so. Either that or only allow titles that are the same when encoded and double-decoded.
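
The last idea, only allowing titles that come out unchanged after being encoded and double-decoded, could be tested along these lines. This is a Python sketch with a made-up function name. It works on the raw title rather than the hyphenated permalink, and Python's `quote`/`unquote` may not match PHP's or Apache's behaviour in every corner case.

```python
from urllib.parse import quote, unquote


def survives_double_decoding(title):
    """True if encoding once and decoding twice gives back the same title."""
    return unquote(unquote(quote(title))) == title


print(survives_double_decoding("44% of US Citizens favour more restrictions on Muslims."))
print(survives_double_decoding("Apache hates %25!"))
```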

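Update: D'oh! I'm a stupid-head. $_SERVER['REQUEST_URI'] still holds the request exactly as the browser sent it, so I can pull the permalink out of that and decode it once myself. Still, I'm leaving this here so others can learn from it :)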
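
The idea behind that fix, sketched in Python rather than PHP: take the undecoded request URI, drop the query string and the "/blog/" prefix, and percent-decode exactly once. The function name and the default prefix are assumptions. In PHP the input would come from $_SERVER['REQUEST_URI'].

```python
from urllib.parse import urlsplit, unquote


def permalink_from_request_uri(request_uri, prefix="/blog/"):
    """Extract the permalink from an undecoded request URI, decoding once.

    Returns None when the path does not start with the expected prefix.
    """
    path = urlsplit(request_uri).path  # drops the query string, does not decode
    if not path.startswith(prefix):
        return None
    return unquote(path[len(prefix):])


print(permalink_from_request_uri("/blog/apache-hates-%2525"))
```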

2004-12-20

Romanization test 2.

"イメージプレス" (from Standing Tall) transcribes to "imējipuresu", which I believe means "Imagepress".

Similarly, "カテゴリー" transcribes to "kategorī" ("categories"), "リンク" to "rinku" ("links"), and "アーカイブ" to "ākaibu" ("archives").
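
A toy version of the transcription, in Python, just to show the mechanics. The table covers only the kana that appear in these four words, and the long-vowel mark "ー" is handled by putting a macron on the preceding vowel. The names and the table are made up for illustration and are nowhere near a complete romanization scheme.

```python
# Only the katakana needed for the examples above.
KANA = {
    "ア": "a", "イ": "i", "カ": "ka", "ク": "ku", "ゴ": "go",
    "ジ": "ji", "ス": "su", "テ": "te", "プ": "pu", "ブ": "bu",
    "メ": "me", "リ": "ri", "レ": "re", "ン": "n",
}
MACRON = {"a": "ā", "i": "ī", "u": "ū", "e": "ē", "o": "ō"}


def romanize(text):
    """Transcribe katakana using the small table above.

    Characters missing from the table are passed through unchanged.
    """
    syllables = []
    for ch in text:
        if ch == "ー" and syllables and syllables[-1][-1:] in MACRON:
            # Long-vowel mark: lengthen the vowel of the previous syllable.
            last = syllables[-1]
            syllables[-1] = last[:-1] + MACRON[last[-1]]
        else:
            syllables.append(KANA.get(ch, ch))
    return "".join(syllables)


for word in ["イメージプレス", "カテゴリー", "リンク", "アーカイブ"]:
    print(word, romanize(word))
```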

Of course, the macrons would need to be stripped as well. I'm still looking for a reference for kanji characters. Pointers would be appreciated!
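
One way the macron stripping might be done, shown in Python: decompose the text so each macron becomes a separate combining character (U+0304), drop that character, and recompose. The function name is an assumption. The same approach should work in any language with Unicode normalization support.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text):
    """Remove macrons while leaving every other character untouched."""
    decomposed = unicodedata.normalize("NFD", text)
    without_macrons = decomposed.replace(COMBINING_MACRON, "")
    return unicodedata.normalize("NFC", without_macrons)


for word in ["imējipuresu", "kategorī", "ākaibu"]:
    print(strip_macrons(word))
```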