Bayesian scavenging.

Spent the afternoon coding the algorithms from SpamBayes into PHP. Hurrah for open-source!


Spam prevention process

A quick outline of the process I'll be using to prevent comment spam.

  1. Hidden-field hashing, as described in "Comment spam prevention". This also forces a preview, which may fool spambots, and lets you check the comment (for things such as invalid HTML or too many links) without having to run all of the following steps on the preview.
  2. Check the IP address against Spamhaus, DSBL, and any other RBLs that the user specifies. If one matches, block the IP for a short period and reject the comment.
  3. Find URIs.
  4. Check URIs against a blacklist, such as MT-Blacklist, or a personal blacklist such as Simon Willison's blacklist. If one matches, block the IP for a short period and reject the comment.
  5. Check URIs against SURBL. If one matches, block the IP for a short period and reject the comment.
  6. Run the comment through a Bayesian filter. If the match to other spam comments is high, block the IP for a short period and reject the comment. If the match is unsure, move the comment to the moderation queue.
  7. Optionally, for the paranoid:
    1. Follow links in the post (following all redirects) and check all the links on the resultant page.
    2. Force all comments to join the moderation queue, unless the user accepts an email verification or is authenticated through other means (TypeKey, site-specific registration).

If the comment passes all these tests it is probably not spam. Any transformations can be performed and the comment stored in the database.
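The gist of that pipeline can be sketched in PHP. This is a runnable toy, not the real thing: the live RBL/SURBL lookups (steps 2, 4 and 5) are reduced to a local blacklist array because they need network access, and the Bayesian score (step 6) is faked with a crude link-count heuristic. All function names here are illustrative.

```php
<?php
// Toy sketch of the spam-filtering pipeline; names are illustrative.

// Step 3: find URIs in the comment body.
function find_uris($text) {
	preg_match_all('~https?://[^\s"<>]+~i', $text, $m);
	return $m[0];
}

// Steps 4/5 stand-in: substring match against a local blacklist
// (a real implementation would query MT-Blacklist lists and SURBL).
function uri_blacklisted($uri, $blacklist) {
	foreach ($blacklist as $bad) {
		if (strpos($uri, $bad) !== false) return true;
	}
	return false;
}

function filter_comment($body, $blacklist) {
	foreach (find_uris($body) as $uri) {
		if (uri_blacklisted($uri, $blacklist)) return 'reject';
	}
	// Step 6 stand-in: a real Bayesian score would go here; crudely,
	// lots of links is suspicious enough to warrant moderation.
	if (count(find_uris($body)) > 3) return 'moderate';
	return 'accept';
}
```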


Apache's mod_rewrite and URI encoding...

For the rule "RewriteRule ^blog(.*)$ escrib/Escrib.php?arg=$1":

  • http://localhost/escrib/Escrib.php?arg=%252525 → arg is %2525 (correct)
  • http://localhost/blog/%252525 → arg is %25 (incorrect)

Argh. And according to the mailing list, there is no way around this.

So, what happens when a user wants to use a percentage sign in their titles?

  1. Title: "44% of US Citizens favour more restrictions on Muslims."
  2. Permalink: "/blog/44%25-of-us-citizens-favour-more-restrictions-on-muslims"
  3. The variable PHP ends up with after mod_rewrite: "/44%-of-us-citizens-favour-more-restrictions-on-muslims"

Phew. But:

  1. Title: "Apache hates %25!"
  2. Permalink: "/blog/apache-hates-%2525"
  3. The variable PHP ends up with after mod_rewrite: "/apache-hates-%"

Suggestions on what I should do? I think this would break even further once I start trying UTF-8 tests. (Yes, IRI support is still a long way off!) I'd hate to allow only ASCII in permalinks... although if I can't guarantee that they won't break, I may have to. Either that, or only allow titles that are the same when encoded and double-decoded.

Update: D'oh! I'm a stupid-head. $_SERVER['REQUEST_URI']. Still, I'm leaving this here so others can learn from it :)
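To spell out the fix: $_SERVER['REQUEST_URI'] holds the raw, still-encoded request path, untouched by mod_rewrite's decoding, so you can ignore the rewritten argument and decode exactly once yourself. A sketch (the function name and the "/blog" prefix are just this example's assumptions):

```php
<?php
// $_SERVER['REQUEST_URI'] contains the raw path as sent by the client,
// e.g. "/blog/apache-hates-%2525?foo=bar" - mod_rewrite never touches it.
function slug_from_request($request_uri) {
	$path = parse_url($request_uri, PHP_URL_PATH); // drop any query string
	$arg  = substr($path, strlen('/blog'));        // strip the known prefix
	return rawurldecode($arg);                     // decode exactly once
}
```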


Romanization test 2.

"イメージプレス" (from Standing Tall) transcribes to "imējipuresu", which I believe means "Imagepress".

Similarly, "カテゴリー" transcribes to "kategorī" ("categories"), "リンク" to "rinku" ("links"), and "アーカイブ" to "ākaibu" ("archives").

Of course, the macrons would need to be stripped as well. I'm still looking for a reference for kanji characters. Pointers would be appreciated!
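Stripping the macrons is a simple substitution (assuming UTF-8 strings throughout; the function name is just for illustration):

```php
<?php
// Replace the macron vowels produced by Hepburn romanization with their
// plain ASCII equivalents. Input is assumed to be UTF-8.
function strip_macrons($str) {
	return str_replace(
		array('ā','ī','ū','ē','ō','Ā','Ī','Ū','Ē','Ō'),
		array('a','i','u','e','o','A','I','U','E','O'),
		$str);
}
```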


Comment spam prevention.

Here's a (relatively) simple, cookie-less technique I've thought up to help prevent comment-spamming. I'll use it in co-operation with other, already-existing techniques.

The comment form has a hidden field "hashnodata" which the server sets to SHA1(ENTRY_ID + USERS_IP + XSTRING + IP_NUMBER_OF_COMMENTS_ON_ENTRY). ENTRY_ID is the ID of the entry in question. USERS_IP is the page viewer's IP address. XSTRING is a unique string that is randomly generated for each installation of the blog software. IP_NUMBER_OF_COMMENTS_ON_ENTRY is the number of comments that the IP address has posted on the entry in question.

When the software receives a POST to its comment script it computes the applicable hash for the POST data and checks it against the comment form "hashnodata" field.

If it matches, show them the preview page, which contains a hidden field of "hash" which is SHA1(ENTRY_ID + USERS_IP + XSTRING + IP_NUMBER_OF_COMMENTS_ON_ENTRY + POST_DATA), and a hidden field of "hashnodata" which is the same as that computed for the comment form.

If it matches, also check that the "hash" value exists and is valid. If it is valid, the comment is valid: the preview page above doesn't need to be shown and the comment can continue down the validation path. If it exists but is not valid, the user has changed some information on the preview page; re-show the preview page as in the step above.

If "hashnodata" doesn't match, the user has submitted bogus data, and you can safely ban the IP for a short period. The only way this would fail is if the user leaves a window open on the entry page, submits a comment using another window, and then tries to submit a comment later using the first window (IP_NUMBER_OF_COMMENTS_ON_ENTRY will have changed, so the hash will fail). I see this as a highly unlikely occurrence.

Comments on any weaknesses/problems that this method has are welcome, and wanted!

Update: Comment field names should also be obscured by hashing their names with the XSTRING value, so that bots can't automate easily; they would need to read the labels in order to find out where to put stuff. Randomising the order of the fields will give another layer of protection as well.
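The hash values themselves are just concatenate-and-digest. A sketch of the scheme above (function names are mine, not part of any API):

```php
<?php
// Sketch of the "hashnodata" / "hash" values described above.
function hash_no_data($entry_id, $ip, $xstring, $num_comments) {
	return sha1($entry_id . $ip . $xstring . $num_comments);
}

function hash_with_data($entry_id, $ip, $xstring, $num_comments, $post_data) {
	return sha1($entry_id . $ip . $xstring . $num_comments . $post_data);
}

// Per the update: field names hashed with XSTRING so bots can't rely
// on well-known names. The 'f' prefix keeps it a valid form-field name.
function field_name($name, $xstring) {
	return 'f' . sha1($name . $xstring);
}
```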

Transcription example.

As an example of the language-dependent transcription (or transliteration) I talked about a few posts ago... a PHP function to change Unicode Russian text into ASCII. (Following the directions on the Wikipedia page.)

function Russ2Asc ($str) {
	$str = str_replace(' ','-',$str);

	$str = preg_replace('/(?<!Б|б|В|в|Г|г|Д|д|Ж|ж|З|з|Й|й|К|к|Л|л|М|м|Н|н|П|п|Р|р|С|с|Т|т|Ф|ф|Х|х|Ц|ц|Ч|ч|Ш|ш|Щ|щ)(Е|е)/u','ye',$str);
	$str = preg_replace('/(ъ|ь)(?=(А|а|О|о|У|у|Ы|ы|Э|э|Я|я|Ё|ё|Ю|ю|И|и))/u','y',$str);
	$str = preg_replace('/(И|и|Ы|ы)(Й|й)(?=\.|,|-|;|:|!|\?|\Z)/u','y',$str);
	$str = str_replace(array('А','а'),'a',$str);
	$str = str_replace(array('Б','б'),'b',$str);
	$str = str_replace(array('В','в'),'v',$str);
	$str = str_replace(array('Г','г'),'g',$str);
	$str = str_replace(array('Д','д'),'d',$str);
	$str = str_replace(array('Е','е'),'e',$str);
	$str = str_replace(array('Ё','ё'),'yo',$str);
	$str = str_replace(array('Ж','ж'),'zh',$str);
	$str = str_replace(array('З','з'),'z',$str);
	$str = str_replace(array('И','и'),'i',$str);
	$str = str_replace(array('Й','й'),'y',$str);
	$str = str_replace(array('К','к'),'k',$str);
	$str = str_replace(array('Л','л'),'l',$str);
	$str = str_replace(array('Э','э'),'e',$str);
	$str = str_replace(array('Ю','ю'),'yu',$str);
	$str = str_replace(array('Я','я'),'ya',$str);
	$str = str_replace(array('М','м'),'m',$str);
	$str = str_replace(array('Н','н'),'n',$str);
	$str = str_replace(array('О','о'),'o',$str);
	$str = str_replace(array('П','п'),'p',$str);
	$str = str_replace(array('Р','р'),'r',$str);
	$str = str_replace(array('С','с'),'s',$str);
	$str = str_replace(array('Т','т'),'t',$str);
	$str = str_replace(array('У','у'),'u',$str);
	$str = str_replace(array('Ф','ф'),'f',$str);
	$str = str_replace(array('Х','х'),'kh',$str);
	$str = str_replace(array('Ц','ц'),'ts',$str);
	$str = str_replace(array('Ч','ч'),'ch',$str);
	$str = str_replace(array('Ш','ш'),'sh',$str);
	$str = str_replace(array('Щ','щ'),'shch',$str);
	$str = str_replace(array('ъ','ь'),'',$str);
	$str = str_replace(array('Ы','ы'),'y',$str);
	return $str;
}

Early testing shows that it correctly transliterates "Союз Советских Социалистических Республик" into a nice, URL-friendly "soyuz-sovetskikh-sotsialisticheskikh-respublik". I've got a similar function working (or at least as far as I can tell) for Japanese katakana/hiragana, based on the Hepburn system.

Update: I've found a very good resource for transliteration tables. However, as this is not a critical feature, I'll be delaying it until later in development.


URL structure

  • Post permalink: http://example.com/post/slug-goes-here
  • Last post: http://example.com/last
  • Latest posts (default front page): http://example.com/latest
  • Latest X posts: http://example.com/latest/X
  • First post: http://example.com/first
  • Categories: http://example.com/category/cat/sub-cat/sub-sub-cat
  • Notify script (both pingback + trackback): http://example.com/notify/slug-goes-here
  • Comment script: http://example.com/comment
  • Previous/Next: http://example.com/post/slug-goes-here/(prev|next)
  • Specific version of a post: http://example.com/post/slug-goes-here.en.html@2004-12-13T12:15:24Z

Is the /post/ in the first redundant? It seems to make more sense to have it there...

Update #1: added comment, notify & prev/next.
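A hypothetical dispatcher for the scheme above (handler names are made up for illustration):

```php
<?php
// Map the first path segment of the URL scheme above to a handler name
// plus its arguments. Handler names here are illustrative only.
function route($path) {
	$parts = explode('/', trim($path, '/'));
	switch ($parts[0]) {
		case 'post':     return array('view_post', $parts[1]);
		case 'last':     return array('view_last');
		case 'first':    return array('view_first');
		case 'latest':   return array('view_latest', isset($parts[1]) ? (int)$parts[1] : 10);
		case 'category': return array('view_category', array_slice($parts, 1));
		case 'notify':   return array('handle_ping', $parts[1]);
		case 'comment':  return array('handle_comment');
		default:         return array('not_found');
	}
}
```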



This post was prompted by the horrific permalink for my previous post... "http://escrib.blogspot.com/2004/12/itrntinliztin.html"!

As Charl has said, mixing memorable slugs with hard-to-remember dates doesn't make any sense.

Why have something that is easy to remember only to bugger it up with something like 2004/08?

Slug-only permalinks are the way I'm going. Slugs shall be generated by the following rules:

  1. Formatting (such as HTML tags) is stripped from the title.
  2. Spaces are converted to ASCII hyphen-minus characters.
  3. Optionally, some form of language-dependent transcription takes place to put characters into the ASCII range for prettier URLs.
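The three rules above as code (a minimal sketch; the transcription step is represented by an optional callback, since it's language-dependent, and I've added the lowercasing implied by the earlier permalink examples):

```php
<?php
// Slug generation per the rules above. $transcribe is an optional
// callable for the language-dependent transcription step.
function make_slug($title, $transcribe = null) {
	$slug = strip_tags($title);            // 1. strip formatting (HTML tags)
	$slug = str_replace(' ', '-', $slug);  // 2. spaces -> hyphen-minus
	if ($transcribe !== null) {
		$slug = $transcribe($slug);        // 3. optional transcription
	}
	return strtolower($slug);
}
```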



Well, I have successfully coaxed PHP into all kinds of character-set trickery.

  • TrackBacks and PingBacks (along with any other HTML page) can be converted into UTF-8 NFC.
  • Request URLs are normalized to UTF-8 NFC. (To avoid this problem.)
  • User input is converted to and stored as UTF-8 NFC.
  • Output is converted to the browser's declared preference, in both content-type and character-set: detection works as described in "Respecting q-values during content negotiation", and the conversion itself uses XSL transformations.
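The conversion-to-UTF-8 part of this is a one-liner with the mbstring extension (assumed available here); full NFC normalization needs the intl extension's Normalizer class on top of this, which I won't show:

```php
<?php
// Coerce incoming text (e.g. a TrackBack payload) from its declared
// charset into UTF-8. Requires the mbstring extension.
function to_utf8($str, $from_charset) {
	return mb_convert_encoding($str, 'UTF-8', $from_charset);
}
```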

The only bad thing I can see coming out of this is:

"Why aren't you using Unicode?! Everything must be Unicode-compliant!"
"Well I am... if your browser requests it!"
;) (As I have said before... it surprised me when I realised that Firefox was preferring ISO-8859-1 over UTF-8!)

World Domination Step One: Complete.

Top ranking on Google for the query "escrib". Thank you, obscure blog name!

Validating ISO 8601 input.

It ensures that everything is entered within valid ranges... just a shame I couldn't get it to check leap years as well ;)
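The original snippet isn't shown here, but a range-checking pattern in the same spirit would look like this (my reconstruction, validating YYYY-MM-DDTHH:MM:SSZ only; as noted, it can't catch month lengths or leap years):

```php
<?php
// Validate a basic ISO 8601 date-time (YYYY-MM-DDTHH:MM:SSZ) with
// field ranges: months 01-12, days 01-31, hours 00-23, mins/secs 00-59.
// Deliberately does NOT know about month lengths or leap years.
function valid_iso8601($str) {
	$re = '/^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])'
	    . 'T([01]\d|2[0-3]):[0-5]\d:[0-5]\dZ$/';
	return (bool) preg_match($re, $str);
}
```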


Respecting q-values during content negotiation.

Just wrote this in the course of playing around and thought some others might find it useful. It returns the UA's preferred option for any of the HTTP headers which potentially carry q-values. Supply it with the HTTP header you want evaluated, a list of the things your program can serve in the order it would like to serve them (in lowercase, as in_array is case-sensitive), and whether or not failing to match one should cause the request to fail.

function getPreference($string,$pref_order,$fatal=true) {
	$acceptArray = explode(',',strtolower(str_replace(' ','',$_SERVER[$string])));
	$use = array();
	foreach($acceptArray as $key) {
		$newkey = explode(';',$key);
		// A missing q-value means a quality of 1.0
		$newkey[1] = (count($newkey) < 2) ? '1.0' : substr($newkey[1],2);
		$use[$newkey[1]][] = $newkey[0];
	}
	krsort($use); // try the highest q-value group first
	foreach($use as $qval) {
		foreach ($pref_order as $pref) {
			if (strpos($pref,'/')) {
				if (in_array($pref,$qval) || in_array('*/*',$qval) || in_array(substr($pref,0,strpos($pref,'/')).'/*',$qval))
					return $pref;
			} else {
				if (in_array($pref,$qval) || in_array('*',$qval))
					return $pref;
			}
		}
	}
	if($fatal) {
		header('HTTP/1.1 406 Not Acceptable');
		exit;
	}
	return false;
}

header('Content-Type: text/plain');
echo getPreference('HTTP_ACCEPT',array('application/xhtml+xml','text/html','text/vnd.wap.wml','text/plain','application/atom+xml','text/xml'))."\n";
echo getPreference('HTTP_ACCEPT_CHARSET',array('utf-8','utf-16','iso-8859-1'),false)."\n";
echo getPreference('HTTP_ACCEPT_LANGUAGE',array('en'),false)."\n";

An interesting thing I happened upon while doing this: my browser's preferred charset was ISO-8859-1 and not UTF-8. Oops :)

TrackBack oddity

Why does the TrackBack specification say to check the dc:identifier attribute of the Description element when the rdf:about attribute already specifies the resource that the description is about? This duplication of information is a little puzzling.

Perhaps it has something to do with this note, which describes a change from an earlier version of the document and seems to suggest that the rdf:about attribute was originally misused in a twisted reversal of the RDF triple:

"In the RDF, the TrackBack Ping URL should now be stored in the trackback:ping element, rather than rdf:about."

Further investigation via archive.org's Wayback Machine confirms that this is true.

If the rdf:about attribute had been used correctly in the first place, it would make more sense; the Dublin Core attributes would not be required, and we'd end up with something like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:tb="http://madskills.com/public/xml/rss/module/trackback/">
  <rdf:Description rdf:about="http://example.com/2004-12-25T12:14:13Z"
                   tb:ping="http://example.com/trackback/2004-12-25T12:14:13Z"/>
</rdf:RDF>

However, I'm hesitant to drop the DC attributes as I'm not sure how many blog systems actually follow the recommendations of the spec and check dc:identifier over rdf:about.

Of course, a simple <link rel="trackback" href="http://example.com/trackback/2004-12-25T12:14:13Z"/> would have been easier. ;)
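For what it's worth, discovering the ping URL from the embedded RDF doesn't need a real RDF parser; as far as I can tell, most clients just pattern-match it out of the page source, roughly like this (function name is mine):

```php
<?php
// Pull the TrackBack ping URL out of the RDF embedded in a page.
// Matches either the full "trackback:" prefix or a short "tb:" one.
function find_trackback_ping($html) {
	if (preg_match('/(?:trackback|tb):ping="([^"]+)"/', $html, $m)) {
		return $m[1];
	}
	return false;
}
```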