Regex for Autolinking URLs

For a recent project I wanted to perform an exceptionally common task: converting things that look like URLs in a piece of text into clickable links. However, wherever I’ve seen this implemented before I’ve always run into the same annoying problem: the links break when the user types a URL and adds some punctuation at the end, because the punctuation gets captured as part of the link.

For example:

If you like the Seattle Mariners and enjoy learning about baseball statistics you should really visit http://ussmariner.com.

It struck me that this should be solvable with lookahead: we can check whether a punctuation character is followed directly by a character that could be part of the URL, and only treat the punctuation as part of the link if it is.
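As a minimal illustration of the idea (not the full solution), the snippet below uses a lookahead to find only those full stops that are directly followed by a letter, digit or underscore. The sample string and variable names are made up for the demonstration.

```php
<?php
// Illustrative sample text; only the first full stop sits inside a URL-like token.
$sample = 'Visit ussmariner.com. Thanks.';

// \.            a literal full stop
// (?=[_a-z0-9]) lookahead: the next character must be a letter, digit or
//               underscore, but it is not consumed by the match
$count = preg_match_all('/\.(?=[_a-z0-9])/i', $sample, $matches);

echo $count, "\n";                    // prints 1
echo implode('', $matches[0]), "\n";  // prints .
```

Only the full stop between "ussmariner" and "com" matches. The two that end sentences are followed by a space or by nothing, so the lookahead rejects them. The matched text is just the full stop, because a lookahead inspects the next character without consuming it.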

The following regex attempts to do this. It will allow . , ! ? ; and : to appear at the end of a URL without including them in the match.

$str = preg_replace('%(https?://(([^ .,!?;:"\'()\r\n\t])|((\.|,|!|\?|;|:)(?=[_a-z0-9])))+)%i', '<a href="\1">\1</a>', $str);

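It works using an alternation that is repeated for each character after the http:// or https:// prefix. The first side of the alternation matches any character that isn’t in our punctuation list (and isn’t one of certain other characters, such as quote marks, parentheses and whitespace, that we simply don’t want to allow in a URL). The other side matches a character from the punctuation list and uses lookahead to ensure that it’s followed directly by a letter, digit or underscore, so punctuation at the very end of the URL is left out of the match.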
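To see the two halves of the alternation at work, here is a small sketch that wraps the pattern in a function and runs it on the example sentence from earlier. The function name autolink is hypothetical and not part of the original code; the pattern is written with the escaping that a single-quoted PHP string needs.

```php
<?php
// Hypothetical helper (the name is an assumption) wrapping the pattern.
function autolink(string $text): string
{
    // First alternative:  any character that is not punctuation, a quote,
    //                     a parenthesis or whitespace.
    // Second alternative: one of . , ! ? ; : but only when the next
    //                     character is a letter, digit or underscore.
    $pattern = '%(https?://(([^ .,!?;:"\'()\r\n\t])|((\.|,|!|\?|;|:)(?=[_a-z0-9])))+)%i';

    return preg_replace($pattern, '<a href="\1">\1</a>', $text);
}

echo autolink('you should really visit http://ussmariner.com.'), "\n";
// prints: you should really visit <a href="http://ussmariner.com">http://ussmariner.com</a>.

echo autolink('See http://example.com/a?b=1, then come back.'), "\n";
// prints: See <a href="http://example.com/a?b=1">http://example.com/a?b=1</a>, then come back.
```

In both cases the punctuation inside the URL is kept and the trailing full stop or comma stays outside the link. One consequence of the lookahead as written is that punctuation followed by anything other than a letter, digit or underscore also ends the link, so a URL such as http://example.com/a.-b would be linked only up to the "a".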

I’m sure there’s room for improvement here, but I’ve been pretty pleased with how this has worked so far.