If you write software for the web that allows users to submit or share URLs (comment systems, mail clients, forums, URL shorteners, etc), you may find yourself in a position where you need to filter out malicious links.

Fortunately, there are several free options for you to better protect your systems and your users against bad guys, and they’re pretty simple to implement. (My examples are in PHP, but could easily be adapted to whatever language you prefer.)

Google SafeBrowsing

The best known is probably Google’s SafeBrowsing, the system that powers the Chrome and Firefox warnings when you try to click through to a site that’s been flagged as hosting phishing pages or malware, and that provides diagnostic pages like this. Using their REST API is free; you just need to sign up for an API key here.

They do currently limit lookups to 500 URLs per request and 10k requests per 24-hour period, but you can apply to have your threshold increased if you’re writing a high-traffic application. They don’t currently charge extra for this, but you do have to email antiphish-malware-cap-req@google.com to apply.

Here is a simple function to implement Google SafeBrowsing in your PHP script.

First, let’s start by setting a few config variables. You’ll want to put these into your config file, or somewhere that the variables will be accessible to the function:

$safebrowsing['api_key'] = "XXXXXXXXXXXXXX";
$safebrowsing['api_url'] = "https://sb-ssl.google.com/safebrowsing/api/lookup";

Here’s the function itself. As you can see, it’s a simple cURL request. (Tempting though it may be, don’t use file_get_contents() for this. It doesn’t work as expected.)

function checkSafeBrowsing($longUrl) {
    global $safebrowsing;

    // Build the lookup request URL
    $url = $safebrowsing['api_url']."?client=api&";
    $url .= "apikey=".$safebrowsing['api_key']."&appver=1.0&";
    $url .= "pver=3.0&url=".urlencode($longUrl);

    // Fetch the response body; curl_exec() returns false on failure
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

Then, to use this code, you’d simply call the function, passing it the URL you want to check:

// Check that a URL was passed, and sanitize it
$safetycheck = false;
if (isset($_POST['myUrl'])) {
    $longUrl = filter_var($_POST['myUrl'], FILTER_SANITIZE_URL);
    $safetycheck = checkSafeBrowsing($longUrl);
}

// Only these response bodies mean the URL was flagged
$flagged = array('phishing', 'malware', 'phishing,malware');

if (in_array(trim((string) $safetycheck), $flagged, true)) {
    // Do something here if it fails, for example
    // redirect the user to an error page

} else {
    // Do something here if it passes, for example
    // inserting it into a database
}

If you want to drill down into what kind of bad link it is, Google’s response will contain one of the following statuses:

  • phishing
  • malware
  • phishing,malware
  • ok

You can use these responses in your code to provide additional information to your users or to program a different response based on the type of bad link it is (phishing vs malware).
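As a rough sketch of how you might branch on those statuses, here is a small helper that turns the raw response body into a list of threat types. The function name and return shape are my own illustration, and treating an empty body as "clean" is an assumption you should verify against Google's documentation.

```php
<?php
/**
 * Map a lookup response body to a list of threat types.
 * Hypothetical helper: the name and return shape are illustrative only.
 *
 * @param string|false $body Raw response body (false if the request failed)
 * @return array|null List of threats (empty if clean), or null if unknown
 */
function classifySafeBrowsingResponse($body) {
    if ($body === false) {
        return null; // the request failed, so we know nothing
    }
    $body = strtolower(trim($body));
    if ($body === '' || $body === 'ok') {
        return array(); // assumption: an empty body means no match
    }
    $known = array('phishing', 'malware');
    $threats = array_values(array_intersect(explode(',', $body), $known));
    // Anything we don't recognize is reported as unknown rather than clean
    return count($threats) > 0 ? $threats : null;
}
```

You could then show a different message for `array('phishing')` than for `array('malware')`, and handle `null` as a lookup failure.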

To test your system, Google has provided http://ianfette.org as a test domain that will always return positive.

Google does have some usage restrictions and requirements in your user-visible messaging, which you should check out, but all of their requirements are quite reasonable.

I typically provide a link to the diagnostics page on Google so that users have more information about the link they were trying to submit.

SURBLs

SURBLs are lists of web sites that have appeared in unsolicited messages, and given the nature of phishing and malware links and how they’re spread, this can often be a good way to sniff out bad links. SURBL.org has an easy method to query their databases via DNS, documented here with an extensive FAQ here.

/**
 * Check a URL against the 3 major blacklists
 *
 * @param string $url The URL to check
 * @return bool true if blacklisted, false if not blacklisted
 */
function ozh_is_blacklisted($url) {
    $parsed = parse($url);

    // Nothing to look up if the URL has no host
    if (empty($parsed['host'])) {
        return false;
    }

    // Remove www. from domain (but not from www.com)
    $host = trim(preg_replace('/^www\.(.+\.)/i', '$1', $parsed['host']));

    // The major blacklists
    $blacklists = array(
        'zen.spamhaus.org',
        'multi.surbl.org',
        'black.uribl.com'
    );

    // Check against each blacklist, exit if blacklisted
    foreach ($blacklists as $blacklist) {
        $domain = $host . '.' . $blacklist . '.';
        $record = dns_get_record($domain, DNS_A);
        if (is_array($record) && count($record) > 0) {
            return true;
        }
    }

    return false;
}

// Parse a URL, assuming http:// when no scheme is given
function parse($url) {
    if (strpos($url, "://") === false && substr($url, 0, 1) != "/") {
        $url = "http://" . $url;
    }
    $info = parse_url($url);
    return $info ? $info : array();
}

Using it works just like the Google SafeBrowsing function:

// Check that a URL was passed, and sanitize it
$ozh_blacklisted = false;
if (isset($_POST['myUrl'])) {
    $longUrl = filter_var($_POST['myUrl'], FILTER_SANITIZE_URL);
    $ozh_blacklisted = ozh_is_blacklisted($longUrl);
}

if ($ozh_blacklisted === true) {
    // Do something here if it fails, for example
    // redirect the user to an error page

} else {
    // Do something here if it passes, for example
    // inserting it into a database
}

To test your implementation, SURBL.org has provided surbl-org-permanent-test-point.com.multi.surbl.org as a test domain that should always return positive.

$ dig surbl-org-permanent-test-point.com.multi.surbl.org.

;; QUESTION SECTION:
;surbl-org-permanent-test-point.com.multi.surbl.org. IN A

;; ANSWER SECTION:
surbl-org-permanent-test-point.com.multi.surbl.org. 180	IN A 127.0.0.126
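If you want to look at the returned address instead of just counting records, something like the following sketch could work. Both helper names are hypothetical, and the rule that only 127.0.0.2 and above counts as a listing is my assumption; each list documents its own return codes, so check them before relying on this.

```php
<?php
/**
 * Build the DNS name to query for a host against a blacklist zone.
 * Hypothetical helper, not part of the function above.
 */
function blacklistQueryName($host, $zone) {
    $host = strtolower(rtrim(trim($host), '.'));
    return $host . '.' . $zone . '.';
}

/**
 * Decide whether a set of A records looks like a listing.
 * Assumption: listings come back as 127.0.0.x with x >= 2; other
 * addresses (including 127.0.0.1) are not treated as a listing.
 *
 * @param array $records Records in the shape returned by dns_get_record()
 * @return bool
 */
function isListedResponse(array $records) {
    foreach ($records as $record) {
        if (!isset($record['ip'])) {
            continue;
        }
        if (preg_match('/^127\.0\.0\.(\d{1,3})$/', $record['ip'], $m) && (int) $m[1] >= 2) {
            return true;
        }
    }
    return false;
}
```

A lookup would then be `isListedResponse(dns_get_record(blacklistQueryName($host, 'multi.surbl.org'), DNS_A) ?: array())`, with `$host` being whatever hostname you extracted from the URL.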

I have occasionally found some intermittent inconsistencies with the SURBL.org responses, and the DNS lookup required to use it can sometimes cause some latency in high-traffic sites, but it’s definitely worth looking into and trying out.

Other Options

Google SafeBrowsing and SURBLs aren’t your only options. You may also want to check out Phishtank.com, which allows you to download their database and do lookups locally. This can save overhead from API calls in your application, but also means your database can become out of date if you’re not paying attention.

If you intend to download the database programmatically, you just have to sign up for a developer’s key.
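A local lookup against a downloaded copy might look roughly like this. It is only a sketch: I'm assuming the download decodes to a list of entries that each carry a 'url' key, the function names are made up, and the normalization is deliberately naive.

```php
<?php
/**
 * Build a lookup set from a decoded copy of the downloaded database.
 * Assumption: each entry is an array with a 'url' key.
 */
function buildPhishIndex(array $entries) {
    $index = array();
    foreach ($entries as $entry) {
        if (isset($entry['url'])) {
            $index[normalizeUrlKey($entry['url'])] = true;
        }
    }
    return $index;
}

// Naive normalization: trim whitespace and any trailing slash
function normalizeUrlKey($url) {
    return rtrim(trim($url), '/');
}

// Exact-match lookup against the index built above
function isKnownPhish(array $index, $url) {
    return isset($index[normalizeUrlKey($url)]);
}
```

In practice you would feed it something like `json_decode(file_get_contents('/path/to/your/download.json'), true)` (the path is a placeholder) and rebuild the index every time you refresh the download.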

VirusTotal, a subsidiary of Google, is a free online service that analyzes files and URLs enabling the identification of viruses, worms, trojans and other kinds of malicious content detected by antivirus engines and website scanners.

VirusTotal has a public REST API (rate-limited to 4 requests per minute) which returns a JSON object.
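With a limit that tight, you may want to throttle on your side before you ever send a request. Here is a minimal sliding-window limiter as a sketch; the class is hypothetical (it is not part of any VirusTotal client), and it takes the current time as an argument so it is easy to test.

```php
<?php
/**
 * Minimal sliding-window limiter (hypothetical class for illustration).
 */
class SlidingWindowLimiter {
    private $limit;
    private $window;
    private $timestamps = array();

    public function __construct($limit, $windowSeconds) {
        $this->limit = $limit;
        $this->window = $windowSeconds;
    }

    /**
     * Returns true (and records the call) if a request may be sent at $now.
     */
    public function allow($now) {
        // Forget calls that have fallen out of the window
        $cutoff = $now - $this->window;
        $this->timestamps = array_values(array_filter(
            $this->timestamps,
            function ($t) use ($cutoff) { return $t > $cutoff; }
        ));
        if (count($this->timestamps) >= $this->limit) {
            return false;
        }
        $this->timestamps[] = $now;
        return true;
    }
}
```

For the limit above you would create it as `new SlidingWindowLimiter(4, 60)` and call `allow(time())` before each request. Note that this keeps its state in memory, so in a typical PHP web app you would need to store the timestamps somewhere shared between requests.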

There are additional APIs out there, but since each API call creates overhead, you should probably stick to the ones that are considered the most reliable, lest you waste resources on an API that isn’t really giving good results.

For this reason, I tend to stick to Google SafeBrowsing (supplemented sometimes with SURBLs), and it’s served me pretty well, especially since Google SafeBrowsing covers both malware and phishing, while others tend to specialize in one or the other.

Architecting around rate limits

Although most rate-limits are pretty generous for typical use-cases, it’s wise to consider this limitation while you’re architecting your application.

For example, if you’re writing a URL-shortener, you’ll want to prevent phishers and bad guys from abusing your system to obfuscate their bad links, which can result in complaints (or even blacklisting) from your hosting company. One decision you’ll need to make is whether to hit the APIs when the user is generating a short URL, only when someone tries to click through on it, or both.

If you check for malware as they’re creating the link, the benefit is a smaller database, since you prevent bad links from even being stored. On the other hand, if you’re not checking on the shortened link’s click-through, you won’t catch a destination that was compromised after the short link was created (for example, a WordPress site that was clean at the time, but has since been hacked and is now serving malware, with or without the site owner’s knowledge).

Additionally, if your application is a contact form, discussion forum, helpdesk system, etc., you have some UX considerations to think about. What if your user doesn’t realize the link is being detected as malware? What if they’re posting to *report* something as malware? You’ll need to consider what happens in their in-app experience if a URL in their content gets flagged. Do you present the form again with their content and a warning? Do you let them post it anyway but remove the URL? Consider the purpose of the app and the user’s overall experience.

If you only check on click-through, you’ll end up with lots of potentially bad records in the database, but the people clicking through on the links will be protected.

If you check on both creation and click-through, your database will be smaller and your users will be safer, but you may run out of API calls. This may or may not be an issue when you first start off, but if your application gains popularity, it may mean an emergency refactor down the line.
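One way to stretch your API quota when checking on both ends is to remember recent verdicts for a while. The sketch below wraps any checker function in a simple in-memory cache; the helper name, the cache layout and the idea of passing the current time in are all my own assumptions, and a real application would keep the cache in a database or shared cache.

```php
<?php
/**
 * Wrap a URL checker with a time-limited cache (hypothetical helper).
 *
 * @param callable $checker function($url), e.g. 'checkSafeBrowsing'
 * @param int      $ttl     Seconds a stored verdict stays valid
 * @return callable function($url, $now) returning the checker's verdict
 */
function makeCachedChecker($checker, $ttl) {
    $cache = array();
    return function ($url, $now) use (&$cache, $checker, $ttl) {
        // Reuse the stored verdict while it is still fresh
        if (isset($cache[$url]) && $now - $cache[$url]['time'] < $ttl) {
            return $cache[$url]['verdict'];
        }
        // Otherwise ask the real checker and remember the answer
        $verdict = call_user_func($checker, $url);
        $cache[$url] = array('time' => $now, 'verdict' => $verdict);
        return $verdict;
    };
}
```

A short lifetime keeps click-through checks reasonably fresh while still saving calls on popular links. You would probably also want to avoid caching failed lookups, which this sketch does not do.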

Consider the use-cases for your application, your available resources and the additional impact a larger database will have.

Also make sure that your fail condition on the APIs will handle a lookup failure well. If you end up hitting your rate limits (or if the API is just unavailable for some reason), your application should handle it gracefully, without timing out, and without puking out a bunch of error messages. Depending on what your app does, you may want to allow the user to continue if the API call fails, queue the request, ask them to try again later, etc. Any third-party dependency should be carefully considered, with failure conditions and risks thought through up-front.
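To make the failure case explicit, you could funnel every lookup through one function that decides what to do when the check itself breaks. This is only a sketch with made-up names: I'm assuming a checker that returns true for a flagged URL, false for a clean one, and either null or an exception when the lookup could not be completed.

```php
<?php
/**
 * Decide what to do with a submitted URL, including when the lookup fails.
 * Hypothetical helper; the action names are illustrative.
 *
 * @param callable $checker   Returns true (flagged), false (clean) or null (unknown)
 * @param string   $url       The URL to check
 * @param string   $onFailure Action when the lookup fails, e.g. 'queue' or 'accept'
 * @return string 'reject', 'accept' or the $onFailure action
 */
function decideAction($checker, $url, $onFailure = 'queue') {
    try {
        $flagged = call_user_func($checker, $url);
    } catch (Exception $e) {
        // API down, rate limit hit, timeout, etc.
        return $onFailure;
    }
    if ($flagged === true) {
        return 'reject';
    }
    if ($flagged === false) {
        return 'accept';
    }
    // Anything else means we could not get a clear answer
    return $onFailure;
}
```

Which fallback is right depends on your app: a URL shortener might queue the link for a later re-check, while a comment form might accept the post and hold it for moderation.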

As always, be sure to actually read the documentation on these services to be aware of their implementation guidelines, limitations and additional requirements. You don’t want to end up blocked because you didn’t read the documentation and didn’t fulfill all of their (very reasonable) requirements.

Image credit: Security Generation

snipe


I’m a tech geek/dev/infosec-nerd/scuba diver/blacksmith/sword-fighter/crime fighter/ENTP/warcrafter/activist. I'm the CTO at Mass Mosaic and the CEO of Grokability, Inc. in San Diego, CA. Tweet at me @snipeyhead or read more...

  • Martin

    Thank you for this post. It help me to understand how it works. But when I run script (below) I always get “Notice: Undefined variable: safetycheck in C:xampphtdocstesttestsafe.php on line 28”. I try to print $data variable to check what’s wrong and possible here is a problem. It is empty. I have correct api key. Please, help me.

  • Martin

    Script can not be attached.

    • Please use something like pastebin.com or Github Gists to show me the code.

  • Chris

    Is there any obvious reason why I would constantly receive
    a response code of 403 when running my curl function?

    • Nothing leaps to mind, unless they’ve changed something in the past year since this was written. Can I see your code? (Use Pastebin, don’t try to paste code in the comments here.)

  • ngg590

    the api key for google whats mean ? for server or browser ?