Getting parts of a URL (Regex)


Given the URL (single line):

How can I extract the following parts using regular expressions:

  1. The Subdomain (test)
  2. The Domain (
  3. The path without the file (/dir/subdir/)
  4. The file (file.html)
  5. The path with the file (/dir/subdir/file.html)
  6. The URL without the path (
  7. (add any other that you think would be useful)

The regex should work correctly even if I enter the following URL:

This question is tagged with regex language-agnostic url

~ Asked on 2008-08-26 11:01:37

27 Answers


A single regex to parse and break up a full URL, including query parameters and anchors, e.g.


RegExp positions:

url: RegExp['$&'],







You could then further parse the host ('.' delimited) quite easily.

What I would do is use something like this:

proto $1
host $2
port $3
the-rest $4

Then further parse 'the-rest' to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
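
The two-stage idea can be sketched in Python roughly as follows. Both patterns and the name split_url are illustrative assumptions, not the regex from this answer: stage one does a loose split into proto, host, port and the rest, and stage two breaks the rest into path, query and fragment. The sketch does not handle user:password@ credentials.

```python
import re

# Stage 1: loose split into proto, host, port and "the rest".
# (Illustrative pattern; an assumption, not the answer's original regex.)
STAGE_ONE = re.compile(
    r"^(?:([A-Za-z][A-Za-z0-9+.-]*)://)?([^/:?#]+)(?::(\d+))?(.*)$"
)

# Stage 2: split "the rest" into path, query and fragment.
STAGE_TWO = re.compile(r"^([^?#]*)(?:\?([^#]*))?(?:#(.*))?$")


def split_url(url):
    """Return a dict of URL parts, or None if stage one does not match."""
    first = STAGE_ONE.match(url)
    if first is None:
        return None
    proto, host, port, rest = first.groups()
    path, query, fragment = STAGE_TWO.match(rest).groups()
    # Everything up to and including the last '/' is the directory part.
    directory, sep, filename = path.rpartition("/")
    return {
        "proto": proto,
        "host": host,
        "host_labels": host.split("."),  # the '.'-delimited host pieces
        "port": port,
        "path": path,
        "directory": directory + sep,
        "file": filename,
        "query": query,
        "fragment": fragment,
    }
```

Calling split_url on a made-up URL such as 'http://test.example.com:8080/dir/subdir/file.html?x=1#top' gives separate entries for the directory ('/dir/subdir/') and the file ('file.html'), which is the kind of detail that is awkward to get from a single pattern.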

~ Answered on 2008-08-26 11:06:09


I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:

var a = document.createElement('a');
a.href = '';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

protocol: http:
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo

~ Answered on 2012-09-18 04:10:33


I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

results in the following subexpression matches:

$1 = http:
$2 = http
$3 = //
$4 =
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
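
To check the subexpression numbering, here is a small Python sketch that applies the regular expression from RFC 3986, Appendix B, and names the groups the way the RFC does (scheme = $2, authority = $4, path = $5, query = $7, fragment = $9). The function name parse_uri and the host www.example.com in the usage note are assumptions made for illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def parse_uri(uri):
    """Split a URI reference into its five RFC 3986 components."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),     # None when there is no '?'
        "fragment": m.group(9),  # None when there is no '#'
    }
```

For example, parse_uri('http://www.example.com/pub/ietf/uri/#Related') reports the path as '/pub/ietf/uri/', the fragment as 'Related', and the query as None, in line with the $5, $9 and $7 entries listed above.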

For what it's worth, I found that I had to escape the forward slashes in JavaScript:


~ Answered on 2014-11-05 20:22:50


I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

  1. It cannot handle a port number.
  2. The hash part is broken.

The following is a modified version:


The positions of the parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by anon user:

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}
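
As a rough cross-check of the positions listed above, the same pattern (as it appears in getFileName) can be run in Python with the group numbers given names. The dictionary POSITIONS, the function url_parts and the sample URL in the note are assumptions for illustration only.

```python
import re

# Pattern copied from getFileName above, split over two lines for length.
URL_RE = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)"
    r"([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$",
    re.IGNORECASE,
)

# Group numbers as listed in the answer (SCHEMA = 2, DOMAIN = 3, ...).
POSITIONS = {
    "schema": 2,
    "domain": 3,
    "port": 5,
    "path": 6,
    "file": 8,
    "querystring": 9,  # this group includes the leading '?'
    "hash": 12,        # this group excludes the leading '#'
}


def url_parts(url):
    """Return the named parts, or None if the pattern does not match."""
    m = URL_RE.match(url)
    if m is None:
        return None
    return {name: m.group(index) for name, index in POSITIONS.items()}
```

With a made-up URL like 'http://www.example.com:8080/dir/subdir/file.html?x=1#top', the path comes back as '/dir/subdir/' and the file as 'file.html'. The pattern requires a path and a file, so callers should be ready for a None result.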

~ Answered on 2008-11-21 16:28:57


I was trying to solve this in JavaScript, which should be handled by:

var url = new URL('http://a:[email protected]:890/path/[email protected]/foo.js?foo=bar&bingobang=&[email protected]#foobar/bing/[email protected]?bang');

since (in Chrome, at least) it parses to:

{
  "hash": "#foobar/bing/[email protected]?bang",
  "search": "?foo=bar&bingobang=&[email protected]",
  "pathname": "/path/[email protected]/foo.js",
  "port": "890",
  "hostname": "",
  "host": "",
  "password": "b",
  "username": "a",
  "protocol": "http:",
  "origin": "",
  "href": "http://a:[email protected]:890/path/[email protected]/foo.js?foo=bar&bingobang=&[email protected]#foobar/bing/[email protected]?bang"
}

However, this isn't cross-browser, so I cobbled this together to pull the same parts out as above:


Credit for this regex goes to the poster of the jsperf where I originally found it, and to whoever came up with the regex it was originally based on.

The parts are in this order:

var keys = [
    "href",                    // http://user:[email protected]:81/directory/file.ext?query=1#anchor
    "origin",                  // http://user:[email protected]:81
    "protocol",                // http:
    "username",                // user
    "password",                // pass
    "host",                    //
    "hostname",                //
    "port",                    // 81
    "pathname",                // /directory/file.ext
    "search",                  // ?query=1
    "hash"                     // #anchor
];

There is also a small library which wraps it and provides query params: (also available on bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.

~ Answered on 2014-07-02 09:16:47


I needed a regular expression to match all URLs and made this one:


It matches all URLs, with any protocol, even URLs like

ftp://user:[email protected]:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

A URL like

mailto://[email protected]

looks like this:

["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined] 

~ Answered on 2012-08-15 19:56:29


I propose a much more readable solution (in Python, but the approach applies to any regex engine with named groups):

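import re

def url_path_to_dict(path):
    # NOTE: only the '^' and user/password lines of the pattern survive in
    # this post. The other lines are an assumption, reconstructed from the
    # group names that appear in the output below.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$')
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None

    return d

def main():
    print(url_path_to_dict(''))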


{
'host': '',
'user': None,
'path': '/example/example/example.html',
'query': None,
'password': None,
'port': None,
'schema': 'http'
}

~ Answered on 2013-07-26 23:51:52


Subdomain and domain are difficult because the subdomain can have several parts, as can the top-level domain.

 the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)  
 the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$  
 the path with the file : http://[^/]+/(.*)  
 the URL without the path : (http://[^/]+/)  

(Markdown isn't very friendly to regexes)

~ Answered on 2008-08-26 11:17:28


This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   // [email protected] matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:[email protected], etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }

~ Answered on 2010-09-16 07:21:21


Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files etc.

I found it from a quick google search:

~ Answered on 2008-08-26 11:10:16



From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).

~ Answered on 2009-01-14 04:13:34


You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The only difficult task is breaking the host into subdomain, domain name and TLD.

There is no standard way to do this, and simple string parsing or a regex cannot produce the correct result. At first I used a regex function, but it could not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs: once the TLD of a URL is identified, the part to its left is the domain and whatever remains is the subdomain.

However, the list needs to be maintained, since new TLDs can appear. As far as I know, the public suffix list is the most up-to-date one, and you can use the domainname-parser tool from Google Code to parse the public suffix list and get the subdomain, domain and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.

This answer is also helpful: Get the subdomain from a URL
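
The list-based approach can be sketched in Python as below. The SUFFIXES set is a tiny hand-picked stand-in (an assumption for illustration) for the real public suffix list, which is much larger and also has wildcard and exception rules that this sketch ignores. The function name split_host is hypothetical as well.

```python
# Tiny illustrative subset; a real implementation would load the full
# public suffix list instead of hard-coding a few entries.
SUFFIXES = {"com", "org", "net", "uk", "co.uk"}


def split_host(host):
    """Split a host name into subdomain, domain and TLD using SUFFIXES.

    Returns None when no known suffix matches or when the host is itself
    a suffix.
    """
    labels = host.lower().split(".")
    # Try the longest candidate suffix first (i = 0 is the whole host).
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            if i == 0:
                return None  # nothing to the left of the suffix
            return {
                "subdomain": ".".join(labels[: i - 1]),
                "domain": labels[i - 1],
                "tld": candidate,
            }
    return None
```

With this subset, 'test.example.co.uk' splits into subdomain 'test', domain 'example' and TLD 'co.uk', which a plain "last label is the TLD" rule would get wrong.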


~ Answered on 2009-10-09 04:39:51


Here is one that is complete, and doesn't rely on any protocol.

function getServerURL(url) {
        var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
        console.log(m[1]) // Remove this
        return m[1];
}








~ Answered on 2012-12-27 16:17:33


None of the above worked for me. Here's what I ended up using:


~ Answered on 2013-01-17 18:12:50


I like the regex that was published in "JavaScript: The Good Parts". It's not too short and not too complex. This page on GitHub also has the JavaScript code that uses it, but it can be adapted for any language.

~ Answered on 2015-05-31 22:00:07


I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.

~ Answered on 2009-11-30 19:35:38


Java offers a URL class that will do this. Query URL Objects.

On a side note, PHP offers parse_url().
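
In the same spirit, Python's standard library has urllib.parse.urlsplit. The sketch below shows how the parts asked for in the question could be read from it. The wrapper name url_parts, the dictionary keys, and the sample URL in the note are assumptions for illustration.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Collect commonly needed URL parts using the standard library parser."""
    parts = urlsplit(url)
    # Everything up to and including the last '/' is the directory part.
    directory, sep, filename = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,  # an int, or None when absent
        "path": parts.path,
        "directory": directory + sep,
        "file": filename,
        "query": parts.query,
        "fragment": parts.fragment,
        "url_without_path": f"{parts.scheme}://{parts.netloc}",
    }
```

For a made-up URL such as 'http://test.example.com:8080/dir/subdir/file.html?x=1#top', this yields the directory '/dir/subdir/' and the file 'file.html'. Splitting the hostname into subdomain and domain is still left to the caller.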

~ Answered on 2008-08-26 11:55:04


I tried a few of these and they didn't cover my needs, especially the highest-voted one, which didn't catch a URL without a path.

Also, the lack of group names made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).

So this is my version, slightly modified from the highest-voted version here:


~ Answered on 2016-11-23 13:53:49


I built this one. It is very permissive: it is not meant to validate a URL, just to split it.


  • match 1 : full protocol with :// (http or https)
  • match 2 : protocol without ://
  • match 3 : host
  • match 4 : slug
  • match 5 : param
  • match 6 : anchor





~ Answered on 2020-10-21 17:35:11


Using hometoast's regex works great.

But here is the deal, I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has its own regex depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)
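
To make the setup concrete, here is a rough Python sketch of such an enumeration. The member names, their patterns and the helper find_supported are hypothetical placeholders. Only the idea of one pattern per member, exposed through a getRegexPattern-style method, comes from the description above.

```python
import re
from enum import Enum


class SupportedUrl(Enum):
    # Hypothetical members: each one looks at a different part of the URL.
    DOCS = r"^https?://[^/]+/docs/"
    HTML_FILE = r"^https?://[^/]+/(?:[^/?#]+/)*[^/?#]+\.html(?:[?#]|$)"

    def get_regex_pattern(self):
        """Return the compiled pattern that belongs to this member."""
        return re.compile(self.value)

    def supports(self, url):
        """True if this member's pattern matches the given URL."""
        return self.get_regex_pattern().search(url) is not None


def find_supported(url):
    """Return every member whose pattern matches, in definition order."""
    return [member for member in SupportedUrl if member.supports(url)]
```

A URL is then considered supported when find_supported returns a non-empty list.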

~ Answered on 2008-08-26 11:23:45


I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?

If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:

(?:subexpression)

You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:

https?

would match 'http' or 'https' just fine.
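
A small Python sketch of the difference follows. Both patterns are simplified illustrations written for this example (they are not hometoast's expression), and the sample URL is made up.

```python
import re

# Every part is captured, so the groups are numbered 1 to 4.
CAPTURING = re.compile(r"^(https?)://([^/:]+)(:\d+)?(/.*)?$")

# The scheme and port still have to be present in the right place, but
# (?:...) keeps them out of the results, so only host and path are numbered.
NON_CAPTURING = re.compile(r"^(?:https?)://([^/:]+)(?::\d+)?(/.*)?$")


def captured(pattern, url):
    """Return the captured groups as a tuple, or None if there is no match."""
    m = pattern.match(url)
    return m.groups() if m else None
```

For 'https://test.example.com:8080/dir/file.html' the first pattern returns four groups (scheme, host, port with its colon, path), while the second returns only the host and the path. Both accept 'http' as well as 'https' because the ? applies to the single preceding 's'.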

~ Answered on 2008-08-26 11:34:49


A regexp to get the URL path without the file:

url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s

It can be useful for adding a relative path to this url.

~ Answered on 2009-07-16 22:22:56


/**
 * Parse URL to get information
 * @param   url     the URL string to parse
 * @return  parsed  the URL parsed or null
 */
var UrlParser = function (url) {
    "use strict";

    var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
        matches = regx.exec(url),
        parser = null;

    if (null !== matches) {
        parser = {
            href              : matches[0],
            withoutHash       : matches[1],
            url               : matches[2],
            origin            : matches[3],
            protocol          : matches[4],
            protocolseparator : matches[5],
            credhost          : matches[6],
            cred              : matches[7],
            user              : matches[8],
            pass              : matches[9],
            host              : matches[10],
            hostname          : matches[11],
            port              : matches[12],
            pathname          : matches[13],
            segment1          : matches[14],
            segment2          : matches[15],
            search            : matches[16],
            hash              : matches[17]
        };
    }

    return parser;
};

var parsedURL=UrlParser(url);

~ Answered on 2017-08-16 08:28:28


I tried this regex for parsing url partitions:




Group 1.    0-7 https:/
Group 2.    0-5 https
Group 3.    8-22
Group 6.    22-50   /my/path/sample/asd-dsa/this
Group 7.    22-46   /my/path/sample/asd-dsa/
Group 8.    46-50   this
Group 9.    50-74   ?key1=value1&key2=value2
Group 10.   51-74   key1=value1&key2=value2

~ Answered on 2020-07-22 07:25:50


The best answer suggested here didn't work for me because my URLs also contain a port. However modifying it to the following regex worked for me:


~ Answered on 2020-11-30 08:29:06


The regex to do full parsing is quite horrendous. I've included named backreferences for legibility, and broken each part into separate lines, but it still looks like this:


The thing that requires it to be so verbose is that except for the protocol or the port, any of the parts can contain HTML entities, which makes delineation of the fragment quite tricky. So in the last few cases - the host, path, file, querystring, and fragment, we allow either any html entity or any character that isn't a ? or #. The regex for an html entity looks like this:

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:


In JavaScript, of course, you can't use named backreferences, so the regex becomes


and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.

~ Answered on 2016-09-02 05:37:28


String s = "";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));

Will provide the following output:
1: https://
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = ""; the output will be the following:
1: https://
3: ?
4: wsdl=qwerwer&ttt=888

Yosi Lev

~ Answered on 2015-12-24 10:55:39
