Regex to Select a Sub-Set of a Regex Select
An answer to this question on Stack Overflow.
Question
I haven't had any luck searching on this and I believe that's because I don't know the key terms to use to explain what I'm looking for. I have the following regex that I'm using to distinguish internal links on a set of HTML pages from external links:
(?<=a href=")[^http](.*?)(\.html")
So it won't select "http://www.example.com/foo/bar.html" from:
<a href="http://www.example.com/foo/bar.html">bar</a>
but will select "/foo/bar.html" from:
<a href="/foo/bar.html">bar</a>
This much is working great. Now I want to do a subselect on the selected string "/foo/bar.html" to isolate just the ".html" part. Is this possible? Possibly with a substring or another lookbehind/lookahead? I've set up an example here:
https://www.regex101.com/r/gZ6bP5/2
This is for a global find/replace in Sublime Text Editor. So I believe I am restricted to just the regex for this. I understand that a variable find/replace is possible, but I have not been able to find an example of that in action.
EDIT: Just to clarify, the regex I have to distinguish between external/internal links works great (although imperfectly as commenters have noted). The question is about how to select just the ".html" portion of the match.
Thanks in advance!
Answer
This seems to do the trick:
(?<=a href=")(?!http)[^"]*\/([^"]+)(?=">)
The idea:
- Use look-behind (?<=a href=") to ensure we are in a link anchor.
- Use look-ahead (?=">) to ensure the anchor ends.
- Use negative look-ahead (?!http) to ensure things don't start with http.
- Use a greedy match [^"]* to consume all characters up to the last slash, without crossing a quote boundary.
- Grab all characters after the last slash but before the quote boundary in a capture group ([^"]+).
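As a quick check of the idea above, here is a minimal Python sketch that runs the pattern against the two anchors from the question. Python's re module is used only for illustration (the question targets Sublime Text), and the variable names are made up for this example.

```python
import re

# The pattern from the answer, written as a raw string.
PATTERN = re.compile(r'(?<=a href=")(?!http)[^"]*\/([^"]+)(?=">)')

internal = '<a href="/foo/bar.html">bar</a>'
external = '<a href="http://www.example.com/foo/bar.html">bar</a>'

match = PATTERN.search(internal)
print(match.group(0))            # /foo/bar.html  (the whole match)
print(match.group(1))            # bar.html       (the capture group)
print(PATTERN.search(external))  # None           (rejected by (?!http))
```

Note that the capture group holds everything after the last slash, so it yields the file name together with its extension.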
Problems you may encounter:
- This is valid HTML: <a target="_blank" href="bob.html">.
- This is a valid link: <a href="ftp://bob.html">.
Though you can build regexes to deal with these as well.
To deal with the target issue, we drop the look-behind, and the final look-ahead:
<a[^>]*href="(?!http)[^"]*\/([^"]+)
Now we are matching a string that starts with <a and looking for href=" inside of it. By dropping (?=">), we can also handle anchors that carry more attributes after the href.
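A small Python sketch of the difference, assuming a hypothetical anchor with attributes on both sides of the href (the sample markup and names are invented for illustration):

```python
import re

# First version: the look-behind needs href to follow "<a " directly.
ORIGINAL = re.compile(r'(?<=a href=")(?!http)[^"]*\/([^"]+)(?=">)')
# Relaxed version: any attributes may sit between <a and href.
RELAXED = re.compile(r'<a[^>]*href="(?!http)[^"]*\/([^"]+)')

html = '<a target="_blank" href="/foo/bob.html" class="nav">bob</a>'

print(ORIGINAL.search(html))          # None
print(RELAXED.search(html).group(1))  # bob.html
print(RELAXED.search(html).group(0))  # <a target="_blank" href="/foo/bob.html
```

Because the look-behind is gone, the whole match now includes the opening of the tag, which matters when you write the replacement.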
To deal with ftp, we could do the following:
<a[^>]*href="(?!(http|ftp))[^"]*\/([^"]+)
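The following Python sketch checks that pattern against one link of each kind. The example.com URLs are placeholders. It also shows a side effect worth knowing: (http|ftp) is itself a capturing group, so the file name lands in group 2 rather than group 1.

```python
import re

FTP_AWARE = re.compile(r'<a[^>]*href="(?!(http|ftp))[^"]*\/([^"]+)')

ftp_link = '<a href="ftp://example.com/bob.html">bob</a>'
http_link = '<a href="http://www.example.com/foo/bar.html">bar</a>'
internal = '<a href="/foo/bar.html">bar</a>'

print(FTP_AWARE.search(ftp_link))   # None
print(FTP_AWARE.search(http_link))  # None

match = FTP_AWARE.search(internal)
print(match.group(1))  # None      (the group inside the negative look-ahead)
print(match.group(2))  # bar.html  (the file name)
```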
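Now you can wrap the beginning of the pattern in a capture group. Writing the protocol alternation as a non-capturing group, (?:http|ftp), keeps it from taking a group number of its own:
(<a[^>]*href="(?!(?:http|ftp))[^"]*\/)([^"]+)
And alter $1 (the part up to FILENAME.EXTENSION) and $2 (the FILENAME.EXTENSION) as you see fit.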
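To illustrate the replacement step, here is a hedged Python sketch that rewrites the extension of internal links only. It uses the non-capturing form (?:http|ftp) so that the prefix and the file name are groups 1 and 2. The swap_extension helper and the ".php" target are hypothetical, chosen only to have something to replace with.

```python
import re

# Group 1: everything up to the last slash; group 2: FILENAME.EXTENSION.
PATTERN = re.compile(r'(<a[^>]*href="(?!(?:http|ftp))[^"]*\/)([^"]+)')

def swap_extension(html: str, old: str = ".html", new: str = ".php") -> str:
    """Replace the extension in internal links, leaving other links alone."""
    def repl(m: re.Match) -> str:
        prefix, filename = m.group(1), m.group(2)
        if filename.endswith(old):
            filename = filename[: -len(old)] + new
        return prefix + filename
    return PATTERN.sub(repl, html)

page = (
    '<a href="/foo/bar.html">bar</a> '
    '<a href="http://www.example.com/foo/bar.html">ext</a>'
)
print(swap_extension(page))
# <a href="/foo/bar.php">bar</a> <a href="http://www.example.com/foo/bar.html">ext</a>
```

In an editor's find/replace dialog the same idea is expressed by putting $1 in the replacement field followed by whatever should stand in for $2.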
An example is at: https://www.regex101.com/r/gZ6bP5/3.