# Tuesday, 27 May 2008

If you save an HTML page to a text file, or hold it as a string in memory, any relative paths in the src and href attributes of its tags are saved exactly as they appear in the page.

Example snippet:

<H1>Hello from Test Page</H1>
<IMG src="images/TestImage1.jpg">

When you then load your saved HTML page you will not see the image, because the browser looks for TestImage1.jpg in an images folder next to the saved file, and that folder doesn't exist. All that exists is your saved HTML text file.
So we need to parse the HTML and prefix the missing server path to the value of each src and href attribute.

A compact way to achieve this is with a regular expression. I'm no expert with RegEx's, so rather than read a book ;) I trawled the Internet for a suitable example and finally found a working expression in the article listed under Links below.

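The pattern: "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>"
The replacement string: "<$1$2=\"" + absoluteUrl + "$3\"$4>"

The negative lookahead (?!http) skips any value that already starts with http, so absolute URLs are left alone. In the replacement, $1 to $4 put the captured parts of the tag back, with absoluteUrl inserted in front of the path ($3). Note that this is a plain replacement string, not a .NET MatchEvaluator delegate.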
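
To see what each group of the pattern captures, here is a minimal sketch that runs the pattern against a single tag. The class name PatternDemo and the method Groups are made up for this illustration and are not part of the original method.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // The pattern from this post, unchanged.
    public const string Pattern = "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>";

    // Hypothetical helper: returns the four captured groups for the first
    // matching tag, or an empty array when nothing matches.
    public static string[] Groups(string tag)
    {
        Match m = Regex.Match(tag, Pattern, RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            return new string[0];
        }

        return new string[]
        {
            m.Groups[1].Value, // everything between "<" and the attribute name
            m.Groups[2].Value, // the attribute name: src or href
            m.Groups[3].Value, // the relative path
            m.Groups[4].Value  // anything between the closing quote and ">"
        };
    }
}
```

Calling PatternDemo.Groups("<IMG src=\"images/TestImage1.jpg\">") should give "IMG " (with the trailing space), "src", "images/TestImage1.jpg" and an empty string. A tag whose value already starts with http, such as <A href="http://example.com/">, does not match at all.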

The example method:

// Requires: using System.Text.RegularExpressions;
public static String ConvertRelativePathsToAbsolute(String text, String absoluteUrl)
{
    // Prefix absoluteUrl to every src/href value that does not already start with "http".
    String value = Regex.Replace(text, "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>",
                                 RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Now just make sure that there isn't a // because if
    // the original relative path started with a / then the
    // replacement above would create a //.
    // This assumes absoluteUrl ends with a /.
    return value.Replace(absoluteUrl + "/", absoluteUrl);
}
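
The double-slash clean-up above only covers paths that start with a /. As a possible alternative (not what the method in this post does), the sketch below lets System.Uri resolve each path against the base address, which should also cope with paths such as ../images/x.jpg. It swaps the replacement string for a MatchEvaluator lambda. The names UriPathSketch and ResolveWithUri are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriPathSketch
{
    // Hypothetical variant: same pattern as the post, but each path is
    // resolved with System.Uri instead of being prefixed as a string.
    public static string ResolveWithUri(string html, string baseUrl)
    {
        Uri baseUri = new Uri(baseUrl);

        return Regex.Replace(
            html,
            "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>",
            m =>
            {
                Uri resolved;
                // Leave the tag untouched if the path cannot be resolved.
                if (!Uri.TryCreate(baseUri, m.Groups[3].Value, out resolved))
                {
                    return m.Value;
                }

                // Rebuild the tag around the resolved absolute address.
                return "<" + m.Groups[1].Value + m.Groups[2].Value + "=\""
                     + resolved.AbsoluteUri + "\"" + m.Groups[4].Value + ">";
            },
            RegexOptions.IgnoreCase);
    }
}
```

With "http://localhost/" as the base, both images/TestImage1.jpg and /images/TestImage1.jpg should come out as http://localhost/images/TestImage1.jpg, so no separate clean-up step is needed. One difference to be aware of: AbsoluteUri returns an escaped address, so a path containing spaces would be rewritten with %20.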

Using the method:

ConvertRelativePathsToAbsolute(myHTML, "http://localhost/")

Will return:

<H1>Hello from Test Page</H1>
<IMG src="http://localhost/images/TestImage1.jpg">

Works great for me so thanks!

Links - Convert Relative Paths to Absolute Using Regular Expressions

C# | Regular Expressions

