Asp.net Examples: Regex match html content without screwing up the tags

When highlighting words in a string containing HTML, we soon ran into problems when the word we were searching for appeared in the middle of a tag.

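Imagine an HTML snippet in which the word geekzilla appears inside a tag (in a link's URL, say) as well as in the visible text.

If I wanted to bold all occurrences of geekzilla, I'd usually do a straightforward find-and-replace on the string.

Unfortunately, when dealing with HTML rather than just text, this also rewrites the occurrence inside the tag, which screws up my tag.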
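To make the problem concrete, here is a rough sketch of the naive approach. The link markup, the class name NaiveHighlightDemo and the helper NaiveBold are made up for illustration and are assumptions, not part of our real code.

```csharp
using System;

public static class NaiveHighlightDemo
{
    // Bolds every occurrence of the term, with no awareness of HTML tags.
    // (Hypothetical helper, for illustration only.)
    public static string NaiveBold(string content, string term)
    {
        return content.Replace(term, "<b>" + term + "</b>");
    }

    public static void Main()
    {
        // Assumed input: the term appears in the href and in the link text.
        string html = "<a href=\"http://www.geekzilla.co.uk\">geekzilla</a>";

        Console.WriteLine(NaiveBold(html, "geekzilla"));
        // prints: <a href="http://www.<b>geekzilla</b>.co.uk"><b>geekzilla</b></a>
    }
}
```

The href now contains a stray pair of bold tags, so the link is broken even though the visible text was highlighted as intended.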

We did a lot of googling and found loads of people discussing ways to ignore the tags. Suggestions ranged from SAX parsers to character-by-character loops (nasty).

Armed with an excellent regex for matching an entire HTML tag, we came up with the following solution.
Our Solution

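Use a custom Regex match evaluator to ignore any tags. The pattern matches either a whole tag or the word we are looking for; the evaluator hands tags back unchanged and rewrites only the word. This works well and is very fast. There may be a slicker way to do this; I hope someone is inspired enough to figure it out and post a comment.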



    // requires: using System.Text.RegularExpressions;

    private string replaceString = "";

    public string Parse(string content)
    {
        // matches one whole HTML tag, e.g. <b> or </a>
        const string regTagName = @"<.[^>]*>";

        // group 1 captures a tag, group 2 captures the word to highlight
        Regex reg = new Regex(@"(" + regTagName + ")|(geekzilla)",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);

        // this is what I'd like to replace the match with
        replaceString = "<b>$1</b>";

        // do the replace
        content = reg.Replace(content, new MatchEvaluator(MatchEval));

        return content;
    }

    protected string MatchEval(Match match)
    {
        if (match.Groups[1].Success)
        {
            // the tag
            return match.ToString();
        }

        if (match.Groups[2].Success)
        {
            // the text we're interested in
            return Regex.Replace(match.ToString(), "(.+)", replaceString);
        }

        // everything else
        return match.ToString();
    }
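If you want to try the idea outside a page class, the following is a minimal self-contained sketch of the same technique. It is a variation rather than our exact code: the class name TagAwareHighlighter, the Highlight method and the sample input are assumptions for illustration, the search term is passed in as a parameter (escaped with Regex.Escape), and the evaluator is written as a lambda.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAwareHighlighter
{
    // Same tag pattern as in the listing above.
    private const string TagPattern = @"<.[^>]*>";

    // Wraps each occurrence of term that lies outside a tag in <b>...</b>.
    // (Hypothetical helper, for illustration only.)
    public static string Highlight(string content, string term)
    {
        // An empty term would match everywhere, so treat it as a no-op.
        if (string.IsNullOrEmpty(term))
        {
            return content;
        }

        // Group 1 = a whole tag, group 2 = the escaped search term.
        var regex = new Regex(
            "(" + TagPattern + ")|(" + Regex.Escape(term) + ")",
            RegexOptions.IgnoreCase);

        return regex.Replace(content, match =>
            match.Groups[2].Success
                ? "<b>" + match.Value + "</b>"  // text hit: wrap it, keeping its casing
                : match.Value);                 // tag: hand it back untouched
    }

    public static void Main()
    {
        // Assumed input: the term appears in the href and in the link text.
        string html = "<a href=\"http://www.geekzilla.co.uk\">Visit GeekZilla</a>";

        Console.WriteLine(Highlight(html, "geekzilla"));
        // prints: <a href="http://www.geekzilla.co.uk">Visit <b>GeekZilla</b></a>
    }
}
```

Because the tag alternative comes first and consumes the whole tag, the word inside the href is never seen on its own. One caveat: the tag pattern stops at the first > character, so an attribute value that itself contains > would probably trip it up.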


Friday, October 17, 2008
