Friday, October 17, 2008

Regex match HTML content without screwing up the tags

When highlighting words in a string containing HTML, we soon ran into problems when the word we were searching for appeared inside a tag.

Imagine the example: 


If I wanted to bold all occurrences of geekzilla, I'd usually do this:

Unfortunately, when dealing with HTML rather than plain text, this will screw up my tag and produce the following:
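To make the problem concrete, here's a minimal sketch. The sample string, the class name NaiveHighlightDemo and the method NaiveBold are made up for illustration, and example.com stands in for a real URL. It assumes the "usual" approach is a plain case-insensitive Regex.Replace over the whole string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveHighlightDemo
{
    // Hypothetical sample: the word appears in the href and in the link text.
    public const string Sample =
        "Visit <a href=\"http://example.com/geekzilla\">geekzilla</a> today";

    // Naive approach: bold every occurrence, including the one inside the tag.
    public static string NaiveBold(string content)
    {
        return Regex.Replace(content, "(geekzilla)", "<b>$1</b>",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(NaiveBold(Sample));
        // Visit <a href="http://example.com/<b>geekzilla</b>"><b>geekzilla</b></a> today
    }
}
```

The href now contains a <b> element, so the link is broken.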



We did a lot of googling and found loads of people discussing ways to ignore the tags. Suggestions ranged from SAX parsers to character-by-character loops (nasty).

Armed with an excellent regex for matching an entire HTML tag, we came up with the following solution.

Our Solution

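Use a custom Regex match evaluator to ignore any tags. The pattern tries the tag alternative first, so each whole tag is consumed as a single match and handed back unchanged; the search word is only ever matched in the text between tags. This works well and is very fast. There may be a slicker way to do this. I hope someone is inspired enough to figure it out and post a comment.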


// at the top of the file
using System.Text.RegularExpressions;

// replacement pattern for the matched word; set in Parse, used in MatchEval
private string replaceString = "";

public string Parse(string content)
{
    // matches a whole HTML tag, e.g. <a href="..."> or </a>
    const string regTagName = @"<.[^>]*>";

    // group 1 captures a whole tag, group 2 captures the search word
    Regex reg = new Regex(@"(" + regTagName + ")|(geekzilla)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // this is what I'd like to replace the match with
    replaceString = "<b>$1</b>";

    // do the replace
    content = reg.Replace(content, new MatchEvaluator(MatchEval));

    return content;
}

protected string MatchEval(Match match)
{
    if (match.Groups[1].Success)
    {
        // the tag: hand it back untouched
        return match.ToString();
    }
    if (match.Groups[2].Success)
    {
        // the text we're interested in
        return Regex.Replace(match.ToString(), "(.+)", replaceString);
    }
    // everything else
    return match.ToString();
}
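
As one possible take on the "slicker way", here's a sketch of the same idea with the search word passed in as a parameter and the evaluator written as a lambda. The class name TagSafeHighlighter, the method Highlight and the sample string are hypothetical; the tag pattern is the one used above. Regex.Escape is added on the assumption that the search word should be treated as literal text rather than as a pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagSafeHighlighter
{
    // Same tag pattern as in the post.
    private const string TagPattern = @"<.[^>]*>";

    // Hypothetical helper: wraps each occurrence of word outside tags in <b>...</b>.
    public static string Highlight(string content, string word)
    {
        // Escape the word so characters like '.' or '+' are matched literally.
        var regex = new Regex("(" + TagPattern + ")|(" + Regex.Escape(word) + ")",
            RegexOptions.IgnoreCase);

        return regex.Replace(content, match =>
            match.Groups[2].Success
                ? "<b>" + match.Value + "</b>"  // the word, outside any tag
                : match.Value);                 // a whole tag, returned untouched
    }

    public static void Main()
    {
        string html = "Visit <a href=\"http://example.com/geekzilla\">geekzilla</a> today";
        Console.WriteLine(Highlight(html, "geekzilla"));
        // Visit <a href="http://example.com/geekzilla"><b>geekzilla</b></a> today
    }
}
```

The href is left alone because the whole opening tag is matched by group 1 before the word alternative gets a chance to match inside it.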
