Wednesday, August 25, 2010

[Programming] Regular Expression: ASP Strip Tags



This week, I have been mostly coding MAZIN. You will find out what that means when listening to Ram FM soon.


As part of the project, I had to create a page with a WYSIWYG editor (also known as a Rich Text Editor) that would allow users to compose copy that may or may not include simple HTML tags such as bold, italic, lists, breaks and paragraphs.


As all sites we develop these days are XHTML based and standards compliant, I found that FCKEditor was the best choice, even though it is what Rich would call “Bloatware”, i.e. it’s ridiculously large in terms of directories, language files and so on. That reminds me, I need to go through it and delete all the unwanted languages and plug-in scripts before going live.


The problem with FCKEditor is that it will still let users post HTML that is not allowed, even when you have told it that you only want users to be able to make text bold, italic or whatever. This means that when the form is submitted, we need a strip_tags function on the server, like the one PHP has.


Haven’t you already posted an ASP one!? I hear you ask. Well, I did. But the PHP version of strip_tags allows you to specify which tags you want to remain: www.php.net/strip_tags


After some research, it seems that nobody has come up with an ASP version of this sweet function, so I wrote my own. Et voilà:

' =============================================================================================================
' @name stripHTML
' @desc strips all HTML from code except for tags separated by commas in the param "allowedTags"
' @returns string
' =============================================================================================================
function stripHTML(strHTML, allowedTags)

    dim objRegExp, strOutput, matches, match, tagName, allowedList
    set objRegExp = new regexp

    strOutput = strHTML
    ' wrap the list in commas so that "b" cannot match inside "br"; a local copy is used
    ' because VBScript passes arguments ByRef by default
    allowedList = "," & lcase(replace(allowedTags, " ", "")) & ","

    objRegExp.IgnoreCase = true
    objRegExp.Global = true
    objRegExp.MultiLine = true
    objRegExp.Pattern = "<(.|\n)+?>" ' match all tags, even XHTML ones
    set matches = objRegExp.execute(strHTML)

    objRegExp.Pattern = "<(/?)(\w+)[^>]*>" ' $2 captures the tag name
    for each match in matches
        tagName = objRegExp.Replace(match.value, "$2")
        tagName = "," & lcase(tagName) & ","

        if instr(allowedList, tagName) = 0 then
            strOutput = replace(strOutput, match.value, "")
        end if
    next

    stripHTML = strOutput 'Return the value of strOutput
    set objRegExp = nothing
end function



Usage is simple, just do:


html = stripHTML(html, "b,i,strong,em,p,br")


Where b, i, strong, em, p and br are the tags you are allowing.
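The step that decides whether a tag is allowed relies on pulling the tag name out of each matched tag. As a rough illustration of just that step, here is a small sketch; tagNameOf is a hypothetical helper written for this example and is not part of stripHTML itself.

```vbscript
' Hypothetical helper (illustration only): returns the lower-cased name of a single tag.
Function tagNameOf(tag)
    Dim regEx
    Set regEx = New RegExp
    regEx.IgnoreCase = True
    regEx.Global = False
    ' group 1 = optional closing slash, group 2 = tag name, the rest = attributes
    regEx.Pattern = "<(/?)(\w+)[^>]*>"
    ' replacing the whole tag with $2 leaves only the tag name
    tagNameOf = LCase(regEx.Replace(tag, "$2"))
End Function

Dim nameOpen, nameClose, nameSelfClosing
nameOpen = tagNameOf("<p class=""intro"">")   ' "p"
nameClose = tagNameOf("</STRONG>")            ' "strong"
nameSelfClosing = tagNameOf("<br />")         ' "br"
```

Because only the name is compared, an allowed tag keeps whatever attributes the user typed (an onclick on a b tag, for example), so it may be worth cleaning attributes separately if that is a concern.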


That’s all for now





Useful Regular Expressions in ASP


While working on an ASP ticket system today that required regular expressions, I came up with a couple of useful regular expression patterns that may save people a few hours of thinking time.


Matching and extracting a string


Problem: I have the following chunk of arbitrary text and I want to extract the order number prefixed “ORD_”:


The quick brown fox... ORD_1012345678 ...jumped over the lazy dog


Solution: ORD_[a-zA-Z0-9_-]*


What is going on? Quite simply, the regular expression engine is asked to match the letters “ORD” followed by an underscore “_”. It then matches a series (*) of letters, numbers, underscores or dashes (but nothing else). Because * means “zero or more”, a bare “ORD_” with nothing after it would also match. Once the engine has matched the order number “ORD_1012345678” and comes to whitespace, a new line, a period or anything else outside that set, the match ends.


ASP VBScript Code:

Set regEx = New RegExp

With regEx
.Pattern = "ORD_[a-zA-Z0-9_-]*"
.IgnoreCase = true
.Global = false
End With
set matches = regEx.Execute(text)
if matches.count > 0 then
result = matches.item(0).value
end if


The string “ORD_1012345678”, extracted from the chunk of text, will be stored in the variable “result”.
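If a piece of text may contain more than one order number, a small variation on the code above could collect them all. The following is only a sketch: the function name extractOrderNumbers and the sample text are made up for illustration, and it uses + rather than * so that a bare “ORD_” is not counted.

```vbscript
' Hypothetical helper (illustration only): returns every ORD_ reference in text,
' joined into one comma-separated string ("" if there are none).
Function extractOrderNumbers(text)
    Dim regEx, matches, match, found
    Set regEx = New RegExp
    With regEx
        .Pattern = "ORD_[a-zA-Z0-9_-]+"   ' + requires at least one character after the prefix
        .IgnoreCase = True
        .Global = True                    ' collect every match, not just the first
    End With
    Set matches = regEx.Execute(text)
    found = ""
    For Each match In matches
        If found <> "" Then found = found & ","
        found = found & match.Value
    Next
    extractOrderNumbers = found
End Function

Dim orders
orders = extractOrderNumbers("Refund ORD_1012345678, then ship ORD_20-A.")
' orders is now "ORD_1012345678,ORD_20-A"
```

The only real change from the single-match version is Global = True plus a loop over the Matches collection.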


A very similar version of string extraction


Problem: I have the following chunk of arbitrary text and I want to extract the ID number in square brackets (prefixed “[#”):


The quick brown fox jumped over the lazy dog [#101234-56789]


Solution: \[#([a-zA-Z0-9_-]*)


What is going on? In a similar way to the first one, this pattern asks for a square bracket followed by a hash, “[#”. Because the opening square bracket is a reserved character (used to define sets), we have to escape it with a backslash beforehand. We then surround the series of allowed characters with parentheses ( ), which captures that part of the match as a “sub match”.


ASP VBScript Code:

Set regEx = New RegExp

With regEx
.Pattern = "\[#([a-zA-Z0-9_-]*)"
.IgnoreCase = true
.Global = false
End With
set matches = regEx.Execute(text)
if matches.count > 0 then
result = matches(0).subMatches(0)
end if


The ID number “101234-56789” will be stored in “result”.


The important difference to note in this code is the use of “subMatches(0)”, which returns the text captured by the first pair of parentheses rather than the whole match.
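As a variation, the closing square bracket can be made part of the pattern so that only complete “[#...]” references are accepted. This is a sketch under that assumption; extractBracketedId is a hypothetical name chosen for the example.

```vbscript
' Hypothetical helper (illustration only): returns the ID inside "[#...]", or "" if none is found.
Function extractBracketedId(text)
    Dim regEx, matches
    Set regEx = New RegExp
    With regEx
        ' \[ and \] are literal brackets; the parentheses capture only the ID itself
        .Pattern = "\[#([a-zA-Z0-9_-]+)\]"
        .IgnoreCase = True
        .Global = False
    End With
    extractBracketedId = ""
    Set matches = regEx.Execute(text)
    If matches.Count > 0 Then
        ' matches(0).Value would be the whole "[#101234-56789]";
        ' SubMatches(0) is just the captured part
        extractBracketedId = matches(0).SubMatches(0)
    End If
End Function

Dim ticketId
ticketId = extractBracketedId("The quick brown fox jumped over the lazy dog [#101234-56789]")
' ticketId is now "101234-56789"
```

Requiring the closing bracket means a truncated reference such as “[#1012” is ignored instead of being half-extracted.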


Stripping HTML tags


This function can be used to strip HTML tags from a string. It is very similar to the PHP function strip_tags(); but this one is not as advanced (yet).


A more advanced version, stripHTML, which lets you choose which tags to keep, is now available in the “ASP Strip Tags” post above.


Let’s just jump straight to the code; you don’t really need to know what is going on (you can probably guess anyway)…


ASP VBScript Code:

function stripTags(strHTML)

    dim regEx
    Set regEx = New RegExp
    With regEx
        .Pattern = "<(.|\n)+?>"
        .IgnoreCase = true
        .Global = true ' replace every tag, not just the first one
    End With
    stripTags = regEx.replace(strHTML, "")
end function


Trimming unwanted whitespace


If you want to trim unwanted whitespace from a string, e.g: turning “Text[space]spaced[space]normally[space][space][space]or[space][space]not?” into: “Text[space]spaced[space]normally[space]or[space]not?” use the following method:

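function trimWhitespace(strIn, singleSpacing)

    dim regEx, replacement
    Set regEx = New RegExp
    With regEx
        .Pattern = "\s+"
        .IgnoreCase = true
        .Global = true ' collapse every run of whitespace, not just the first
    End With
    ' named "replacement" rather than "space", which is also a built-in VBScript function
    if singleSpacing then
        replacement = " "
    else
        replacement = ""
    end if
    trimWhitespace = regEx.replace(strIn, replacement)
end function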


When set to false, the second parameter “singleSpacing” will simply remove all whitespace from a string, giving: “Textspacednormallyornot?”
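Collapsing runs of whitespace to a single space can still leave one space at the very start or end of the string. If that matters, one option is to combine the same pattern with VBScript’s built-in Trim. This is only a sketch, and collapseWhitespace is a hypothetical name used for the example.

```vbscript
' Hypothetical helper (illustration only): collapses each run of whitespace
' (spaces, tabs, line breaks) to one space, then strips leading/trailing spaces.
Function collapseWhitespace(strIn)
    Dim regEx
    Set regEx = New RegExp
    With regEx
        .Pattern = "\s+"
        .Global = True   ' act on every run of whitespace in the string
    End With
    collapseWhitespace = Trim(regEx.Replace(strIn, " "))
End Function

Dim tidy
tidy = collapseWhitespace("  Text   spaced" & vbCrLf & "normally  ")
' tidy is now "Text spaced normally"
```

Doing the regular expression replace first means tabs and line breaks at the ends have already become plain spaces by the time Trim runs.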


I hope the above examples help someone!


You may find the following websites useful, I certainly did!

