Introduction

Before I go into detail, I want you to know what EfTidy actually is. EfTidy is a wrapper component for the Tidy library. If you don't know what Tidy is, here is a short description.

"TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools." -- By Mr. Dave Raggett

What I Am Doing with This Library

Recently one of my company's clients asked us to build a TidyAtl class for the new Tidy library, because the last ATL component (ActiveX wrapper) for Tidy was built in 2002. My company assigned me the task of creating an ATL library for it. After I completed the component, my boss told me, "Alok, this is an open source component and other programmers deserve to use it." So here I am, presenting this component with its supporting source code and a brief overview of each function.

Component Reference

EfTidy contains four interfaces:

  • IEfTidyAttr (2 properties)
  • IEfTidyNode (1 property and 4 methods)
  • ItidyOption (66 properties)
  • ItidyCom (5 methods and 4 properties)

EfTidy also contains five enumerations:

CharEncodingType

typedef [public] enum tagCharEncodingType
{
  ASCII, LATIN1, RAW, UTF8, ISO2022, MAC, WIN1252, UTF16LE, UTF16BE, UTF16, BIG5, SHIFTJIS
} CharEncodingType;

OutputType

typedef [public] enum tagOutputType
{
  XmlOut,   /**< Create output as XML */
  XhtmlOut, /**< Output extensible HTML */
  HtmlOut   /**< Output plain HTML, even for XHTML input. */
} OutputType;

IndentScheme

typedef [public] enum IndentScheme
{
  NOINDENT = 0,
  INDENTBLOCKS,
  AUTOINDENT
} IndentScheme;

DoctypeModes

typedef [public] enum
{
  DoctypeOmit,   /**< Omit DOCTYPE altogether */
  DoctypeAuto,   /**< Keep DOCTYPE in input. Set version to content */
  DoctypeStrict, /**< Convert document to HTML 4 strict content model */
  DoctypeLoose,  /**< Convert document to HTML 4 transitional content model */
  DoctypeUser    /**< Set DOCTYPE FPI explicitly */
} DoctypeModes;

EfTidyMainNode

typedef [public] enum
{
  TIDY_ROOT, // Return Tidy ROOT node
  TIDY_HTML, // Return Tidy HTML node
  TIDY_HEAD, // Return Tidy HEAD node
  TIDY_BODY  // Return Tidy BODY node
} EfTidyMainNode;

 

Now let's take each interface one by one.

1. ItidyCom

First, let's go through every method and property in this interface and what each one does.

Property/Method Name | Parameters | Get/Put | Description
TidyFiletoMem (method) | [in] BSTR sourceFile, [out, retval] BSTR* result | n/a | Takes a file as input and writes the output to memory
TidyFileToFile (method) | [in] BSTR sourceFile, [in] BSTR destFile | n/a | Takes a file as input and writes the output to a file
TidyMemToMem (method) | [in] BSTR sourceStr, [out, retval] BSTR* result | n/a | Takes a string as input and writes the output to memory
TidyMemtoFile (method) | [in] BSTR buffer, [in] BSTR destFile | n/a | Takes a buffer as input and writes the output to a file
TotalWarnings (property) | [out, retval] long *pVal | Get | Returns the total number of warnings after any of the above four operations
TotalErrors (property) | [out, retval] long *pVal | Get | Returns the total number of errors after any of the above four operations
ErrorWarning (property) | [out, retval] BSTR *pVal | Get | Returns a buffer containing the human-readable errors and warnings
Option (property) | [out, retval] ItidyOption* *pVal | Get | Returns the ItidyOption object used to set options for the Tidy library
EfTidyNode (method) | [in] EfTidyMainNode Type, [out, retval] IEfTidyNode **ppNewEfTidyNode | n/a | An HTML page has a tree structure. This method returns a Tidy node that lets you read every tag and its attributes. This is the latest addition to the Tidy library.

2. ItidyOption

Here is the list of properties and methods for the ItidyOption interface.

Property/Method Name | Parameter | Get/Put | Description
LoadConfigFile (method) | BSTR | n/a | Load option settings from a configuration file
ResetToDefaultValue (method) | Void | n/a | Reset options to default settings
Doctype | BSTR | Both | Doctype declaration generated by Tidy
TidyMark | VARIANT_BOOL | Both | Add meta element indicating tidied doc
HideEndTag | VARIANT_BOOL | Both | Suppress optional end tags
EncloseText | VARIANT_BOOL | Both | If yes, text at body level is wrapped in <p>
EncloseBlockText | VARIANT_BOOL | Both | If yes, text in blocks is wrapped in <p>
LogicalEmphasis | VARIANT_BOOL | Both | Replace i by em and b by strong
DefaultAltText | BSTR | Both | Default text for alt attribute
Clean | VARIANT_BOOL | Both | Replace presentational clutter by style rules
DropFontTags | VARIANT_BOOL | Both | Discard presentation tags
DropEmptyParas | VARIANT_BOOL | Both | Discard empty p elements
Word2000 | VARIANT_BOOL | Both | Draconian cleaning for Word2000
FixBadComment | VARIANT_BOOL | Both | Fix comments with adjacent hyphens
FixBackslash | VARIANT_BOOL | Both | Fix URLs by replacing \ with /
NewEmptyTags | BSTR | Both | Declared empty tags
NewInlineTags | BSTR | Both | Declared inline tags
NewBlockLevelTags | BSTR | Both | Declared block tags
NewPreTags | BSTR | Both | Declared pre tags
OutputType | OutputType *pVal | Both | Sets the output type: XML, XHTML or plain HTML
InputAsXML | VARIANT_BOOL | Both | Treat input as XML
ADDXmlDecl | VARIANT_BOOL | Both | Add <?xml ?> for XML docs
AddXmlSpace | VARIANT_BOOL | Both | If set to yes, adds xml:space attribute as needed
Bare | VARIANT_BOOL | Both | Make bare HTML
AssumeXmlProcins | VARIANT_BOOL | Both | If set to yes, PIs must end with ?>
CharEncoding | CharEncodingType | Both | Set/get input and output character encoding
InCharEncoding | CharEncodingType | Both | Input character encoding (if different)
OutCharEncoding | CharEncodingType | Both | Output character encoding (if different)
NumericsEntities | VARIANT_BOOL | Both | Use numeric entities for symbols
QuoteMarks | VARIANT_BOOL | Both | Output " marks as &quot;
QuoteNBSP | VARIANT_BOOL | Both | Output non-breaking space as entity
QuoteAmpersand | VARIANT_BOOL | Both | Output naked ampersand as &amp;
OutputTagInUpperCase | VARIANT_BOOL | Both | Output tags in upper not lower case
OutputAttrInUpperCase | VARIANT_BOOL | Both | Output attributes in upper not lower case
WrapScriptlets | VARIANT_BOOL | Both | Wrap within JavaScript string literals
WrapAttVals | VARIANT_BOOL | Both | Wrap within attribute values
WrapSection | VARIANT_BOOL | Both | Wrap within section tags
WrapAsp | VARIANT_BOOL | Both | Wrap within ASP pseudo elements
WrapJste | VARIANT_BOOL | Both | Wrap within JSTE pseudo elements
WrapPhp | VARIANT_BOOL | Both | Wrap within PHP pseudo elements
Indent | IndentScheme | Both | Indent content of appropriate tags
IndentSpace | long | Both | Indentation n spaces
WrapLen | long | Both | Set wrap margin for output
TabSize | long | Both | Expand tabs to n spaces
IndentAttributes | long | Both | Newline+indent before each attribute
BreakBeforeBR | VARIANT_BOOL | Both | Output newline before <br> or not
LiteralAttribs | VARIANT_BOOL | Both | If true, attributes may use newlines
MarkUp | VARIANT_BOOL | Both |
ShowWarnings | VARIANT_BOOL | Both | On/Off
Quiet | VARIANT_BOOL | Both | No 'Parsing X', guessed DTD or summary
KeepTime | VARIANT_BOOL | Both | If yes, last modified time is preserved
ErrorFile | BSTR | Both | File name to write errors to
GnuEmacs | VARIANT_BOOL | Both | If true, format error output for GNU Emacs
FixUrl | VARIANT_BOOL | Both | Applies URI encoding if necessary
BodyOnly | VARIANT_BOOL | Both | Output BODY content only
HideComments | VARIANT_BOOL | Both | Hides all (real) comments in output
DoctypeMode | DoctypeModes | Both | Set the doctype mode for output

3. IEfTidyNode

Here is the list of properties and methods for the IEfTidyNode interface.

Property/Method Name | Parameter | Get/Put | Description
Name (property) | BSTR *pVal | Get | Returns the name of the current tag
GetFirstChildNode (method) | IEfTidyNode | n/a | Returns the first child node
GetNextChildNode (method) | IEfTidyNode | n/a | Returns the next child node; use it to enumerate the rest of the tags
GetFirstAttribute (method) | IEfTidyAttr | n/a | Returns the first attribute of the current tag
GetNextAttribute (method) | IEfTidyAttr | n/a | Returns the remaining attributes one by one

4. IEfTidyAttr

Here is the list of properties for the IEfTidyAttr interface.

Property/Method Name | Parameter | Get/Put | Description
Name (property) | BSTR *pVal | Get | Name of the attribute
Value (property) | BSTR *pVal | Get | Value of the attribute

Using the code

The component was developed for use with Visual Basic and other COM-friendly languages, so all the code shown here is in Visual Basic. I am going to use a few test cases to explain how the component works.

I have used Test.htm (included with the project) to test EfTidy's responses.

Here is what Test.htm contains:

<html> 
<head>
<title>tidy Library</title> 
</head>
<body> 
<blockquote> 
<p> </p> --(1)
<p><font size="5" color=   
"#FF00FF">TidyLibrary</font></p></blockquote><P><p><font size="5" color="#FF00FF"></font></p>
<table border="1" cellpadding="0" cellspacing="0" 
style="border-collapse: collapse" bordercolor="#111111" width="100%" 
id="AutoNumber1">
<tr> 
<td width="50%" style="border-left-style: 
solid; border-left-width: 1; border-right-style: none; border-right-width: 
medium; border-top-style: solid; border-top-width: 1; border-bottom-style: 
none; border-bottom-width: medium"> --(2)
</td>
<td width="50%" style="border-left-style: none; border-left-width: medium;
border-right-style:solid; border-right-width: 1; border-top-style: solid;
border-top-width: 1;border-bottom-style: none; border-bottom-width: medium">
</td> 
</tr>
</table> 
<b>Tidy  --- (3)		
</h1> <tidy> ---(4)  
</body> 
</html>

In Test.htm I have deliberately added the following mistakes:

  • a dummy <tidy> tag at (4)
  • a closing </h1> tag with no opening <h1> at (4)
  • an empty paragraph <p> tag at (1)
  • an unclosed <b> tag at (3)

Test Case #1: Using ITidyCOM

First create an object of our component. Here is the listing that does that.

' Declared at module level so the other event handlers can use the same object
Private TidyCOMObj As EFTIDYLib.tidyCom

Private Sub Form_Load()
    Set TidyCOMObj = New EFTIDYLib.tidyCom
End Sub

Now clean the Test.htm file using this object. The code listing for that is:

Private Sub cmdMemtoMem_Click()
    ' A method call used as a statement takes its arguments without parentheses
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    ' Check the number of errors in the HTML
    txtError = TidyCOMObj.TotalErrors
    ' Check the number of warnings in the HTML
    txtWarning = TidyCOMObj.TotalWarnings
End Sub

And here is the result produced by Tidy. The listing shows what test1.htm (created by EfTidyCom) contains.

<html> 
<head> 
<meta name="generator" 
content= "HTML Tidy for Windows (vers 1st September 2004), see www.w3.org"> 
<title>tidy Library</title> 
</head>
<body> 
<blockquote> 
<p> </p> 
<p><font size="5" color="#FF00FF">Tidy Library</font>
</p> 
</blockquote> 
<p><font size="5" color= "#FF00FF">	</font></p> 
<table border="1" cellpadding="0" cellspacing="0" style= "border-collapse: 
collapse" bordercolor="#111111" width="100%" id= "AutoNumber1">
<tr> 
<td width="50%" style= "border-left-style: solid; border-left-width: 1; 
border-right-style: none; border-right-width: medium; 
border-top-style: solid; border-top-width: 1; border-bottom-style: none;
border-bottom-width: medium">
</td> 
<td width="50%" style= "border-left-style: none;border-left-width: medium;
border-right-style: solid; border-right-width: 1;border-top-style: solid; 
border-top-width: 1; border-bottom-style: none;border-bottom-width: medium"> 
</td> 
</tr> 
</table> 
<b>Tidy</b> --(1) 
</body> 
</html>

If you look at the cleaned HTML page above, the dummy <tidy> tag and the stray </h1> have been removed near (1), and a closing </b> has been added after "Tidy" at (1).

Here is the summary of errors and warnings produced by EfTidyCom, showing the details of each action it performed:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Error: <tidy> is not recognized!
line 23 column 1 - Warning: discarding unexpected <tidy>
line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
5 warnings, 1 error were found!
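
If you want to process such a summary in your own program rather than just display it, the lines are regular enough to parse. The following is only a rough sketch, written in Python so it can be run without the component. It assumes the text you get back (for example from the ErrorWarning property) uses the same "line N column M - Kind: message" layout as the summary above; the name parse_diagnostics is made up for this illustration and is not part of EfTidy.

```python
import re

# Pattern for one diagnostic line in the layout shown in the summary, e.g.
# "line 22 column 10 - Warning: discarding unexpected </h1>"
DIAG = re.compile(r"^line (\d+) column (\d+) - (Warning|Error): (.*)$")


def parse_diagnostics(report):
    """Return a list of (line, column, kind, message) tuples.

    Hypothetical helper: lines that do not match the pattern
    (the "Info:" line and the final totals) are skipped.
    """
    entries = []
    for raw in report.splitlines():
        match = DIAG.match(raw.strip())
        if match:
            line, column, kind, message = match.groups()
            entries.append((int(line), int(column), kind, message))
    return entries


# The summary from Test Case #1, copied from the article
report = """\
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Error: <tidy> is not recognized!
line 23 column 1 - Warning: discarding unexpected <tidy>
line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
5 warnings, 1 error were found!
"""

entries = parse_diagnostics(report)
warnings = [e for e in entries if e[2] == "Warning"]
errors = [e for e in entries if e[2] == "Error"]
print(len(warnings), "warnings,", len(errors), "errors")
```

The counts recovered this way should agree with TotalWarnings and TotalErrors for the same run, which gives a quick sanity check.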

 

Test Case #2: Using ITidyCOM

Now let's apply some options to Test.htm to get custom output. I am using these options:

  • Clean = TRUE (to move inline styles into separate style classes)
  • DoctypeMode = DoctypeUser (to enable our own doctype string)
  • Doctype = "Ef Tidy library" (the doctype string to display)
  • OutputType = XhtmlOut (output type)
  • NewInlineTags = "tidy" (makes our dummy <tidy> tag legal)

Here is the code listing that does this:

Private Sub cmdMemtoMem_Click()
    TidyCOMObj.Option.Clean = True
    TidyCOMObj.Option.NewInlineTags = "tidy"
    TidyCOMObj.Option.OutputType = XhtmlOut
    ' Our string is shown in the cleaned HTML
    ' only if the doctype mode is DoctypeUser
    TidyCOMObj.Option.DoctypeMode = DoctypeUser
    TidyCOMObj.Option.Doctype = "Ef Tidy library"
    ' A method call used as a statement takes its arguments without parentheses
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    txtError = TidyCOMObj.TotalErrors
    txtWarning = TidyCOMObj.TotalWarnings
End Sub

And here is the result produced by Tidy. The listing shows what test1.htm (created by EfTidyCom) contains after applying our options.

<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)  
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" 
content="HTML Tidy for Windows (vers 1st September 2004), see www.w3.org" />
<title>tidy Library</title>
<style type="text/css">  --(2)
/*<![CDATA[*/
table.c4 {border-collapse: collapse}
td.c3 {border-left-style: none; border-left-width: medium; border-right-style: solid; 
border-right-width: 1; border-top-style: solid; border-top-width: 1; 
border-bottom-style: none; border-bottom-width: medium}
td.c2 {border-left-style: solid; border-left-width: 1; border-right-style: none; 
border-right-width: medium; border-top-style: solid; border-top-width: 1;
border-bottom-style: none; border-bottom-width: medium}
h2.c1 {color: #FF00FF}
/*]]>*/
</style>
</head>
<body>
<blockquote>
<p> </p>
<h2 class="c1">Tidy Library</h2>
</blockquote>
<h2 class="c1">
</h2>
<table border="1" cellpadding="0" cellspacing="0" class="c4"
bordercolor="#111111" width="100%" id="AutoNumber1">
<tr>
<td width="50%" class="c2"> </td> ----(3)
<td width="50%" class="c3"> </td>
</tr>
</table>
<b>Tidy <tidy></tidy></b> ----(4)

</body>
</html>

Now let's see what Tidy cleaned for us:

  • In (1) our custom string "Ef Tidy library" is visible
  • In (2) and (3) the inline styles have been cleaned up and classes created for them
  • In (4) our <tidy> tag has become legal, though it does nothing in the actual HTML page

Here is the summary of all the errors and warnings:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Warning: <tidy> is not approved by W3C
line 23 column 1 - Warning: missing </tidy> before </body>
line 22 column 2 - Warning: missing </b> before </body>
line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
7 warnings, 0 errors were found!

Test Case #3: Using IEfTidyNode and IEfTidyAttr

These two interfaces help you gather node-by-node and attribute-by-attribute information from the tree structure of HTML cleaned by the Tidy library. Here is the code listing for finding the <head> tag and enumerating all of its attributes.

  Note: always use these two interfaces on HTML that has already been cleaned by Tidy.

' Declared at module level so cmdGetFirstChildNode_Click can reuse the node
Private tidyNode As EfTidyNode

Private Sub cmdGetNode_Click()
    ' Assumes TidyDoc (an object of iTidyCom) contains cleaned HTML,
    ' i.e. one of the four ITidyCom methods has already been applied
    Dim atr As Object
    ' Get the <head> node
    Set tidyNode = TidyDoc.EfTidyNode(TIDY_HEAD)
    ' Check for Nothing before touching the node
    If Not tidyNode Is Nothing Then
        ' Display the tag name
        txtNodeName = tidyNode.Name
        ' Enumerate all attributes of <head>, if any
        Set atr = tidyNode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & "   " & atr.Value
            Set atr = tidyNode.GetNextAttribute
        Loop
    End If
End Sub

Next, here is how to enumerate the children of the head node and get the attributes of each. I am finding the first child here; the code listing for that is:

Private Sub cmdGetFirstChildNode_Click()
    Dim localnode As EfTidyNode
    Dim atr As Object
    Set localnode = tidyNode.GetFirstChildNode
    ' Check for Nothing before touching the node
    If Not localnode Is Nothing Then
        txtNodeName = localnode.Name
        ' Enumerate all attributes of the first child, if any
        Set atr = localnode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & "   " & atr.Value
            Set atr = localnode.GetNextAttribute
        Loop
    End If
End Sub
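
The first/next style used by GetFirstAttribute and GetNextAttribute can be tried out without the component. Below is a rough Python sketch of the same loop shape. FakeNode and FakeAttr are hypothetical stand-ins I made up for illustration, not EfTidy classes, and the sketch assumes that asking for the first attribute resets the cursor and that the end of the list is signalled by an empty result (Nothing in Visual Basic, None here).

```python
class FakeAttr:
    """Hypothetical stand-in for an attribute with Name and Value."""

    def __init__(self, name, value):
        self.Name = name
        self.Value = value


class FakeNode:
    """Hypothetical stand-in that mimics a first/next attribute cursor."""

    def __init__(self, name, attrs):
        self.Name = name
        self._attrs = list(attrs)
        self._pos = 0

    def GetFirstAttribute(self):
        # Assumption: asking for the first attribute restarts the cursor
        self._pos = 0
        return self.GetNextAttribute()

    def GetNextAttribute(self):
        # Return None once every attribute has been handed out
        if self._pos >= len(self._attrs):
            return None
        attr = self._attrs[self._pos]
        self._pos += 1
        return attr


def list_attributes(node):
    """Collect (name, value) pairs, guarding against a missing node first."""
    if node is None:
        return []
    found = []
    attr = node.GetFirstAttribute()
    while attr is not None:
        found.append((attr.Name, attr.Value))
        attr = node.GetNextAttribute()
    return found


table = FakeNode("table", [FakeAttr("border", "1"), FakeAttr("width", "100%")])
print(list_attributes(table))
print(list_attributes(None))
```

The point of the sketch is the order of the checks: test the node before reading anything from it, then keep asking for the next attribute until nothing comes back.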

I also took a snapshot of the result after clicking the button for the code above.

Here I have given only a small overview of the Tidy library and EfTidyCom. For more information about the Tidy library, visit the Tidy home page: http://tidy.sourceforge.com

Author Comment

I know there is much scope for improvement in this component, especially in the IEfTidyNode and IEfTidyAttr interfaces. I promise these improvements will be there in the next version/update of the library.

Files Listing With Project

The source file contains:

  • TidyLib (original Tidy Library) source code

The project file contains:

  • Release version of the EfTidy component
  • Visual Basic test project for ItidyCom and ItidyOption (with source)
  • Visual Basic test project for iTidyNode and iTidyAttr (with source)
  • Test.htm

Update History

  • 28 November 2004 : EfTidy version 1.0 Introduced.

Special Thanks

  • My boss, Mr Saurabh Gupta, Director, Efextra eSolutions Pvt Ltd
  • Paul E. Bible, for his CCOMString class
  • The Tidy SourceForge group, for this nice library
