Conversion from SGML to HTML - Proof of Concept
The primary purpose of this proof of concept was to test the feasibility of automatically
converting SGML from the Office of the Federal Register into HTML for presentation.
A secondary goal was to convert SGML to XML to provide a document format which can be further
manipulated with existing XML tools.
This proof of concept took about one week to program, plus another day of 'tweaking'.
The Proof Of Concept:
The original SGML file: 203149T.txt
The resulting XML (with stylesheet): TransformedOutput.xml
(Open it, then choose "view-source" to see the XML created by the conversion tool.)
The XSL stylesheet: Environment.xsl
Notes on the XSL stylesheet:
- XSL is used mainly where XHTML tags are required (mostly horizontal rules).
- Cascading Style Sheets are used for all presentation. See the CSS here.
- The CSS class names are identical to the original SGML tag names. An XSL 'hack' converts SGML tags into class names in the XHTML to work around a bug in Internet Explorer 7.
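The tag-to-class-name idea can be illustrated outside of XSL. The following is a minimal Python sketch, not the stylesheet used in the POC: it assumes the SGML has already been converted to well-formed XML, and the BLOCK_TAGS set, the function names, and the sample tags SECTNO, P, and E are hypothetical choices made only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice: tags rendered as block-level <div>; all others become <span>.
BLOCK_TAGS = {"SECTION", "CHAPTER", "CFRTOC", "P"}

def to_xhtml(elem):
    """Return an XHTML element whose class attribute carries the source tag name."""
    html_tag = "div" if elem.tag in BLOCK_TAGS else "span"
    out = ET.Element(html_tag, {"class": elem.tag})
    out.text = elem.text
    for child in elem:
        converted = to_xhtml(child)
        converted.tail = child.tail  # keep text that follows the child element
        out.append(converted)
    return out

def convert(xml_text):
    """Parse converted XML and serialize the class-based XHTML fragment."""
    root = ET.fromstring(xml_text)
    return ET.tostring(to_xhtml(root), encoding="unicode")

# Illustrative input only; not taken from 203149T.txt.
sample = "<SECTION><SECTNO>1.1</SECTNO><P>Text with <E>emphasis</E>.</P></SECTION>"
print(convert(sample))
```

Because every element keeps its original tag name as a class, a CSS rule such as .SECTNO can style it without the browser needing to understand the SGML vocabulary.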
A fully-XHTML version: TransformedOutputHTML.xml
(After clicking it, you may need to choose "view-source" to see the XHTML which was generated by the XSL stylesheet)
To create a production version:
- Rewrite the POC tool to better handle "top-down" tags that enclose large portions of text, for example CFRTOC, SECTION, and CHAPTER.
- Write XSL to handle SGML tables and convert them (as well as possible) into XHTML tables.
- Provide handling of embedded images.
- Generate CSS for all titles and all SGML tags.
- Test on all years of the CFR.
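As one possible starting point for the CSS generation step, the distinct tag names in a converted file can be collected and written out as empty CSS rules to be filled in by hand. This is a rough Python sketch under the assumption that the input is the well-formed XML produced by the conversion tool; the function name css_skeleton and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET

def css_skeleton(xml_text):
    """Return one empty CSS class rule per distinct tag name, sorted by name."""
    root = ET.fromstring(xml_text)
    tags = sorted({elem.tag for elem in root.iter()})
    return "\n".join(f".{tag} {{ }}" for tag in tags)

# Illustrative input only; a real run would read each converted CFR title.
sample_xml = "<SECTION><SECTNO>1.1</SECTNO><P>a</P><P>b</P></SECTION>"
print(css_skeleton(sample_xml))
```

Running this over every title and year would also show which SGML tags appear anywhere in the CFR, which is useful input for the testing step.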
Note: Mrudula Munagala was the original programmer of the POC conversion tool.