In Re: State and Federal Cases and Codes
README FILE
This file is http://bulk.resource.org/courts/gov/0README.html and was last
revised on Fri Aug 17 14:41:38 PDT 2007.
The goals of this project are:
- The short-term goal is the creation of an unencumbered full-text repository of the
Federal Reporter, the Federal Supplement, and the Federal Appendix.
- The medium-term goal is the creation of an unencumbered full-text repository of all
state and federal cases and codes.
Some things to think about:
- “Works of the federal government are not subject to copyright protection; the text of judicial
decisions may therefore be copied at will.”
Matthew Bender & Co., 158 F.3d 674 (2nd Cir. 1998), ¶ 13.
- “Anyone searching for, analyzing, and then citing authority relevant
to a current matter must work with the old as well as the new.”
Peter W. Martin, Neutral Citation, Court Web Sites and Access to Authoritative Case Law,
99 Law Libr. J. 329 (2007) at ¶ 30.
- “It may be proper to remark that the court are unanimously of opinion, that
no reporter has or can have any copyright in the written opinions delivered by
this court; and that the judges thereof cannot confer on any reporter any such right.”
Wheaton v. Peters, 33 U.S. (8 Pet.) 591, 668 (1834).
- “The judicial opinions of both state and federal courts are in the public domain and are therefore not subject to copyright.”
West Publishing Company v. Mead Data Central, Inc., 799 F.2d 1219 (8th Cir. 1986)
(Concurrence of Judge Oliver quoting Nimmer on Copyrights).
This distribution directory contains the following items:
- 2f.ultrafiche.tif is a 3.6-gigabyte, uncompressed TIFF file (43,891
by 31,243 pixels). The original material is an "ultrafiche", which measures
4" x 6" and holds up to 1000 pages of text at an 87x reduction (photo).
Our test file is an 8-bit-per-sample RGB (i.e., 24 bits per pixel) scan at 8000 dpi
using a wet mount on a drum scanner. The scan is of volume 2 of the first edition
of the Federal Reporter.
- In addition to the composite file, the first 50 pages have been cut and
pasted into separate files named 2f.[num].tif, where num runs from 1 to 50.
These are raw files with no post-processing.
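As a quick sanity check on the items above, the following minimal sketch lists the per-page file names and works out the physical area covered by the composite scan. The function names are hypothetical; only the 2f.[num].tif pattern, the pixel dimensions, and the 8000 dpi figure come from this file.

```python
# Hypothetical helpers; only the naming pattern and the numbers are from the README.

def page_filenames(first=1, last=50, prefix="2f"):
    """Return the per-page file names 2f.1.tif through 2f.50.tif."""
    return [f"{prefix}.{n}.tif" for n in range(first, last + 1)]


def scan_area_inches(width_px=43_891, height_px=31_243, dpi=8000):
    """Convert the composite scan's pixel dimensions to inches at the scan resolution."""
    return width_px / dpi, height_px / dpi


names = page_filenames()
print(names[0], names[-1], len(names))   # 2f.1.tif 2f.50.tif 50

width_in, height_in = scan_area_inches()
print(f"{width_in:.2f} x {height_in:.2f} inches")   # 5.49 x 3.91 inches
```

The resulting area fits inside the 4" x 6" ultrafiche, which is what one would expect if the scan covers the image area rather than the full sheet.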
A series of open source tools are being used to whack this data:
- A variety of image processing tools are being used, including
ImageMagick and the Gimp. The images in this
directory are raw, but our workflow further processes these images
by applying an unsharp mask, adjusting RGB levels to knock
out the background, and upsampling the image to an effective resolution
of 300 dpi for use in OCR.
- Our OCR engine is Tesseract,
which was developed by HP through 1995, lay fallow for a decade,
and has recently been revived. Some early tests with a greyscale
sample yielded very satisfactory initial results.
- For layout analysis (e.g., identification of indented paragraphs,
header lines, paragraph breaks), we are experimenting with Google's OCRopus.
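The processing steps listed above can be sketched as a pair of command lines, built here in Python. This is only an illustration under stated assumptions, not the project's actual workflow: the unsharp and level settings are placeholder values, the output file names are hypothetical, `convert` is the ImageMagick 6 command name (ImageMagick 7 uses `magick`), and "effective resolution" is read as resolution relative to the original printed page after the 87x reduction.

```python
import shlex


def build_commands(src, cleaned, text_base,
                   scan_dpi=8000, reduction=87, target_dpi=300):
    """Build ImageMagick and Tesseract command lines for one page image.

    The filter settings below are illustrative guesses, not values from the README.
    """
    # Resolution of the scan relative to the original page (assumed interpretation).
    effective_dpi = scan_dpi / reduction
    # Upsampling factor needed to reach the target resolution, as a percentage.
    scale_percent = 100 * target_dpi / effective_dpi

    convert_cmd = [
        "convert", src,
        "-unsharp", "0x1",            # unsharp mask (placeholder radius x sigma)
        "-level", "20%,80%",          # stretch levels to knock out the background (placeholder)
        "-resize", f"{scale_percent:.2f}%",
        "-units", "PixelsPerInch",
        "-density", str(target_dpi),  # record the new resolution in the output file
        cleaned,
    ]
    # Tesseract writes its recognized text to <text_base>.txt.
    ocr_cmd = ["tesseract", cleaned, text_base]
    return convert_cmd, ocr_cmd


convert_cmd, ocr_cmd = build_commands("2f.1.tif", "2f.1.clean.tif", "2f.1")
print(shlex.join(convert_cmd))
print(shlex.join(ocr_cmd))
```

The commands are only printed, not executed, so the settings can be reviewed and tuned against a sample page before being run with something like `subprocess.run(convert_cmd, check=True)`.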
This initial release of test data will be supplemented over the next
few months with additional scans. Our goal is to provide full scans
of the 300 volumes of the first series of the Federal Reporter first,
followed by the remaining volumes of the Reporters, Supplements, and Appendices.
The scanner we are using is an AZTEK drum scanner
with the "KAMI" wet mount.
This scanner works over a large area (e.g., our 4" x 6" ultrafiches
as opposed to a 35mm slide) at a full 8000 dpi, which is roughly a 3 micron
pixel size. The highest-end flatbed/CCD scanners, such as the CREO,
have a 5000 dpi optical resolution, which is insufficient to get the
text out of an ultrafiche at a resolution suitable for OCR.
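The resolution argument above comes down to simple arithmetic, worked through in the sketch below. The function names are hypothetical, and dividing the scan resolution by the 87x reduction is our own way of expressing how much detail of the original page each scanner captures.

```python
MICRONS_PER_INCH = 25_400


def pixel_pitch_microns(dpi):
    """Size of one scanned pixel on the fiche, in microns."""
    return MICRONS_PER_INCH / dpi


def effective_page_dpi(scan_dpi, reduction=87):
    """Resolution relative to the original printed page after the photographic reduction."""
    return scan_dpi / reduction


for name, dpi in [("drum scanner", 8000), ("flatbed/CCD", 5000)]:
    print(f"{name}: {pixel_pitch_microns(dpi):.3f} micron pixels, "
          f"about {effective_page_dpi(dpi):.0f} dpi of the original page")
```

At 8000 dpi the scan captures roughly 92 dpi of the original page, against roughly 57 dpi at 5000 dpi, so the drum scanner starts the upsampling step with about 1.6 times the linear detail.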
** Attention: Law Librarians ** To recycle your used ultrafiche or books:
- All shipments are CPT Destination.
- Please obtain an RMA number from us before making your donation.
Carl AT media.org for Public.Resource.Org