
In Re: State and Federal Cases and Codes

Released by Public.Resource.Org as Test Data

README FILE

This file is http://bulk.resource.org/courts/gov/0README.html and was last revised on Fri Aug 17 14:41:38 PDT 2007. The goals of this project are:

  1. The short-term goal is the creation of an unencumbered full-text repository of the Federal Reporter, the Federal Supplement, and the Federal Appendix.
  2. The medium-term goal is the creation of an unencumbered full-text repository of all state and federal cases and codes.

Some things to think about:

  1. “Works of the federal government are not subject to copyright protection; the text of judicial decisions may therefore be copied at will.” Matthew Bender & Co., 158 F.3d 674 (2nd Cir. 1998), ¶ 13.
  2. “Anyone searching for, analyzing, and then citing authority relevant to a current matter must work with the old as well as the new.” Peter W. Martin, Neutral Citation, Court Web Sites and Access to Authoritative Case Law, 99 Law Libr. J. 329 (2007) at ¶ 30.
  3. “It may be proper to remark that the court are unanimously of opinion, that no reporter has or can have any copyright in the written opinions delivered by this court; and that the judges thereof cannot confer on any reporter any such right.” Wheaton v. Peters, 33 U.S. (8 Pet.) 591, 668 (1834).
  4. “The judicial opinions of both state and federal courts are in the public domain and are therefore not subject to copyright.” West Publishing Company v. Mead Data Central, Inc., 799 F.2d 1219 (8th Cir. 1986) (Concurrence of Judge Oliver quoting Nimmer on Copyright).

This distribution directory contains the following items:

  1. 2f.ultrafiche.tif is a 3.6-gigabyte, uncompressed TIFF file (43,891 by 31,243 pixels). The original material is an "ultrafiche", which measures 4" x 6" and holds up to 1000 pages of text photographically reduced 87x. Our test file is an 8-bit-per-sample RGB (i.e., 24 bits per pixel) scan at 8000 dpi using a wet mount on a drum scanner. The scan is of volume 2 of the first series of the Federal Reporter.

  2. In addition to the composite file, the first 50 pages have been cut out and saved as files named 2f.[num].tif, where num runs from 1 to 50. These are raw files with no post-processing.
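
For anyone scripting against this directory, here is a minimal Python sketch that lists the per-page file names and checks the composite scan's physical extent. It assumes the page numbers are written without zero padding (2f.1.tif rather than 2f.01.tif), which this README does not spell out; the function names are ours and purely illustrative.

```python
# Values taken from the README text above.
SCAN_DPI = 8000
COMPOSITE_PIXELS = (43891, 31243)  # width, height of 2f.ultrafiche.tif


def page_filenames(prefix="2f", first=1, last=50):
    """Names of the per-page files, ASSUMING plain unpadded numbering."""
    return [f"{prefix}.{n}.tif" for n in range(first, last + 1)]


def scan_extent_inches(pixels=COMPOSITE_PIXELS, dpi=SCAN_DPI):
    """Physical area covered by the composite scan, in inches."""
    width_px, height_px = pixels
    return width_px / dpi, height_px / dpi


if __name__ == "__main__":
    names = page_filenames()
    print(names[0], "...", names[-1])
    width_in, height_in = scan_extent_inches()
    print(f"scanned area: {width_in:.2f} x {height_in:.2f} inches")
```

At 8000 dpi the composite works out to roughly 5.5 by 3.9 inches, which fits within the 4" x 6" fiche.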

A series of open source tools are being used to whack this data:

  1. A variety of image processing tools are being used, including ImageMagick and GIMP. The images in this directory are raw, but our workflow further processes them by applying an unsharp mask, adjusting RGB levels to knock out the background, and upsampling the image to an effective resolution of 300 dpi for use in OCR.
  2. Our OCR engine is Tesseract, which was developed by HP through 1995, lay fallow for a decade, and has recently been revived. Some early tests with a greyscale sample yielded very satisfactory initial results.
  3. For layout analysis (e.g., identification of indented paragraphs, header lines, paragraph breaks), we are experimenting with Google's OCRopus.
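
The processing steps above can be sketched as a pair of command lines. The Python below only builds the argument lists; it does not run anything and is not our actual workflow. The unsharp and level values are placeholder assumptions to be tuned per scan, the intermediate file name is hypothetical, and we read "effective resolution" as resolution relative to the original printed page (scan dpi divided by the 87x reduction). The ImageMagick command is written as "convert"; newer ImageMagick releases name it "magick".

```python
def build_pipeline(src, out_base, scan_dpi=8000, reduction=87, target_dpi=300):
    """Return (ImageMagick command, Tesseract command) as argument lists."""
    # Upscale factor that takes the page from scan_dpi / reduction
    # (about 92 dpi) to target_dpi, as a whole percentage.
    scale_pct = int(round(100 * target_dpi * reduction / scan_dpi))
    cleaned = f"{out_base}.clean.tif"  # hypothetical intermediate file name

    convert_cmd = [
        "convert", src,
        "-unsharp", "0x1",        # unsharp mask; sigma is a placeholder
        "-level", "20%,80%",      # push background to white; placeholder values
        "-resize", f"{scale_pct}%",
        "-units", "PixelsPerInch",
        "-density", str(target_dpi),
        cleaned,
    ]
    # Tesseract takes an input image and an output base name.
    ocr_cmd = ["tesseract", cleaned, out_base]
    return convert_cmd, ocr_cmd


if __name__ == "__main__":
    for cmd in build_pipeline("2f.1.tif", "2f.1"):
        print(" ".join(cmd))
```

Keeping the commands as lists makes them easy to hand to subprocess.run once the parameters have been tuned against a few sample pages.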

This initial release of test data will be supplemented with additional scans. Our goal is to provide full scans of the 300 volumes of the first series of the Federal Reporter over the next few months, followed by the remaining volumes of the Reporters, Supplements, and Appendices.

The scanner we are using is an AZTEK drum scanner with the "KAMI" wet mount.

This scanner works over a large area (e.g., our 4" x 6" ultrafiches as opposed to a 35mm slide) at a full 8000 dpi, which corresponds to a pixel size of about 3 microns. The highest-end flatbed/CCD scanners, such as the CREO, have a 5000 dpi optical resolution, which is insufficient to capture the text on an ultrafiche at a resolution suitable for OCR.
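
The resolution figures can be checked with a little arithmetic. The short Python sketch below converts dpi to pixel size and estimates the resolution relative to the original printed page, assuming every page on the fiche is reduced by the full 87x; the helper names are illustrative only.

```python
MICRONS_PER_INCH = 25400.0
REDUCTION = 87  # photographic reduction of the ultrafiche, from this README


def pixel_size_microns(dpi):
    """Width of one scanned pixel on the fiche, in microns."""
    return MICRONS_PER_INCH / dpi


def effective_page_dpi(scan_dpi, reduction=REDUCTION):
    """Resolution relative to the original page, ASSUMING a uniform reduction."""
    return scan_dpi / reduction


if __name__ == "__main__":
    for name, dpi in (("drum scanner", 8000), ("flatbed/CCD", 5000)):
        print(
            f"{name}: {pixel_size_microns(dpi):.2f} micron pixels, "
            f"about {effective_page_dpi(dpi):.0f} dpi on the original page"
        )
```

Under these assumptions the drum scan comes to about 3.2 microns per pixel and roughly 92 dpi on the original page, against about 5.1 microns and roughly 57 dpi for a 5000 dpi scanner.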

** Attention: Law Librarians ** To recycle your used ultrafiche or books:

  1. All shipments are CPT Destination.
  2. Please obtain an RMA number from us before making your donation.

Carl AT media.org for Public.Resource.Org