And if you don't know what type a given file is, there are various ways to determine it programmatically: http://www.rgagnon.com/javadetails/java-0487.html
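One common way to do this is to look at the first few bytes of the file (its "magic number"). The following is a minimal sketch of that idea using only the JDK; the class name `FileTypeSniffer` and the short signature table are assumptions made for illustration and are not taken from the linked page. A ZIP signature only says that the file is a ZIP container, which also covers DOCX, XLSX and ODF files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses a file type from its leading "magic" bytes. */
final class FileTypeSniffer {

    // Well-known signatures; this table is illustrative, not exhaustive.
    private static final byte[] PDF  = {'%', 'P', 'D', 'F'};
    private static final byte[] PNG  = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] GIF  = {'G', 'I', 'F', '8'};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] ZIP  = {'P', 'K', 3, 4};
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};

    /** Classifies the given leading bytes of a file. */
    static String detect(byte[] head) {
        if (startsWith(head, PDF))  return "PDF";
        if (startsWith(head, PNG))  return "PNG";
        if (startsWith(head, GIF))  return "GIF";
        if (startsWith(head, JPEG)) return "JPEG";
        if (startsWith(head, ZIP))  return "ZIP container";   // DOCX, XLSX, ODF, JAR, ...
        if (startsWith(head, OLE2)) return "OLE2 container";  // DOC, XLS, PPT, MSG, ...
        return "unknown";
    }

    /** Reads the first bytes of a file and classifies them (needs Java 11+). */
    static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Calling `FileTypeSniffer.detect(Path.of("report.pdf"))` (a made-up file name) would return `"PDF"` for a well-formed PDF file. The JDK also offers `Files.probeContentType(Path)`, but its result depends on the platform and often on the file extension alone.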
An interesting article about Microsoft's binary file formats, especially DOC and XLS, is "Why are the Microsoft Office file formats so complicated? (And some workarounds)". It also mentions some alternatives to dealing with those formats directly.
... and some advice on how to convey information, courtesy of XKCD
- Jackcess - library to read and write MDB files
- HXTT Access - commercial pure Java JDBC driver for MS Access
- JChm - library to read CHM files
- Apache Commons CSV and opencsv are libraries to read and write CSV files. CSV is not as easy to read and write as it first looks - once all the special cases are considered, one might as well use a library.
- POI - library to read and write XLS and XLSX files
- jXLS - library for writing XLS files based on templates
- Obba works with Excel spreadsheets on Windows
- Both GDBI and GenJ are dormant, although possibly still functional, and they should contain code to handle this format.
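
To illustrate the CSV special cases mentioned in the list above, here is a minimal sketch of a quote-aware splitter for a single record. The class name `CsvLineSketch` is made up for this example. It handles commas inside quoted fields and doubled quotes, but not quoted fields that span several lines, which is one more reason to prefer a library such as Apache Commons CSV or opencsv.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits one CSV record; not a replacement for a CSV library. */
final class CsvLineSketch {

    /** Splits a single line, honoring double-quoted fields and "" escapes. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote stands for a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are plain text inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }
}
```

A naive `line.split(",")` would break the second field of `a,"b,c"` into two pieces, whereas `parseLine` keeps it together as `b,c`.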
HDF (Hierarchical Data Format)
- Hapi (not to be confused with the node.js framework of the same name) is "an open-source, object-oriented HL7 2.x parser for Java"
Image and movie files
- ImageJ - Java image processing application and library that has plugins for lots of image file formats
- JIMI - library to read and write BMP, CUR, GIF, ICO, JPEG, PICT, PNG, PSD, Sun Raster, TGA, TIFF, XBM and XPM. There's a plugin for using JIMI with ImageJ, which also includes a couple of JIMI patches.
- JAI Image I/O Tools - adds GIF write support and TIFF, RAW, PNM and JPEG2000 read/write support to ImageIO
- TwelveMonkeys - additions for the ImageIO API
- ImageIO-Ext also improves on ImageIO in various ways
- MP4 parser
- Apache Commons Imaging is a library that reads and writes a variety of image formats, including fast parsing of image info (size, color space, ICC profile, etc.) and metadata.
- Aspose is a commercial library that claims to support Photoshop and Illustrator manipulation and conversion to bitmap formats
- ini4j is a simple Java API for handling configuration files in Windows .ini format
- ODFDOM is a Java library for accessing ODF files.
- jOpenDocument.org has an open-source library for accessing all Open Document file types.
- Obba works with OpenOffice spreadsheets
Office Open XML
- These are the XML-based Microsoft Office formats, standardized by ECMA, but implemented by Microsoft in a non-compliant way
- docx4j - create and edit docx documents using a JAXB content model matching the WordML schema
- Apache POI implements these formats.
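
Office Open XML files are ZIP packages that contain XML parts, so their structure can be inspected with the JDK alone before reaching for one of the libraries above. The following is a minimal sketch under that assumption; the class name `OoxmlPartLister` is hypothetical, and the code only lists part names rather than interpreting them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: lists the parts (ZIP entries) of a DOCX, XLSX or PPTX file. */
final class OoxmlPartLister {

    /** Reads the stream as a ZIP archive and returns the entry names in order. */
    static List<String> listParts(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            // getNextEntry() returns null once the archive is exhausted
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }
}
```

For a typical Word document opened with `Files.newInputStream(...)`, the list would be expected to include entries such as `[Content_Types].xml` and `word/document.xml`. Reading or editing the content of those parts is where docx4j or POI come in.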
OpenOffice Java API
- OpenOffice can read a number of file formats, and makes them accessible through its API. A starting point might be this article and of course the OO developer site.
- Some introductory information about the OO file format can be found here
- JODConverter is a Java library that uses the OO Java API to perform document conversions between any formats supported by OO
Outlook / PST
- The Apache POI project developed some code that can read the textual contents of Outlook's MSG files. This page talks about that.
- JPST (commercial) can read and extract PST files.
- java-libpst is a pure Java library that can access 64bit PST files.
- PDF is a hard-to-read format. The best one can do is to try to extract the text contained in a PDF file.
- OpenPDF is a library to create PDFs built on top of iText2, but still licensed under a business-friendly license. See OpenPDFExample for code to get you started - more examples - javadocs
- PDFBox - library that can create, merge, split and print PDFs, extract text, create images from PDFs, encrypt/decrypt PDFs, fill in PDF forms and more. See PdfBox for example code of how to use it to create a PDF.
- Boxable is built on top of PDFBox and makes it easier to add tables to PDFs
- FOP - library to create PDFs (and other formats) from XML by using XSL-FO transformations
- FlyingSaucer - library to convert CSS-styled XHTML to PDF
- JPedal - library for viewing and printing PDFs, can also extract text (how to print PDFs); commercial (the LGPL version provides PDF viewing only)
- PDFxStream - commercial library to extract text from PDFs
- PDF Renderer is a PDF viewer that renders using Java2D. Examples, Printing PDFs
- tabula-java can extract data from tables in PDFs
- pdf2dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further.
- ICEPdf is a set of commercial libraries that can render PDFs.
- Qoppa offers numerous commercial libraries for PDF-related tasks
- Aspose.Pdf for Java is a commercial library for reading and writing PDFs
- OrsonPDF is a fast, lightweight PDF generator for the Java platform
- iText is a commercial library for creating PDFs. An open source version is available under the business-unfriendly AGPL license.
- The Apache POI project developed some code that can open and (to a limited extent) edit PPT files. This page talks about it.
- The MPXJ library can work with several Project file formats.
QIF (used by Microsoft Money and Quicken)
- Buddi and Eurobudget are Java applications that can import and export QIF files (and thus contain code you may be able to use in your application). Both are licensed under the GPL.
- GnuCash is written in C, but also handles QIF, so its code may be useful information.
- jRTF can create RTFs
- iText 2 can create RTFs: jar file, javadocs
- JavaCC is a parser generator for which an RTF grammar is available. An RTF reader can be constructed from that grammar.
- The Apache POI project developed some code that can read Visio files. This page talks about that.
- POI - library to read and write DOC and DOCX files. It can also be used for extracting the text of a document.
- WordApi.exe is a native Windows component with a Java interface, which lets you create Word documents and alter Word templates. Some impressions about it can be found here.
- Java2Word - library to create Word documents, especially reports, on the fly.
- If you encounter an obscure format for which no library is available, it may be feasible to create a reader for it if you have a file format description (which may be available on Wotsit, see link above). Several tools, so-called lexer and parser generators, help in creating a reader, especially if the file format is text-based (e.g. ASCII) rather than binary. You will need knowledge of regular expressions, though. Some file formats that have been tackled using this approach include RTF, CSV, HPGL and PBM/PGM/PPM. Lexers are easier to start with, but parsers can do more of the work for you. All these have ready-to-use examples on their web sites.
- Lexers: JFlex (introductory article in the JavaRanch Journal)
- Parsers: Antlr, JavaCC
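
To give a feel for the lexer approach without any generator, here is a minimal hand-written sketch for the plain-text (ASCII) variant of PGM, which starts with the magic token `P2`, followed by width, height, the maximum gray value and then the pixel values. The class name `PlainPgmReader` and its error handling are assumptions made for illustration; a generated lexer from JFlex would replace the `tokenize` method. The sketch assumes that `#` comments start at the beginning of a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: reads a plain (ASCII, "P2") PGM image from a string. */
final class PlainPgmReader {

    // A token is either a comment (from '#' to end of line) or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("#[^\\n]*|\\S+");

    /** Lexer step: splits the text into tokens and drops comments. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (!m.group().startsWith("#")) {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    /** Parser step: interprets the token stream as header plus pixel rows. */
    static int[][] read(String text) {
        List<String> t = tokenize(text);
        if (t.size() < 4 || !t.get(0).equals("P2")) {
            throw new IllegalArgumentException("not a plain PGM file");
        }
        int width = Integer.parseInt(t.get(1));
        int height = Integer.parseInt(t.get(2));
        int maxVal = Integer.parseInt(t.get(3));
        if (t.size() != 4 + width * height) {
            throw new IllegalArgumentException("unexpected number of pixel values");
        }
        int[][] pixels = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int value = Integer.parseInt(t.get(4 + y * width + x));
                if (value < 0 || value > maxVal) {
                    throw new IllegalArgumentException("pixel value out of range: " + value);
                }
                pixels[y][x] = value;
            }
        }
        return pixels;
    }
}
```

The split into `tokenize` and `read` mirrors the lexer/parser division described above: the first step only recognizes tokens, the second gives them meaning.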