Extract and Strip Text From PDF in Java Example


Extracting text from a PDF file in Java is straightforward with the Apache PDFBox library. It provides the PDFTextStripper class, which strips the text out of a PDF document. The library is published to the Maven repository, so it can be added to a project with Gradle, Maven, or any other build system that resolves Maven artifacts.

Gradle dependency:

// https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox
// The "compile" configuration was removed in Gradle 7; use "implementation" instead.
implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.21'

Maven dependency:

<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.21</version>
</dependency>

Example that extracts text from a PDF file:

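import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractPdfText {

	public static void main(final String[] args) throws IOException {
		if (args.length != 1) {
			System.err.println("Please pass the PDF file path as the only argument");
			// Stop here; otherwise args[0] below would throw when no argument is given.
			return;
		}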

		try (PDDocument document = PDDocument.load(new File(args[0]))) {
			final AccessPermission ap = document.getCurrentAccessPermission();
			if (!ap.canExtractContent()) {
				throw new IOException("You do not have permission to extract text");
			}

			final PDFTextStripper stripper = new PDFTextStripper();

			// This example sorts text by position. In some cases it is more useful to
			// switch sorting off, e.g. for files with columns where the PDF content
			// stream already respects the column order.
			stripper.setSortByPosition(true);

			for (int p = 1; p <= document.getNumberOfPages(); ++p) {
				// Restrict extraction to the current page. Without a page interval,
				// getText() would extract all pages at once.
				stripper.setStartPage(p);
				stripper.setEndPage(p);

				// let the magic happen
				final String text = stripper.getText(document);

				// do some nice output with a header
				final String pageStr = String.format("page %d:", p);
				System.out.println(pageStr);
				for (int i = 0; i < pageStr.length(); ++i) {
					System.out.print("-");
				}
				System.out.println();
				System.out.println(text.trim());
				System.out.println();

			}
		}
	}
}

The PDFTextStripper class lets you limit extraction to a page range with setStartPage and setEndPage, and sort the extracted text by its position on the page with setSortByPosition.
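
When the page range comes from user input, it may help to check it against the document's page count before passing it to the stripper. The following is a minimal sketch of such a check, not part of PDFBox: the class PageRange and its clamp method are hypothetical names introduced for illustration. It assumes the same convention the example above uses, where page numbers start at 1 and the end page is included.

```java
// Hypothetical helper, not part of PDFBox: limits a requested page range
// to the pages that actually exist in a document.
final class PageRange {

    private PageRange() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Returns {start, end} as 1-based inclusive page numbers limited to
     * [1, pageCount], or an empty array if the range selects no pages.
     */
    static int[] clamp(final int requestedStart, final int requestedEnd, final int pageCount) {
        final int start = Math.max(1, requestedStart);
        final int end = Math.min(pageCount, requestedEnd);
        if (pageCount < 1 || start > end) {
            return new int[0];
        }
        return new int[] { start, end };
    }
}
```

One possible use is to call PageRange.clamp(from, to, document.getNumberOfPages()) and, if the result is not empty, pass its two values to stripper.setStartPage and stripper.setEndPage before calling stripper.getText(document). Returning an empty array for an unusable range keeps the decision about what to do (skip, warn, or fail) with the caller.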

