Extracting text from a PDF file in Java is straightforward with the Apache PDFBox library. The library provides the PDFTextStripper class, which strips text from PDF documents. PDFBox is available from the Maven repository and can be added to a project with Gradle, Maven, or another build system.
Gradle dependency:
// https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox
implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.21'
Maven dependency:
<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.21</version>
</dependency>
Example: extracting text from a PDF file, page by page:
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractPdfText {
    public static void main(final String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Please pass the file path as an argument");
            // Stop here; otherwise args[0] below would throw.
            return;
        }

        // try-with-resources closes the document when extraction is done.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            final AccessPermission ap = document.getCurrentAccessPermission();
            if (!ap.canExtractContent()) {
                throw new IOException("You do not have permission to extract text");
            }

            final PDFTextStripper stripper = new PDFTextStripper();

            // This example uses sorting, but in some cases it is more useful to
            // switch it off, e.g. in some files with columns where the PDF content
            // stream respects the column order.
            stripper.setSortByPosition(true);

            // Page numbers are 1-based.
            for (int p = 1; p <= document.getNumberOfPages(); ++p) {
                // Set the page interval to extract. If you don't, then all pages
                // would be extracted.
                stripper.setStartPage(p);
                stripper.setEndPage(p);

                // let the magic happen
                final String text = stripper.getText(document);

                // do some nice output with a header
                final String pageStr = String.format("page %d:", p);
                System.out.println(pageStr);
                for (int i = 0; i < pageStr.length(); ++i) {
                    System.out.print("-");
                }
                System.out.println();
                System.out.println(text.trim());
                System.out.println();
            }
        }
    }
}
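The PDFTextStripper class lets you restrict extraction to a page range (setStartPage and setEndPage, both 1-based and inclusive) and sort the extracted text by its position on the page (setSortByPosition).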
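As a minimal sketch of these two settings, the helper below extracts one contiguous page range in a single call instead of looping page by page. It assumes PDFBox 2.0.x on the classpath; the class name PageRangeExtractor and the method extractPages are hypothetical names chosen for illustration and are not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper (not part of PDFBox): extracts a 1-based, inclusive page range.
public class PageRangeExtractor {

    public static String extractPages(final PDDocument document,
                                      final int startPage,
                                      final int endPage,
                                      final boolean sortByPosition) throws IOException {
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException("Expected 1 <= startPage <= endPage");
        }
        final PDFTextStripper stripper = new PDFTextStripper();
        // Sorting by position helps when the content stream is not in reading order.
        stripper.setSortByPosition(sortByPosition);
        stripper.setStartPage(startPage);
        stripper.setEndPage(endPage);
        return stripper.getText(document);
    }

    public static void main(final String[] args) throws IOException {
        if (args.length != 3) {
            System.out.println("Usage: PageRangeExtractor <file> <startPage> <endPage>");
            return;
        }
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            final int start = Integer.parseInt(args[1]);
            final int end = Integer.parseInt(args[2]);
            System.out.println(extractPages(document, start, end, true).trim());
        }
    }
}
```

Passing the same number for both page arguments extracts a single page, which matches what the loop in the example above does on each iteration.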