In this blog post, I'm going to show how to read and extract data from a PDF using a Java program. We often need to read a PDF and do some work with its data.
In Java, the PDFBox API makes this easy. PDFBox is an open-source API provided by Apache. It lets an application create and manipulate PDF documents (for example, adding or removing pages).
Before writing a sample program, here is a brief overview of this API.
What Is PDFBox?
Apache PDFBox is a free Java library that supports the development and conversion of PDF documents. Using this library, you can write Java programs that create, convert, and manipulate PDF documents. In addition, PDFBox includes a command-line utility for performing various operations on PDFs using the available JAR file.
Features of PDFBox
The following are the important features of PDFBox −
Extract Text − With the help of PDFBox, you can extract Unicode text from PDF documents.
Split & Merge − With the help of PDFBox, you can split a single PDF document into multiple documents and merge them back into a single document.
Fill Forms − With the help of PDFBox, you can fill in the form data in a document.
Print − With the help of PDFBox, you can print a PDF file using the standard Java printing API.
Save as Image − With the help of PDFBox, you can save PDFs as image files, such as PNG or JPEG.
Create PDFs − With the help of PDFBox, you can create a new PDF file from a Java application, and you can also embed images and fonts.
Signing − With the help of PDFBox, you can add digital signatures to PDF files.
Components of PDFBox
The following are the four main components of PDFBox −
PDFBox − This includes the classes and interfaces related to data extraction and manipulation.
FontBox − This includes the classes and interfaces related to fonts; using these classes, we can change the font of the text in a PDF document.
XmpBox − This includes the classes and interfaces that manipulate XMP metadata.
Preflight − This component is used to verify PDF files against the PDF/A-1b standard.
Sample Program for Printing PDF File Data Using Java
package com.sanjay;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PrintPdf {
    public static void main(String[] args) throws IOException {
        // PDDocument.load(File) is the PDFBox 2.x API; try-with-resources closes the document
        try (PDDocument pdfDocument = PDDocument.load(new File("F:/Test.pdf"))) {
            // Only extract text if the document is not encrypted
            if (!pdfDocument.isEncrypted()) {
                PDFTextStripper pdfTextStripper = new PDFTextStripper();
                // Sort the text by its position on the page
                pdfTextStripper.setSortByPosition(true);
                String pdfFileInText = pdfTextStripper.getText(pdfDocument);
                // Split on Unix or Windows line endings and print each line
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        }
    }
}
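Once the text is extracted, "doing some work with PDF data" usually starts with cleaning up the lines. The following is a minimal sketch of that step, not part of PDFBox itself: the class ExtractedTextLines and its two methods are hypothetical helpers, and the hard-coded string only stands in for the value returned by pdfTextStripper.getText(pdfDocument) in the sample program. It reuses the same line-splitting regex, drops blank lines, and finds lines containing a keyword.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a PDFBox class): post-processes text
// that was already extracted from a PDF.
class ExtractedTextLines {

    // Splits on Unix or Windows line endings (same regex as the sample program)
    // and keeps only trimmed lines that are not empty.
    static List<String> nonBlankLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Returns the non-blank lines that contain the keyword, ignoring case.
    static List<String> linesContaining(String text, String keyword) {
        String needle = keyword.toLowerCase();
        List<String> matches = new ArrayList<>();
        for (String line : nonBlankLines(text)) {
            if (line.toLowerCase().contains(needle)) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up stand-in for the string returned by pdfTextStripper.getText(pdfDocument).
        String pdfFileInText = "Invoice 42\r\n\r\n  Total: 100 USD  \nThank you";
        for (String line : linesContaining(pdfFileInText, "total")) {
            System.out.println(line);
        }
    }
}
```

In the sample program, you could pass pdfFileInText to linesContaining(...) instead of printing every line. Keeping this logic separate from the PDFBox calls also makes it easy to test without a real PDF file.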
Thanks