Text Extraction

The text extraction example extracts all text from a PDF document. Calling document.getPageText() initializes the page content and parses only the text, ignoring images and other non-vital data structures.

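import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import org.icepdf.core.pobjects.graphics.text.LineText;
import org.icepdf.core.pobjects.graphics.text.PageText;

// create a file to write the extracted text to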
File file = new File("extracted_text.txt");
FileWriter fileWriter = new FileWriter(file);

// Get the text from every page of the document, skipping any page
// that has no text to extract.
for (int pageNumber = 0, max = document.getNumberOfPages();
     pageNumber < max; pageNumber++) {
    PageText pageText = document.getPageText(pageNumber);
    System.out.println("Extracting page text: " + pageNumber);
    if (pageText != null && pageText.getPageLines() != null) {
        ArrayList<LineText> pageLines = pageText.getPageLines();
        for (LineText lineText : pageLines) {
            fileWriter.write(lineText.toString());
            fileWriter.write('\n');
        }
    }
}

// close the writer
fileWriter.close();
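
The loop above only reaches fileWriter.close() if no write fails. A minimal sketch of the same write loop using try-with-resources, so the writer is closed even when an exception is thrown, might look like the following. The class PageTextWriter and the method writePages are hypothetical names used for illustration and are not part of ICEpdf; plain strings stand in for the values returned by LineText.toString() so that the sketch runs without the library.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration only; not part of ICEpdf.
class PageTextWriter {

    // Writes every line of every page to 'out', one line per row, and
    // returns the number of lines written. Each inner list stands in for
    // the lines of one page (assumed to be LineText.toString() values).
    // Null pages are skipped, mirroring the null check in the example above.
    static int writePages(Path out, List<List<String>> pages) throws IOException {
        int written = 0;
        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter writer =
                 Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (List<String> page : pages) {
                if (page == null) {
                    continue;
                }
                for (String line : page) {
                    writer.write(line);
                    writer.write('\n');
                    written++;
                }
            }
        }
        return written;
    }
}
```

In the ICEpdf example, the same pattern could be applied by declaring the FileWriter in the try header and keeping the page loop inside the try body.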

The source code for this example is located at:

A primer on using Maven or Gradle build commands can be found here (Maven) and here (Gradle)

© Copyright 2017 ICEsoft Technologies Canada Corp.