Text Extraction

The text extraction example extracts all text from a PDF document. Calling document.getPageText() initializes the page content and parses only the text content, ignoring images and other non-vital data structures.

    // 'document' is assumed to be an already opened Document instance

    // create a file to write the extracted text to
    File file = new File("extracted_text.txt");
    FileWriter fileWriter = new FileWriter(file);

    // Get text from every page of the document, assuming that there
    // is text to extract.
    for (int pageNumber = 0, max = document.getNumberOfPages();
         pageNumber < max; pageNumber++) {
        PageText pageText = document.getPageText(pageNumber);
        System.out.println("Extracting page text: " + pageNumber);
        // skip pages that have no extractable text
        if (pageText != null && pageText.getPageLines() != null) {
            ArrayList<LineText> pageLines = pageText.getPageLines();
            for (LineText lineText : pageLines) {
                fileWriter.write(lineText.toString());
                fileWriter.write('\n');
            }
        }
    }

    // close the writer
    fileWriter.close();

The source code for this example is located at:
A primer on using Maven or Gradle build commands can be found here (Maven) and here (Gradle).
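
The extraction loop in the example closes the FileWriter only if no exception is thrown along the way. As a minimal sketch of one way to harden the write step, the code below runs the same line-by-line loop inside a try-with-resources block. It is not taken from the ICEpdf sources: the class name PageTextWriter and the method writePages are hypothetical, and plain strings stand in for the output of lineText.toString() so the sketch runs without ICEpdf on the classpath.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the ICEpdf API.
class PageTextWriter {

    // Writes every line of every page to the writer, one line per row.
    // Each inner list is a stand-in for the lines of one page; a null
    // entry models a page without extractable text and is skipped,
    // mirroring the null check in the original example.
    static void writePages(List<List<String>> pages, Writer writer) throws IOException {
        for (List<String> pageLines : pages) {
            if (pageLines == null) {
                continue;
            }
            for (String line : pageLines) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in real use these strings would come from
        // lineText.toString() for each LineText of each page.
        List<List<String>> pages = List.of(
                List.of("First line of page one", "Second line of page one"),
                List.of("Only line of page two"));

        Path target = Path.of("extracted_text.txt");
        // try-with-resources closes the writer even if a write fails
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writePages(pages, writer);
        }
    }
}
```

Passing a Writer instead of a file name keeps the loop independent of where the text ends up, so the same method can target a file, a StringWriter or any other Writer. Naming the charset explicitly avoids depending on the platform default encoding.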
© Copyright 2017 ICEsoft Technologies Canada Corp.