PDF (Portable Document Format) files are commonly used to save text and image data for offline use. Sometimes a PDF file is also used to display text and graphics on a web page, and a web viewer is generally used to embed the file in the browser. When a PDF file is embedded this way, its text and graphics are not added to the HTML of the page. Since the PDF content is not rendered as part of the web page, this can have a negative impact on SEO. To overcome this issue, you can extract the text content from the PDF and include it on the web page.
The PDF Parser library is very helpful for extracting elements from PDF files using PHP. This PHP library parses PDF files and extracts the text content from all the pages. Objects, headers, metadata, and text can be parsed from a PDF file. This tutorial will show you how to extract text from PDF files using PHP.
In this example script, we will use the PDF Parser library to extract text from PDF with PHP. Also, we will show how you can upload PDF files and extract text data on the fly using PHP.
Run the following command to install the PDF Parser library using Composer.
composer require smalot/pdfparser
Note: You don't need to download the PDF Parser library separately; all the required files are included in the source code. Download the source code if you want to install and use PDF Parser without Composer.
Include autoloader to load PDF Parser library and helper functions in the PHP script.
include 'vendor/autoload.php';
The following code snippet extracts all the text content from PDF file using PHP.
Parse the PDF file using the parseFile() function of the PDF Parser class.
Extract the text using the getText() method of the document object returned by parseFile().
// Initialize and load PDF Parser library
$parser = new \Smalot\PdfParser\Parser();
// Source PDF file to extract text
$file = 'path-to-file/Brochure.pdf';
// Parse pdf file using Parser library
$pdf = $parser->parseFile($file);
// Extract text from PDF
$textContent = $pdf->getText();
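Before the extracted text is placed on a web page, it generally needs to be escaped, because a PDF can contain characters such as < and & that the browser would otherwise treat as markup. The following is a minimal sketch of that step. It is not part of the PDF Parser library: pdfTextToHtml() is a hypothetical helper name, and the only assumption is that $textContent is a plain string like the one returned by getText().

```php
<?php
// Hypothetical helper (not part of PDF Parser): convert extracted text to safe HTML
function pdfTextToHtml(string $text): string
{
    // Trim leading/trailing whitespace left over from the extraction
    $text = trim($text);

    // Escape HTML special characters so the text cannot inject markup
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Insert <br /> before each new line to keep the line structure
    return nl2br($escaped);
}

// Usage with a sample string standing in for $pdf->getText()
$textContent = "Price < 100 & rising\nSecond line";
echo pdfTextToHtml($textContent);
```

In a real page, the sample string would be replaced with the $textContent value extracted above.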
This example code snippet shows you the step-by-step process to upload PDF files and extract the text using PHP.
PDF File Upload Form:
Define HTML elements for file uploading form.
<form action="submit.php" method="post" enctype="multipart/form-data">
<div class="form-input">
<label for="pdf_file">PDF File</label>
<input type="file" name="pdf_file" id="pdf_file" placeholder="Select a PDF file" required="">
</div>
<input type="submit" name="submit" class="btn" value="Extract Text">
</form>
On form submission, the selected file is submitted to the server-side script for further processing.
Server-side Script (submit.php) to Extract Text from Uploaded PDF:
The following code is used to upload the submitted file and extract text from PDF.
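Get the file extension using the pathinfo() function with the PATHINFO_EXTENSION filter.
Retrieve the uploaded file path using tmp_name in $_FILES.
Replace each new line (\n) with a line break (<br/>) using the nl2br() function in PHP.
$pdfText = '';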
$statusMsg = '';
if(isset($_POST['submit'])){
    // If file is selected and was uploaded without errors
    if(!empty($_FILES["pdf_file"]["name"]) && $_FILES["pdf_file"]["error"] === UPLOAD_ERR_OK){
        // Uploaded file name and extension (lowercased so that ".PDF" is also accepted)
        $fileName = basename($_FILES["pdf_file"]["name"]);
        $fileType = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));

        // Allow certain file formats
        $allowTypes = array('pdf');
        if(in_array($fileType, $allowTypes)){
            // Include autoloader file
            include 'vendor/autoload.php';

            // Initialize and load PDF Parser library
            $parser = new \Smalot\PdfParser\Parser();

            // Source PDF file to extract text
            $file = $_FILES["pdf_file"]["tmp_name"];

            try{
                // Parse pdf file using Parser library
                $pdf = $parser->parseFile($file);

                // Extract text from PDF
                $text = $pdf->getText();

                // Escape HTML special characters, then add line breaks
                $pdfText = nl2br(htmlspecialchars($text));
            }catch(\Exception $e){
                // Parsing failed, e.g. the file is not a readable PDF
                $statusMsg = '<p>Sorry, the uploaded file could not be parsed as a PDF.</p>';
            }
        }else{
            $statusMsg = '<p>Sorry, only PDF files are allowed to upload.</p>';
        }
    }else{
        $statusMsg = '<p>Please select a PDF file to extract text.</p>';
    }
}

// Display status message and text content
echo $statusMsg;
echo $pdfText;
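The extension check above relies on the file name, which is controlled by the client. As an optional extra safeguard, the first bytes of the uploaded file can be compared with the %PDF- signature that PDF files normally start with. The sketch below works under that assumption. isPdfFile() is a hypothetical helper name. The check is strict, so a PDF with stray bytes before its header would be rejected, and a passing file is not guaranteed to be parseable.

```php
<?php
// Hypothetical helper: check whether a file starts with the "%PDF-" signature
function isPdfFile(string $path): bool
{
    // Missing or unreadable files cannot be PDFs
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Read only the first five bytes of the file
    $header = fread($handle, 5);
    fclose($handle);

    return $header === '%PDF-';
}

// Usage: create two temporary files and check them
$pdfLike = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($pdfLike, "%PDF-1.4\n");
$notPdf = tempnam(sys_get_temp_dir(), 'txt');
file_put_contents($notPdf, "plain text");

var_dump(isPdfFile($pdfLike)); // bool(true)
var_dump(isPdfFile($notPdf));  // bool(false)

// Clean up the temporary files
unlink($pdfLike);
unlink($notPdf);
```

In the upload script, the helper could be called with $_FILES["pdf_file"]["tmp_name"] before the file is passed to parseFile().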
You can use this PDF parser library for various needs. Here are some advanced uses to further customize the PDF parser and text output.
Extract the text of a specific page from PDF:
// extract the text of a specific page (in this case the first page)
$text = $pdf->getPages()[0]->getText();
Extract the text of a limited amount of pages from PDF:
// extract text of a limited amount of pages. here, it will only use the first two pages.
$text = $pdf->getText(2);
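To extract a range of pages that does not start at the first page, the array returned by getPages() can be sliced and getText() called on each page in turn. The sketch below assumes only those two methods from the snippets above. getTextOfPages() is a hypothetical helper, and the stub objects merely stand in for real page objects so that the example runs without a PDF file.

```php
<?php
// Hypothetical helper: join the text of $length pages starting at index $offset.
// $pages is expected to be the array returned by $pdf->getPages().
function getTextOfPages(array $pages, int $offset, int $length): string
{
    $texts = [];
    foreach (array_slice($pages, $offset, $length) as $page) {
        // Each page object exposes getText(), as used in the snippet above
        $texts[] = $page->getText();
    }

    // Separate pages with a blank line
    return implode("\n\n", $texts);
}

// Stand-in page objects so the sketch runs without a PDF file
$makePage = function (string $text) {
    return new class($text) {
        private $text;
        public function __construct(string $text) { $this->text = $text; }
        public function getText(): string { return $this->text; }
    };
};
$pages = [$makePage('Page one'), $makePage('Page two'), $makePage('Page three')];

// Text of the second and third pages
echo getTextOfPages($pages, 1, 2);
```

With a parsed document, the call would look like getTextOfPages($pdf->getPages(), 1, 2).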
Extract metadata from PDF:
$metaData = $pdf->getDetails();
Array
(
[Title] => Brochure
[Producer] => Skia/PDF m94 Google Docs Renderer
[Pages] => 2
...
)
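Not every PDF defines every metadata field, so reading a key such as Title directly from the array may raise a warning when the key is missing. A minimal sketch of a defensive lookup follows. getDetail() is a hypothetical helper, the sample array simply mirrors the output shown above, and the array handling is an assumption in case a field holds more than one value.

```php
<?php
// Hypothetical helper: read one metadata value with a fallback
function getDetail(array $details, string $key, string $default = ''): string
{
    if (!isset($details[$key])) {
        return $default;
    }

    $value = $details[$key];

    // Assumption: if a field holds a flat array of values, join them into one string
    if (is_array($value)) {
        return implode(', ', $value);
    }

    return (string) $value;
}

// Sample data mirroring the getDetails() output above
$metaData = [
    'Title'    => 'Brochure',
    'Producer' => 'Skia/PDF m94 Google Docs Renderer',
    'Pages'    => 2,
];

echo getDetail($metaData, 'Title'), "\n";             // Brochure
echo getDetail($metaData, 'Author', 'Unknown'), "\n"; // Unknown
echo getDetail($metaData, 'Pages'), "\n";             // 2
```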