php – How to make a large xml file readable?

Question:

I have a 400 MB XML file that needs to be made human-readable (pretty-printed).

Unfortunately, all the console utilities I have found load the whole file into memory and cannot process it.

Does anyone have an example of a script or a console utility for such cases?

I tried xmllint, but it crashes after a few seconds:

xmllint --format --shell in.xml > tmp.xml

Thanks!

Answer:

  1. "How to make a large xml file readable?" If I understand correctly, the first step is to read the data from the file, the second is to generate the output, and the third is to write or display it.
  2. "preload the file into memory and cannot process my file" If I understand correctly, the main problem is that the file is too large and there is not enough memory to read it all at once.
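
The three steps in item 1 (read, generate, write) can also be done as a stream, so that only the current node is held in memory. Below is a minimal sketch of that idea using PHP's built-in XMLReader and XMLWriter extensions. The function name `prettyPrintXml`, the two-space indent and the flush interval are assumptions chosen for illustration. The sketch skips whitespace-only nodes, does not copy DOCTYPE declarations, processing instructions or entity references, and may add whitespace inside mixed content, so treat it as a starting point rather than a finished tool.

```php
<?php
// Hypothetical helper: re-indent an XML file node by node instead of loading it whole.
function prettyPrintXml(string $inputPath, string $outputPath): void
{
    $reader = new XMLReader();
    if (!$reader->open($inputPath)) {
        throw new RuntimeException("Cannot open $inputPath");
    }

    $writer = new XMLWriter();
    if (!$writer->openUri($outputPath)) {
        throw new RuntimeException("Cannot write to $outputPath");
    }
    $writer->setIndent(true);
    $writer->setIndentString('  ');
    $writer->startDocument('1.0', 'UTF-8');

    $nodeCount = 0;
    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                $writer->startElement($reader->name);
                // Must be read before the cursor moves onto the attributes.
                $isEmpty = $reader->isEmptyElement;
                while ($reader->moveToNextAttribute()) {
                    $writer->writeAttribute($reader->name, $reader->value);
                }
                if ($isEmpty) {
                    $writer->endElement(); // <tag/> has no separate end node
                }
                break;
            case XMLReader::END_ELEMENT:
                $writer->endElement();
                break;
            case XMLReader::TEXT:
                $writer->text($reader->value);
                break;
            case XMLReader::CDATA:
                $writer->writeCdata($reader->value);
                break;
            case XMLReader::COMMENT:
                $writer->writeComment($reader->value);
                break;
            // Whitespace-only nodes are dropped; the writer adds its own indentation.
        }
        // Push buffered output to disk regularly so the buffer stays small.
        if (++$nodeCount % 1000 === 0) {
            $writer->flush();
        }
    }

    $writer->endDocument();
    $writer->flush();
    $reader->close();
}
```

It would be called as `prettyPrintXml('in.xml', 'tmp.xml');` with the file names from the question.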

If I have misunderstood, please clarify and I will correct the answer. If I have understood correctly, I suggest the PHPOffice/PhpSpreadsheet library. It is the successor to PHPExcel, which was very slow but supported many formats, including XML. Why this library? It lets you read large files in portions. Note that its Xml reader targets spreadsheet XML (the Excel 2003 SpreadsheetML format), so this approach applies when the file holds tabular data in that format. Here is the library author's example of reading a workbook in chunks of rows:

    <?php

error_reporting(E_ALL);
set_time_limit(0);

date_default_timezone_set('Europe/London');

?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

<title>PHPExcel Reader Example #11</title>

</head>
<body>

<h1>PHPExcel Reader Example #11</h1>
<h2>Reading a Workbook in "Chunks" Using a Configurable Read Filter (Version 1)</h2>
   <?php

/** Load the PhpSpreadsheet classes. Assumes a Composer install; adjust the path or register your own autoloader. **/
require 'vendor/autoload.php';

$inputFileType = 'Xls';
//  $inputFileType = 'Xlsx';
//  $inputFileType = 'Xml';
//  $inputFileType = 'Ods';
//  $inputFileType = 'Gnumeric';
$inputFileName = './sampleData/example2.xls';

/**  Define a Read Filter class implementing \PhpOffice\PhpSpreadsheet\Reader\IReadFilter  */
class chunkReadFilter implements \PhpOffice\PhpSpreadsheet\Reader\IReadFilter
{
    private $_startRow = 0;

    private $_endRow = 0;

    /**
     * We expect a list of the rows that we want to read to be passed into the constructor.
     *
     * @param mixed $startRow
     * @param mixed $chunkSize
     */
    public function __construct($startRow, $chunkSize)
    {
        $this->_startRow = $startRow;
        $this->_endRow = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        //  Only read the heading row, and the rows that were configured in the constructor
        if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
            return true;
        }

        return false;
    }
}

echo 'Loading file ',pathinfo($inputFileName, PATHINFO_BASENAME),' using IOFactory with a defined reader type of ',$inputFileType,'<br />';
/*  Create a new Reader of the type defined in $inputFileType  **/
$reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader($inputFileType);

echo '<hr />';

/*  Define how many rows we want for each "chunk"  **/
$chunkSize = 20;

/*  Loop to read our worksheet in "chunk size" blocks  **/
for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
    echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ',$startRow,' to ',($startRow + $chunkSize - 1),'<br />';
    /*  Create a new Instance of our Read Filter, passing in the limits on which rows we want to read  **/
    $chunkFilter = new chunkReadFilter($startRow, $chunkSize);
    /*  Tell the Reader that we want to use the new Read Filter that we've just Instantiated  **/
    $reader->setReadFilter($chunkFilter);
    /*  Load only the rows that match our filter from $inputFileName into a Spreadsheet object  **/
    $spreadsheet = $reader->load($inputFileName);

    //  Do some processing here

    $sheetData = $spreadsheet->getActiveSheet()->toArray(null, true, true, true);
    var_dump($sheetData);
    echo '<br /><br />';
}

?>
</body>
</html>

I think $chunkSize = 20; is too small. If the file has 1,000,000 rows, it is quite feasible to take 25,000 rows at a time, which takes about 25-30 seconds to process. In essence: we implement the read-filter interface, write the check method (readCell), set $chunkSize to the number of rows read at a time, and set the first and last rows in the loop. The result comes back as an array: $sheetData = $spreadsheet->getActiveSheet()->toArray(null, true, true, true);
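
To make the loop bounds less error-prone than a hardcoded 240, the row ranges can be computed separately. This is a small self-contained sketch; the helper name `chunkRanges` and its parameters are hypothetical and not part of PhpSpreadsheet. It assumes row 1 is the heading row, as in the read filter in the sample.

```php
<?php
/**
 * Hypothetical helper: yield [startRow, endRow] pairs that cover every data row.
 * Row 1 is assumed to be the heading, so data starts at row 2 by default.
 */
function chunkRanges(int $totalRows, int $chunkSize, int $firstDataRow = 2): Generator
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('chunkSize must be at least 1');
    }
    for ($start = $firstDataRow; $start <= $totalRows; $start += $chunkSize) {
        // Clamp the last chunk so it never runs past the final row.
        yield [$start, min($start + $chunkSize - 1, $totalRows)];
    }
}

foreach (chunkRanges(240, 100) as [$startRow, $endRow]) {
    echo "rows $startRow to $endRow", PHP_EOL;
}
// prints:
// rows 2 to 101
// rows 102 to 201
// rows 202 to 240
```

In the sample loop, each $startRow would be passed to chunkReadFilter together with $chunkSize. If memory still grows between chunks, calling $spreadsheet->disconnectWorksheets() and unset($spreadsheet) before loading the next chunk may help, and the total row count can usually be obtained from the reader's listWorksheetInfo() method instead of being hardcoded.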

Added a minute later: as you can see, the file format is hardcoded, but it can also be detected automatically:

$inputFileType = \PhpOffice\PhpSpreadsheet\IOFactory::identify($arr['FileName']);
$reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader($inputFileType);
$reader->setReadDataOnly(true);

Added 2 minutes later:

To get the library running quickly, I downloaded the PhpSpreadsheet source folder, put it inside a PhpOffice folder, and registered an autoloader:

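    //ini_set('include_path', '/var/www/main_lib');
    //error_reporting(E_ALL);
    // __autoload() was removed in PHP 8; spl_autoload_register() is its replacement.
    spl_autoload_register(function ($class_name) {
        // Map the namespace to a path relative to include_path,
        // e.g. PhpOffice/PhpSpreadsheet/IOFactory.php
        $file = str_replace('\\', '/', $class_name) . '.php';
        // require_once raises a fatal error rather than an exception,
        // so check that the file can be resolved first.
        if (stream_resolve_include_path($file) === false) {
            echo 'err: failed to load class ' . $class_name . ', or include_path="/var/www/main_lib" is not set in php.ini';
            return;
        }
        require_once $file;
    });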

As you may have guessed, the library is in the /var/www/main_lib folder.
