Traditional Approach to PDF

Jun
Pre-processing
Mar 1, 2021 7:44:18 AM

To translate a PDF file, use the source InDesign file whenever it is available. InDesign can save an INDD file as an IDML file, which computer-assisted translation (CAT) tools can process. This way, the formatting is handled safely while the translation is done in the CAT tool. According to memoQ, their tool can even import INDD files directly.

Today's CAT tools can also import PDF files directly, but the resulting DOCX file (the Microsoft Word format that the CAT tool generates internally) will contain many extra line breaks. These line breaks need to be removed so that translators can work on the document efficiently. Otherwise, the translator has to use the join segment feature repeatedly while translating in the CAT tool.

So, if no source INDD file is available, or if Microsoft Word format is preferred as an intermediate file for some reason, here is a traditional, minimal pre-processing workflow for translators.

  1. Open the PDF in Adobe Reader. Select all and copy. This places the PDF content on the clipboard as rich text, although some formatting may be lost.
  2. Open Microsoft Word and paste the clipboard content. At this point, the Word file contains many unwanted line breaks. 
  3. Open Advanced Find and Replace in Word, and select the Use Wildcards option.
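  4. In the Find what box, enter [^l^13]([!A-Z]). In the Replace with box, enter a single space followed by \1. The pattern matches a manual line break (^l) or a paragraph mark (^13) followed by any character other than an uppercase letter. The replacement swaps the break for a space and keeps that character.
    [Screenshot: Line break replacement in Word]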
  5. Go through the document by clicking the Find Next button, and click the Replace button wherever the line break is not needed.

Other formatting issues may remain, such as extra page breaks, repeated header and footer texts, and inappropriate two-column paragraphs. However, the approach explained above can be a starting point to make the body of the document translatable.
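
For readers who prefer to script the clean-up, the wildcard replacement from step 4 can be approximated in Python. This is only a minimal sketch and not part of the Word workflow above. It assumes the copied text has already been saved as plain text with "\n" line endings. The function name join_wrapped_lines is made up for illustration. Unlike the Word pattern, it also leaves blank lines alone so that paragraph gaps survive.

```python
import re

# Rough counterpart of the Word wildcard pattern [^l^13]([!A-Z]):
# a line break followed by one character that is not an uppercase A-Z letter.
# Assumption (not in the Word pattern): a break that is next to another
# break is skipped, so blank lines between paragraphs are preserved.
SOFT_BREAK = re.compile(r"(?<!\n)\n([^A-Z\n])")


def join_wrapped_lines(text: str) -> str:
    """Replace each matched line break with a space, keeping the next character."""
    return SOFT_BREAK.sub(r" \1", text)


sample = (
    "To translate PDF files, try to use\n"
    "the source file when available.\n"
    "Otherwise, copy the text into Word."
)
print(join_wrapped_lines(sample))
```

The script replaces every match at once, whereas step 5 reviews each match by hand. A line break before a lowercase letter is not always unwanted (list items, for example), so the output still needs a visual check.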