Adding OCR info to a PDF












23















I have a good-quality scan of a document, stored as a PDF file.

How can I add OCR information to the PDF so that it becomes searchable? By searchable I mean that when I view the PDF in Evince, Ctrl+F actually lets me search its content.



































  • Duplicate? askubuntu.com/questions/16268/…

    – Jakob
    Jun 7 '12 at 9:04






  • 4

    @Jakob, I don't think it's a duplicate; we are asking different things. The other question is about extracting text from a PDF (i.e. generating corresponding txt files), while mine is about modifying the PDF to add OCR information so that the search function in the PDF reader works. I'll clarify the question.

    – fdierre
    Jun 7 '12 at 10:17













  • How, and what did you use to scan the document?

    – Mitch
    Jun 7 '12 at 11:05











  • @Mitch I used my office Ricoh Aficio MP-C2500 printer/copier/scanner, which has a very nice document feeder. :-)

    – fdierre
    Jun 7 '12 at 12:06











  • Scanning and/or OCR Software?

    – Mitch
    Jun 7 '12 at 12:18
















pdf scanning ocr














asked Jun 7 '12 at 8:56 by fdierre; edited Jun 7 '12 at 10:19 by fdierre























6 Answers


















15














pdfsandwich



It does what you want and provides Ubuntu deb packages, and it uses Tesseract as its OCR engine. The following call adds a text layer to your scanned PDF:



pdfsandwich scanned.pdf


The following does the same with another language (ISO 639-2 code; install the matching tesseract-ocr-LANGCODE package) and an explicit layout setting:



pdfsandwich -verbose -lang spa -layout single scanned.pdf


If you get any errors, please download the latest deb from SourceForge.



Disclaimer: I'm the developer of pdfsandwich and therefore obviously biased.
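For several scans at once, the single-file call above can be wrapped in a loop. This is only a minimal sketch, not part of pdfsandwich itself: the helper names `ocr_output_name` and `ocr_all` are made up for illustration, and it assumes pdfsandwich's default behaviour of writing its result next to the input as NAME_ocr.pdf.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of pdfsandwich.

# Print the output name pdfsandwich is assumed to use by default: NAME_ocr.pdf.
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Run pdfsandwich on every PDF given, skipping files that are themselves
# OCR results or whose OCR result already exists.
ocr_all() {
    for f in "$@"; do
        case "$f" in
            *_ocr.pdf) continue ;;   # already a result file
        esac
        out=$(ocr_output_name "$f")
        if [ -e "$out" ]; then
            echo "skipping $f ($out exists)" >&2
            continue
        fi
        pdfsandwich "$f" || echo "pdfsandwich failed on $f" >&2
    done
}

# Usage: ocr_all *.pdf
```

Skipping existing results makes the loop safe to re-run after an interruption.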



























  • 6





    This is really great, thank you. However, it appears to modify the images, looks like it runs an unsharp mask over them or something. Is there a way to leave the images exactly as they previously were? In my particular instance, the filter even managed to remove the bar from a couple of fractions in some equations. Everything else works pretty well though...

    – naught101
    Feb 9 '15 at 2:47











  • Bad quality package: `Lintian check results for /tmp/pdfsandwich_0.1.3_amd64.deb: E: pdfsandwich: control-file-has-bad-permissions md5sums 0664 != 0644 E: pdfsandwich: control-file-has-bad-owner md5sums james/james != root/root E: pdfsandwich: wrong-file-owner-uid-or-gid usr/ 1000/1000 E: pdfsandwich: wrong-file-owner-uid-or-gid usr/bin/ 1000/1000 E: pdfsandwich: wrong-file-owner-uid-or-gid usr/bin/pdfsandwich ...

    – A.B.
    Apr 22 '15 at 5:55











  • Download the latest deb from SourceForge. If you get an error at the end, it might be related to Ghostscript (v0.1.4); v0.1.6 now uses pdfunite.

    – Pablo Bianchi
    Mar 9 '17 at 21:46






  • 1





    @PabloBianchi Is there any way to manually proofread the OCRed text when using pdfsandwich? I'm doing this with some Swedish documents, and it works well except for some misspellings (probably caused by the original's font). These would be easy to fix in a text file, but how can I fix them in the resulting PDF?

    – zrajm
    Jun 20 '17 at 15:44











  • @zrajm You can use some of pdfsandwich's parameters to improve recognition in the OCR step. To edit the hidden text behind the image in a PDF, you can edit the text-box layer with LibreOffice Draw, Inkscape, or any PDF editing tool. If you find a better way, please post it here. DaH jImej!

    – Pablo Bianchi
    Jun 21 '17 at 18:40





















7














There are two projects which do the trick: GScan2PDF and OCRFeeder







































    3














    I found a non-ideal solution, but a very effective one.



    I use PDF X-Change Viewer through Wine. It has an OCR feature which adds a text layer to the existing image-based pdf.



    Thus you can search and copy text from this invisible layer.
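Whichever tool adds the layer, one way to check that it is really there is to extract the text and see whether anything comes out. The following is a sketch under assumptions: it relies on `pdftotext` from poppler-utils being installed, the helper name `has_text_layer` is invented here, and the `PDFTOTEXT` override variable exists only in this sketch.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the PDF yields any letters or digits
# when its text is extracted. PDFTOTEXT may name another extractor that
# accepts the same "FILE -" arguments (an assumption of this sketch).
has_text_layer() {
    "${PDFTOTEXT:-pdftotext}" "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Usage:
#   has_text_layer scanned.pdf && echo searchable || echo "image only"
```

An empty result only shows that no text could be extracted; it does not say anything about how accurate an existing layer is.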


































      2














      For a command line solution, you can use pdfocr.



      In brief, install software:



      $ sudo apt-get install python-software-properties
      $ sudo add-apt-repository ppa:gezakovacs/pdfocr
      $ sudo apt-get update
      $ sudo apt-get install pdfocr


      Then run pdfocr:



      $ pdfocr -i scanned.pdf -o scanned.with.search.pdf


      That worked for me on Ubuntu 12.04 LTS.

























      • 6





        Github here: github.com/gkovacs/pdfocr. But this has the same issue as pdfsandwich, as it modifies/compresses PDFs containing highres images, basically destroying some of the original image information.

        – jmiserez
        Mar 21 '15 at 18:31





















      2














      OCRmyPDF is easy to set up and produces an output PDF with the same image quality as the input file at a reasonable size:



      https://github.com/jbarlow83/OCRmyPDF
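The answer gives no invocation, so here is a hedged sketch of a typical call. It assumes the `ocrmypdf` command is installed and on the PATH; the wrapper name `ocr_pdf` and its argument order are illustrative only, and the project's own documentation remains the reference for options.

```shell
#!/bin/sh
# Hypothetical wrapper around the ocrmypdf command-line tool.
# Arguments: input PDF, output PDF, optional Tesseract language (default: eng).
ocr_pdf() {
    in=$1
    out=$2
    lang=${3:-eng}
    # -l selects the OCR language; --skip-text leaves pages that already
    # contain text untouched instead of aborting on them.
    ocrmypdf -l "$lang" --skip-text "$in" "$out"
}

# Usage: ocr_pdf scanned.pdf scanned.with.search.pdf deu
```

Writing to a separate output file keeps the original scan available in case the result is not satisfactory.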






























      • I have had excellent results with your script. Unlike pdfocr by Geza Kovacs, it does not require any extra (hard to compile in some Linux distros!) libraries. Thank you!

        – Maxim
        May 3 '18 at 15:04



















      0














      This is my quick-and-dirty solution based on ImageMagick's convert, tesseract, parallel and pdftk (all available on Debian-based distributions). It's largely based on this blog post.



      #!/bin/sh -ex

      density=${2:-"300"} # default to 300 DPI if 2nd parameter is not given

      convert -monitor -density "$density" "$1" -monochrome -compress lzw -alpha deactivate page_%05d.tif
      parallel --bar "tesseract {} {.} pdf 2>/dev/null" ::: page_*.tif
      pdftk page_*.pdf cat output "${1%.*}-ocred.pdf" compress

      # Cleanup temp files
      rm page_?????.tif page_?????.pdf
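Because the script runs with `-e`, a missing tool aborts it part-way and can leave temporary page files behind. A pre-flight check along these lines may help; it is a sketch, and the function name `require_tools` is an assumption rather than part of the script above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: report every missing tool and return
# non-zero if at least one is absent.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, before the conversion starts:
#   require_tools convert tesseract parallel pdftk || exit 1
```

Reporting all missing tools in one pass avoids fixing them one failed run at a time.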





      – stefanct




















pdfsandwich answer: answered Jul 25 '14 at 13:27 by Tobias Elze; edited Mar 10 '17 at 4:03 by Pablo Bianchi








GScan2PDF and OCRFeeder answer: answered Jun 7 '12 at 21:24 by Aldi; edited Feb 19 '13 at 10:02 by Ashwin Nanjappa























PDF X-Change Viewer answer: answered Feb 19 '13 at 10:31 by To Do























                        2














                        For a command line solution, you can use pdfocr.



                        In brief, install software:



                        $ sudo apt-get install python-software-properties
                        $ sudo add-apt-repository ppa:gezakovacs/pdfocr
                        $ sudo apt-get update
                        $ sudo apt-get install pdfocr


                        Then run pdfocr:



                        $ pdfocr -i scanned.pdf -o scanned.with.search.pdf


                        That worked for me on Ubuntu 12.04 LTS.






                        answered Mar 23 '14 at 20:23









                        Robert Citek

                        • 6





                          GitHub repository: github.com/gkovacs/pdfocr. But this has the same issue as pdfsandwich: it modifies and compresses PDFs containing high-resolution images, destroying some of the original image information.

                          – jmiserez
                          Mar 21 '15 at 18:31
















                        OCRmyPDF is easy to set up and produces an output PDF with the same image quality as the input file at a reasonable size:



                        https://github.com/jbarlow83/OCRmyPDF
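The answer gives no command line, so here is a minimal sketch of how OCRmyPDF is typically invoked. It assumes the `ocrmypdf` command is already installed and on the PATH. The wrapper name add_text_layer, the OCR_LANG variable and the `-ocr.pdf` suffix are illustrative choices, not part of the tool.

```shell
#!/bin/sh
# Hypothetical wrapper: add a text layer with OCRmyPDF, writing a new file
# instead of overwriting the scan.
add_text_layer() {
    in=$1
    out=${2:-${in%.pdf}-ocr.pdf}   # default output name is an assumption
    # -l selects the Tesseract language pack (eng unless OCR_LANG is set);
    # --skip-text leaves pages that already contain text untouched.
    ocrmypdf -l "${OCR_LANG:-eng}" --skip-text "$in" "$out"
}
```

For example, `add_text_layer scanned.pdf` would be expected to produce `scanned-ocr.pdf` alongside the original.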






                        answered Nov 8 '17 at 16:47









                        user127022

                        • I have had excellent results with your script. Unlike pdfocr by Geza Kovacs, it does not require any extra (hard to compile in some Linux distros!) libraries. Thank you!

                          – Maxim
                          May 3 '18 at 15:04



















                        This is my quick and dirty solution based on ImageMagick's convert, tesseract, parallel and pdftk (all available on Debian-based distributions). It's largely based on this blog post.



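                        #!/bin/sh -ex
                        # Usage: <script> input.pdf [dpi]

                        density=${2:-"300"} # default to 300 DPI if 2nd parameter is not given

                        # Render every page to a monochrome, LZW-compressed TIFF (page_00000.tif, ...)
                        convert -monitor -density "$density" "$1" -monochrome -compress lzw -alpha deactivate page_%05d.tif
                        # OCR the pages in parallel; tesseract writes one searchable PDF per page
                        parallel --bar "tesseract {} {.} pdf 2>/dev/null" ::: page_*.tif
                        # Concatenate the per-page PDFs into <input>-ocred.pdf
                        pdftk page_*.pdf cat output "${1%.*}-ocred.pdf" compress

                        # Cleanup temp files
                        rm page_?????.tif page_?????.pdf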
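To check from the command line whether the result is actually searchable, one option is to extract the text layer with pdftotext (from poppler-utils) and look for a word you know is in the document. Evince also builds on poppler, so text found this way should be findable with CTRL-F as well, though that is an assumption worth confirming in the viewer. The function name has_text is invented for this sketch.

```shell
#!/bin/sh
# has_text FILE WORD: succeed if WORD occurs (case-insensitively) in the
# text that pdftotext can extract from FILE.
has_text() {
    # "-" makes pdftotext write the extracted text to standard output
    pdftotext "$1" - 2>/dev/null | grep -qiF -- "$2"
}
```

Something like `has_text scanned-ocred.pdf invoice && echo searchable` (file name and search word are placeholders) gives a quick yes/no answer.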





                            answered 7 hours ago









                            stefanct
