Customize pdfgrep output

yifan · October 4, 2020, 4:21pm

I am trying to use pdfgrep to search a keyword among multiple pdf files within a folder. My command is this:

pdfgrep -in keyword *.pdf

This command will output the pdf file names and page numbers on the terminal. I would like each pdf file name to only show once if its contents contain the keyword. That will make it easier to read.

Also, I would like the output to be in a txt file in the current directory instead of showing on the terminal.

computersavvy · October 4, 2020, 5:44pm

Try piping the output to some other utilities. I believe the following should work.

pdfgrep -in keyword *.pdf | sed 's/\.pdf.*/.pdf/' | sort --unique

This worked for me in a directory that had many pdf files, many of which had several repetitions of the keyword.

Writing the output to a file is as simple as redirecting it with “command > file”.
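
To see what the sed and sort stages of the pipeline above do, here is a minimal sketch that runs them on made-up sample lines instead of real pdfgrep output. It assumes that pdfgrep -n prints lines shaped like "file.pdf:page:text". The helper name unique_pdf_names and the file names are hypothetical, and the sed pattern is anchored on ".pdf:" rather than ".pdf" as a small variation.

```shell
# Hypothetical helper: reduce "file.pdf:page:text" lines to unique file names.
unique_pdf_names() {
    # Drop everything after the ".pdf" that precedes the first ":" field,
    # then sort and remove duplicate names (-u is the short form of --unique).
    sed 's/\.pdf:.*/.pdf/' | sort -u
}

# Made-up sample lines standing in for `pdfgrep -in keyword *.pdf` output.
printf '%s\n' \
    'report.pdf:3: the keyword appears here' \
    'report.pdf:7: keyword again' \
    'notes.pdf:1: Keyword at the start' |
    unique_pdf_names
```

With real files, the printf part would be replaced by the pdfgrep command itself, and the result could be redirected to a file as described above.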

yifan · October 4, 2020, 6:12pm

This code worked for me:

pdfgrep -in keyword *.pdf | sed 's/\.pdf.*/.pdf/' | sort --unique > note.txt

However, it runs very slowly. Is there any way to speed it up?

computersavvy · October 4, 2020, 6:37pm

The amount of data determines the speed. Each line returned from pdfgrep is piped through both sed and sort, so just be patient.

Your definition of “very slow” may also be very different from mine.

laolux · October 5, 2020, 8:03am

OK, if you only want each PDF file name to show once, what about:

pdfgrep -i keyword *.pdf

This will of course not give you the page number, but might be faster because pdfgrep can stop scanning once it has found an occurrence.

Edit: I just checked. It seems to be faster, but only slightly in my test case.

yifan · October 6, 2020, 12:09am

This command returns file names repeatedly.

laolux · October 7, 2020, 2:55am

Yes, because I mistyped the command, sorry about that.

The correct command is:

pdfgrep -l keyword *.pdf

Note the -l instead of -i.
This command seems to return every matching PDF only once.

yifan · October 7, 2020, 3:14am

I will go with pdfgrep -il keyword *.pdf to ignore case.
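
Combining this with the redirect from earlier in the thread, a minimal sketch of a reusable wrapper could look like the following. The function name list_matching_pdfs and its argument order are hypothetical; only the -i and -l options and the "> file" redirect come from the thread.

```shell
# Hypothetical wrapper: write the names of PDFs containing a keyword to a text file.
# Usage: list_matching_pdfs KEYWORD OUTFILE FILE...
list_matching_pdfs() {
    keyword=$1
    outfile=$2
    shift 2
    # -i ignores case, -l prints each matching file name once.
    # The remaining arguments are the PDF files to search.
    pdfgrep -il "$keyword" "$@" > "$outfile"
}

# Example call, commented out so the snippet is safe to paste:
# list_matching_pdfs keyword note.txt *.pdf
```

Because -l already prints each file once, the sed and sort stages are no longer needed.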

