Mass update of Docs metadata

hankuoffroad · June 17, 2023, 1:16pm

Continuing the discussion from How to pull activity data from Pagure:

Problem description

I want to mass-update the following metadata by appending a defined category and tags to the file of each article. I can run the ls command on the cloned repository, but what filters/regex would extract the Docs metadata with one command?

:category: CATEGORY
:tags: TAG_01 TAG_02 TAG_03 ... TAG_n

Example of Docs metadata to be updated in bulk

filename	category	tag1	tag2	tag3
performing-administration-tasks-using-sudo.adoc	Administration	tutorial
postgresql.adoc	Installation	How-to	server
proc_setting-key-shortcut.adoc	Managing software	How-to		Troubleshooting
publish-rpm-on-copr.adoc	Upgrading	tutorial
qemu.adoc			server
raspberry-pi.adoc
reset-root-password.adoc
root-account-locked.adoc
samba.adoc

Check the number of target files

In the cloned repository, change to the “modules/ROOT/pages” directory and run the following command.

$ find . -type f -name "*.adoc" | wc -l

Rough process steps

1. List the files and metadata and save them to a CSV file
2. Update the metadata in the CSV file
3. List the files and metadata, loop over them, and append the text with echo "text" >> $file
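
The last step above could look roughly like the sketch below. It is only an illustration under assumptions: the CSV layout (filename,category,space-separated tags), the name metadata.csv, and the sample files are all hypothetical, and it does not check whether the end of the file is where the Docs tooling expects these attributes.

```shell
#!/usr/bin/env bash
# Sketch: append :category: and :tags: lines from a CSV to each listed file.
# Assumed (hypothetical) CSV layout: filename,category,space-separated tags
set -euo pipefail

workdir="$(mktemp -d)"          # scratch directory standing in for modules/ROOT/pages
cd "$workdir"

# Two sample articles without metadata.
printf '= PostgreSQL\n' > postgresql.adoc
printf '= QEMU\n' > qemu.adoc

# Sample CSV; qemu.adoc has a tag but no category.
cat > metadata.csv <<'EOF'
postgresql.adoc,Installation,How-to server
qemu.adoc,,server
EOF

while IFS=, read -r file category tags; do
    [ -f "$file" ] || continue          # skip rows whose file does not exist
    # Append a field only if it is non-empty and not already in the file.
    if [ -n "$category" ] && ! grep -q '^:category:' "$file"; then
        echo ":category: $category" >> "$file"
    fi
    if [ -n "$tags" ] && ! grep -q '^:tags:' "$file"; then
        echo ":tags: $tags" >> "$file"
    fi
done < metadata.csv

cat postgresql.adoc qemu.adoc
```

The grep -q guards make the loop safe to run twice, since a file that already has the attribute is left alone.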

Question on point 1 above

How would you list the file names and their metadata from Pagure, and download them for multiple files (more than 250), with a single command? The metadata is mostly empty.

ankursinha · June 20, 2023, 10:06am

One may be able to get the file list using the Pagure API, but to get the metadata, I expect one has to parse the file to grep it etc.?

I don’t know if one command would be enough, but a script that goes something like this would work?

# clone all repos using a for loop in a dir somewhere
git clone "$REPO_URL"

# use rg/grep to extract metadata to a file
category_result="$(rg -i '^:category:' -g '*.adoc' --no-heading -H | sed 's/:category://')"
category="$(echo $category_result | cut -d ':' -f2)"
category_fn="$(echo $category_result | cut -d ':' -f1)
tags="$(rg -i "^:tags:' -g $category_fn --no-heading  -I| sed -e 's/:tags: //' | tr ' ' ',')"
echo "$category_fn,$category,$tags" >> myfile.csv
..

# process csv, add to files as required

This only gets tags from each file that has a category, so the assumption is that each file must have a category if it has tags. The logic can also be reversed. If there’s no guarantee that files will have both, we would probably need to produce two CSV files, one for categories and one for tags, and then merge them column-wise using paste.
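
As an alternative to producing two files, a per-file loop avoids the assumption that tags imply a category. This is a rough sketch, not the script above: it uses sed instead of rg, matches the attribute names case-sensitively, and runs on made-up sample files in a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch: write one CSV row per .adoc file, whether or not category/tags exist.
set -euo pipefail

workdir="$(mktemp -d)"          # scratch directory with hypothetical sample files
cd "$workdir"

printf '= A\n:category: Administration\n:tags: tutorial\n' > a.adoc
printf '= B\n:tags: server How-to\n' > b.adoc       # tags only
printf '= C\n' > c.adoc                                # no metadata at all

: > myfile.csv                  # start with an empty CSV
for file in *.adoc; do
    # Print the value of the first matching attribute line, then quit.
    # The result is an empty string when the attribute is missing.
    category="$(sed -n '/^:category:/{s/^:category: *//p;q;}' "$file")"
    tags="$(sed -n '/^:tags:/{s/^:tags: *//p;q;}' "$file" | tr ' ' ',')"
    echo "$file,$category,$tags" >> myfile.csv
done

cat myfile.csv
```

Because the tags are split on spaces into their own columns, rows can have different numbers of fields, as in the example table at the top of the thread.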

What do you think?

hankuoffroad · June 24, 2023, 9:04pm

Thanks for your generous help. I’m better informed by your script. As a beginner in bash shell scripting, I get the following error when running the script. Could you advise how to correct it? Thanks in advance!

> line 15: unexpected EOF while looking for matching `)'

I ran the shellcheck utility to get a sneak peek at its bash syntax suggestions. Please see below.

line 9:
category_fn="$(echo $category_result | cut -d ‘:’ -f1)
^-- SC1009 (info): The mentioned syntax error was in this variable assignment.
^-- SC1078 (warning): Did you forget to close this double quoted string?

line 10:
tags="$(rg -i “^:tags:’ -g $category_fn --no-heading -I| sed -e ‘s/:tags: //’ | tr ’ ’ ‘,’)”
^-- SC1079 (info): This is actually an end quote, but due to next char it looks suspect.
^-- SC1073 (error): Couldn’t parse this command expansion. Fix to allow more checks.

line 15:

^-- SC1072 (error): Expected end of $(…) expression. Fix any mentioned problems and try again.

pboy · June 24, 2023, 10:42pm

Without having looked over the whole script, the number of " characters in this line doesn’t match. The "^: looks suspicious to me. It should probably be '^:tags:'
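
One way to check quote balance without running anything is bash -n, which only parses a script. The sketch below is an illustration on two made-up files, broken.sh and fixed.sh, that reduce the reported problem to a single assignment; the sample value of category_result is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: compare an unbalanced and a balanced assignment with `bash -n` (parse only).
workdir="$(mktemp -d)"

# Broken: the closing double quote after $(...) is missing.
cat > "$workdir/broken.sh" <<'EOF'
category_result="a.adoc:Administration"
category_fn="$(echo "$category_result" | cut -d ':' -f1)
echo "$category_fn"
EOF

# Balanced: every opening quote has a matching close.
cat > "$workdir/fixed.sh" <<'EOF'
category_result="a.adoc:Administration"
category_fn="$(echo "$category_result" | cut -d ':' -f1)"
echo "$category_fn"
EOF

# bash -n exits non-zero when the script cannot be parsed.
if bash -n "$workdir/broken.sh" 2>/dev/null; then broken_status=ok; else broken_status=syntax-error; fi
if bash -n "$workdir/fixed.sh" 2>/dev/null; then fixed_status=ok; else fixed_status=syntax-error; fi

echo "broken.sh: $broken_status"
echo "fixed.sh:  $fixed_status"

# Only the balanced version is actually run.
fixed_output="$(bash "$workdir/fixed.sh")"
echo "fixed.sh prints: $fixed_output"
```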

hankuoffroad · June 25, 2023, 2:54pm

I uploaded the script (before tinkering with it) to my personal project repo.

^[1]

My reference and holiday reading list
Bash shell scripting for beginners
The Linux Command Line by William Shotts

ankursinha · June 26, 2023, 8:42am

Yeh, that should do it.

ankursinha · June 26, 2023, 9:22am

I think maybe a bash script is not the right tool for this. Things like quoting and multiline strings are tricky to get right. I’ll tinker with it a little more, but may end up resorting to Python or something if a bash script is too complex.

ankursinha · June 26, 2023, 9:50am

I opened a PR with a much better solution:

It writes the category and tag lists to separate files, and then we use join to combine them column-wise, keyed on the file name field.
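
For readers without access to the PR, the idea can be sketched as follows. This is not the script from the PR: the file names categories.csv, tags.csv and merged.csv, the use of grep and sed, and the sample articles are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: extract categories and tags to separate files, then join on the file name.
set -euo pipefail
export LC_ALL=C                 # sort and join must agree on the collation order

workdir="$(mktemp -d)"          # scratch directory with hypothetical sample files
cd "$workdir"

printf '= A\n:category: Administration\n:tags: tutorial\n' > a.adoc
printf '= B\n:tags: server\n' > b.adoc              # tags only
printf '= C\n:category: Upgrading\n' > c.adoc       # category only

# grep -H prints "file:matching line"; sed rewrites that to "file,value".
# join needs both inputs sorted on the join field (the file name).
grep -H '^:category:' -- *.adoc | sed 's/::category: */,/' | sort -t, -k1,1 > categories.csv
grep -H '^:tags:' -- *.adoc | sed 's/::tags: */,/' | sort -t, -k1,1 > tags.csv

# -a 1 -a 2 keeps file names that appear in only one input.
# -o 0,1.2,2.2 prints the file name, the category, then the tags,
# leaving a column empty when that input has no row for the file.
join -t, -a 1 -a 2 -o 0,1.2,2.2 categories.csv tags.csv > merged.csv

cat merged.csv
```

Unlike paste, which pairs lines purely by position, join matches rows by key, so a file missing from one list does not shift the remaining rows.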

hankuoffroad · June 26, 2023, 11:01am

I retyped the line that shellcheck flagged with SC1112 (warning): This is a unicode quote. Delete and retype it (or ignore/doublequote for literal).
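
Typographic quotes like these can be found before running a script. The following is a small sketch on a made-up file, pasted.sh, standing in for a script copied from a web page; it only lists the affected lines and changes nothing.

```shell
#!/usr/bin/env bash
# Sketch: list lines containing typographic quotes, which the shell treats as plain text.
workdir="$(mktemp -d)"
script="$workdir/pasted.sh"     # hypothetical script copied from a web page

cat > "$script" <<'EOF'
echo "plain quotes are fine"
echo “curly double quotes are literal text”
cut -d ‘:’ -f1
EOF

# -F matches the characters literally; -n prefixes each hit with its line number.
hits="$(grep -nF -e '“' -e '”' -e '‘' -e '’' "$script")"
echo "$hits"
```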

The CSV file is now created successfully. Thanks.

The CSV file shows a duplicated entry for the using-yubikeys.adoc file: there are two lines of data for it. I’ll also try different regex parameters to get the right result for the target files.
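
Duplicated file names in the CSV can be listed with standard tools. The sketch below runs on invented sample rows, not the real output, and only reports duplicates. One plausible cause, not confirmed in this thread, is a file that contains the same attribute line more than once.

```shell
#!/usr/bin/env bash
# Sketch: report file names that occur on more than one CSV row.
workdir="$(mktemp -d)"
csv="$workdir/myfile.csv"       # hypothetical sample, not the real output

cat > "$csv" <<'EOF'
samba.adoc,Installation,server
using-yubikeys.adoc,Administration,How-to
using-yubikeys.adoc,Administration,tutorial
EOF

# Take the first column, sort it, and let uniq -d print each repeated name once.
duplicates="$(cut -d, -f1 "$csv" | sort | uniq -d)"
echo "Duplicated file names:"
echo "$duplicates"

# Read the file twice: first pass counts names, second pass prints
# the full rows whose name was seen more than once.
rows="$(awk -F, 'NR == FNR { count[$1]++; next } count[$1] > 1' "$csv" "$csv")"
echo "Rows to compare:"
echo "$rows"
```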

hankuoffroad · June 26, 2023, 11:28am

Your new bash script works! Six files are created. A big thank you.
