TLDR: How can I best preserve old CD audio quality while mounting those CDs as read-only?
I volunteer in a library and we get a lot of non-DRM, audio-only CDs donated. These include everything from old manuals to children’s songs. We’re still exploring how best to offer them to visitors.
I’m currently helping with the possible digitization. We’ve largely coalesced around the idea of re-encoding everything to FLAC so that we keep the audio quality intact for archival purposes.
An additional but small consideration is that some of these CDs are user-edited and may contain personal files as well as malware. I’m not concerned about a targeted attack here, but I really don’t want to endanger the library in any way.
The command I’m using (as part of a script that only checks for and converts WAV files) here is:
Is there anything extra I need to preserve quality?
I also added the metadata removal option because these CDs generally seem to contain WAV files whose metadata I can’t reach. Besides, the metadata of old CDs seems to be an attack vector as well.
The main reason I wanted to post here is to ask whether Fedora mounts all CDs as read-only and, if not, how I can make it so.
Any audio-related malware doesn’t seem able to survive the removal of metadata in addition to re-encoding. Is there a way I can also mount and unmount the CDs more securely? Again, the danger there is slim but sadly not zero, and multiple people suggested that they may have edited their CDs but couldn’t remember which.
Note: This isn’t a topic I’m familiar with so please feel free to offer any additional advice.
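The exact command from the question did not survive in this thread, so the following is only a minimal sketch of what a WAV-only conversion step could look like, assuming ffmpeg is installed. The function names and directory layout are hypothetical and not the poster's script. `-map_metadata -1` drops the input's global metadata and `-c:a flac` selects ffmpeg's FLAC encoder, which is lossless, so no extra quality options should be needed.

```python
import subprocess
from pathlib import Path


def flac_command(wav_path: Path, out_dir: Path) -> list[str]:
    """Build an ffmpeg command that re-encodes one WAV file to FLAC.

    Hypothetical helper: the name and output layout are illustrative only.
    """
    out_path = out_dir / (wav_path.stem + ".flac")
    return [
        "ffmpeg",
        "-nostdin",              # never wait for keyboard input in a batch job
        "-i", str(wav_path),
        "-map", "0:a",           # keep audio streams only
        "-map_metadata", "-1",   # drop global metadata from the input
        "-c:a", "flac",          # lossless FLAC encoder
        str(out_path),
    ]


def convert_wavs(src_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every *.wav directly inside src_dir and ignore all other files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for wav in sorted(src_dir.iterdir()):
        if wav.is_file() and wav.suffix.lower() == ".wav":
            cmd = flac_command(wav, out_dir)
            subprocess.run(cmd, check=True)  # raises if ffmpeg reports an error
            converted.append(Path(cmd[-1]))
    return converted
```

Passing the command as a list rather than a shell string means odd characters in file names on a donated disc are never interpreted by a shell.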
A CD by design is read only unless you have one of the rewritable disks. Even those are read only after they have been closed. You really are not concerned about a CD (or DVD) being writable since you are only reading the content.
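If you want to confirm how a data disc actually ended up mounted, one option is to read the mount table. This is a rough sketch that parses text in the `/proc/mounts` format; the function names and the example mount point are made up for illustration. Note that a pure audio CD has no filesystem and is normally not mounted at all, so this only applies to discs that carry files.

```python
from typing import Optional


def mount_options(mounts_text: str, mount_point: str) -> Optional[set]:
    """Return the option set for mount_point, or None if it is not listed.

    Expects /proc/mounts format: device mount_point fs_type options dump pass.
    Spaces inside mount points appear escaped as \\040 in that file.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return set(fields[3].split(","))
    return None


def is_safely_mounted(mounts_text: str, mount_point: str) -> bool:
    """True if the mount is read-only and also blocks exec, setuid and device files."""
    opts = mount_options(mounts_text, mount_point)
    wanted = {"ro", "noexec", "nosuid", "nodev"}
    return opts is not None and wanted <= opts
```

On a real machine you would pass in `open("/proc/mounts").read()` and the disc's mount point. When mounting by hand, the same options can be requested with something like `mount -o ro,noexec,nosuid,nodev /dev/sr0 /mnt/cd`, where `/dev/sr0` and `/mnt/cd` are assumptions about the drive and target directory.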
The concern about personal files may be valid but it would be difficult to verify that without actually scanning the content to identify it.
Malware mostly could be detected by scanning the content of the disk before converting it.
My suggestion would be to scan the disks for malware then simply convert the files into an isolated (quarantine) system. After verifying the content of the file it could be released from quarantine. Yes, it may be a lot of work and consume a bit of time, but if you want to avoid potential issues it probably needs to be done. The content of each file needs to be identified anyway so it may be properly organized for access.
Sound Juicer will easily rip your CDs to FLAC. It will automatically try to download metadata. As you are in a library setting, metadata is very important for searching your content later.
It will only copy and convert WAV to FLAC (or mp3 or other). There is no risk of downloading malware. Nor is there a risk of downloading malware with FFMPEG.
The only way you could transfer malware is if you copy executable (for example .exe or .sh) files from your CDs.
I think that what you are worried about is a CD auto-executing an attack. This is possible under Windows, but very unlikely on Linux. I would not worry, there are many other vectors to attack a library (though what type of person would do that I do not know).
Could you please elaborate on this point? Our initial plan was to set aside an older computer with a CD drive, disconnect it from the network and slowly digitize stuff there. Then the person who completed that day’s work, if they hadn’t encountered any issues, would tar/zip those CD directories and add them to the archive through a USB stick or something. Would that be enough or would you suggest something else?
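The daily hand-off described above could be sketched roughly like this: pack only the FLAC output into one tar file and record a SHA-256 checksum, so whoever imports the USB stick can confirm the archive arrived unchanged. The function names and the one-folder-per-day layout are assumptions for illustration, not an established workflow.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pack_day(work_dir: Path, archive_path: Path) -> str:
    """Tar up one day's FLAC files and return the archive's SHA-256 checksum.

    archive_path should live outside work_dir. The tar is left uncompressed
    because FLAC data is already compressed.
    """
    with tarfile.open(archive_path, "w") as tar:
        for item in sorted(work_dir.rglob("*.flac")):
            # Store paths relative to work_dir; anything that is not .flac is skipped.
            tar.add(item, arcname=str(item.relative_to(work_dir)))
    return sha256_of(archive_path)
```

The checksum can be written on the day's log sheet or into a small text file next to the archive, then recomputed with `sha256_of` on the receiving machine before the files are added.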
I’m mostly just worried about accidents or unintentional harm. It’s a library and we don’t have a lot of resources aside from old hardware, so if there ever is a targeted attack, it’ll likely succeed anyway. Still, we’re trying to do our best to minimize risk.
I’ll check this out, thanks for the suggestion. Metadata is very important but I’m pessimistic about how reliably it may retrieve it. I’ll test it out and post an update in this post or thread.
Update: sound-juicer uses MusicBrainz to retrieve metadata, but the CDs worked better with CDDB-based software like fre:ac, possibly because they’re all very old.
Based on what you already noted as concerns, I might add two things: 1.) the need to address the possibility of personal information in some of the audio, and 2.) the need to add metadata to simplify organization of the media. Other than that, your plan seems exactly what I may have recommended or planned myself.
There are apps that can add the metadata (even if locally designed) to files, but I have not used them.
I know that in the past I used k3b for ripping media and IIRC it had the ability to edit the metadata while converting the media.
I sadly cannot think of anything else other than only taking .wav files, discarding everything else and a very brief check of the outputs from the archiver for the day. Everything else seems either too complicated or time-consuming and we have a considerable backlog as it is. We do our best to inform people of our circumstances so I think there isn’t an ethical issue here. If you have any ideas or concerns, I’d be happy to hear them.
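The "only take .wav files" rule could be made a little stricter by checking that each candidate really parses as WAV instead of trusting the extension. Below is a minimal sketch using Python's standard `wave` module; the function names are hypothetical. It only shows that the file has a readable PCM WAV header and some audio frames, so it is a sanity filter and not a malware scan.

```python
import wave
from pathlib import Path


def is_pcm_wav(path: Path) -> bool:
    """True if the file opens as a PCM WAV and contains at least one audio frame."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        # Not a RIFF/WAVE file, truncated, or unreadable: reject it.
        return False


def accepted_files(src_dir: Path) -> list[Path]:
    """List files named *.wav that also pass the header check; skip everything else."""
    return [
        p for p in sorted(src_dir.rglob("*"))
        if p.is_file() and p.suffix.lower() == ".wav" and is_pcm_wav(p)
    ]
```

Anything that is not in the returned list is simply never touched by the converter, which keeps the daily review limited to the output files.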
So, @theprogram recommended sound-juicer in this thread but it isn’t a good fit because the metadata for these old CDs is occasionally found on CDDB and not on MusicBrainz. fre:ac worked a bit with my limited sample and has a Flathub package, but that one can’t retrieve from MusicBrainz. k3b seems able to query both sources, but my only Linux machine runs Silverblue, which apparently can’t run it properly.
I’m able to open it but can’t reach the GUI. I’ll start a separate thread for that.
Edit: The thread for that is here, in case anyone is interested.
Hey @computersavvy, you mentioned that you once used k3b for ripping media. Were you able to output the audio files as FLAC? It seems that the program in its basic form cannot, and would require the libk3b8-extracodecs package on Debian-based distributions. I haven’t found the library for Fedora and wanted to ask you in case you had some experience here.
The plugins for FLAC and others appear to be installed; however, the output options only show Wave, MP3 and Vorbis.
A quick search for “use k3b to output flac audio” returns several related links.
I have never output flac, since I only use mp3 (lossy) and ogg (much better).
I think that ffmpeg is superior in every way except metadata lookup. Though from the ffmpeg man page I see:
Stream copy
Stream copy is a mode selected by supplying the "copy" parameter to the -codec option. It makes ffmpeg omit the decoding and encoding
step for the specified stream, so it does only demuxing and muxing. It is useful for changing the container format or modifying
container-level metadata. The diagram above will, in this case, simplify to this:
 _______              ______________            ________
|       |            |              |          |        |
| input |  demuxer   | encoded data |  muxer   | output |
| file  | ---------> | packets      | -------> | file   |
|_______|            |______________|          |________|
Since there is no decoding or encoding, it is very fast and there is no quality loss. However, it might not work in some cases because
of many factors. Applying filters is obviously also impossible, since filters work on uncompressed data.
-metadata[:metadata_specifier] key=value (output,per-metadata)
Set a metadata key/value pair.
An optional metadata_specifier may be given to set metadata on streams, chapters or programs. See "-map_metadata" documentation for
details.
This option overrides metadata set with "-map_metadata". It is also possible to delete metadata by using an empty value.
For example, for setting the title in the output file:
ffmpeg -i in.avi -metadata title="my title" out.flv
To set the language of the first audio stream:
ffmpeg -i INPUT -metadata:s:a:0 language=eng OUTPUT
There are other hints about metadata in that man page as well.
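Putting the two man page excerpts together, tags could be added to an already-converted FLAC file without re-encoding it. The sketch below only builds the command line; `retag_command` is a hypothetical helper name and the tag values in the usage line are made up. It relies on the `-c:a copy` and `-metadata key=value` options quoted above.

```python
def retag_command(src: str, dst: str, tags: dict[str, str]) -> list[str]:
    """Build an ffmpeg command that copies the audio untouched and sets tags."""
    cmd = ["ffmpeg", "-nostdin", "-i", src, "-map", "0:a", "-c:a", "copy"]
    for key, value in tags.items():
        # One -metadata option per key=value pair, as in the man page excerpt.
        cmd += ["-metadata", f"{key}={value}"]
    cmd.append(dst)  # ffmpeg cannot edit in place, so dst must differ from src
    return cmd
```

For example, `retag_command("in.flac", "out.flac", {"title": "my title"})` yields a list that can be handed to `subprocess.run(..., check=True)`. Because the stream is copied, the audio data is not decoded or re-encoded, so the manual typing can happen long after the rip.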
I like ffmpeg too. Besides, I was a bit unsure how we would use a database lookup once the secure extraction part was done on a disconnected computer.
I guess I’ll have to weigh the relatively simple fre:ac against the very capable but somewhat more burdensome ffmpeg. Then we’ll decide as a group and carry on from there. Thank you for the input.
What types of metadata are you wanting to capture?
Album names and track data from commercial CDs?
Or date of creation, name of contributor, and file titles currently on the CD?
If you want to capture album names and track data from commercial CDs, you will have to be online to connect to a music database.
Yet you want to be offline for security. I think in this use case there is no problem with being online.
If you want to capture metadata from home-made CDs, the info will already be on the disc, and not retrievable from online databases.
You also have the option of ripping CDs to .ISO files, so you capture an exact copy of the whole CD for archival retrieval later. If your goal is to make music accessible, this is not a good idea. If your goal is to archive for archaeological purposes this is a good idea.
If you want to use FFMPEG, someone here will help you get the right commands. Jeff is recommending you search and try and find what you need first, but if you are unsure or have the wrong output, someone will help check. Once you have the right command it is easy.
Take your time to get what you need. If you are archiving a large collection and feeling burnt out by time constraints, you will be burned more by getting your methodology wrong than by taking another week to get it right.
Let us know more about the collection. How are you going to store it long term? HDD, server RAID, Blu-Ray, M-disc?
Who will have access?
Is it commercial music or home-made music?
Will you upload metadata to a library catalogue?
Sorry for the late reply, I was very busy since my last post here. Thanks a lot for taking the time to ask for clarification in such a detailed answer. I’ll try to respond in kind.
This is a small city library. We get funding for basic stuff from the city but many of the activities are also volunteer supported. Our goal is to serve our community to the best of our resources; we don’t have the time or resources to contribute to national archives or something like that. So, no ISOs.
I am one of the tech-related volunteers. My responsibility is mostly confined to helping organize efforts around data (or media?) input. This ranges from finding details for books to adding old CDs and cassettes to our archive. How that data is preserved and presented is something the other volunteers are concerned with, though I do sometimes help with other stuff like basic security/accessibility checks. Before I was onboarded, there were several Windows 7 computers. I barely convinced staff to switch to Fedora and Ubuntu. The other volunteers agreed, and today a lot of children meet Linux through our computers and I sometimes even give brief courses.
This isn’t my responsibility but just to satisfy your curiosity, the current system is a very basic 3-2-1 rule implementation with restic. We want to expand and make it more efficient in the future but not in this year or the next. The only thing relevant here is that I plan on adding tracks as FLAC files to preserve quality while saving space.
Again, just to satisfy your curiosity, our current system works to serve material to our visitors and registered teachers in the city. We also want to expand this further and combine our efforts with other libraries but again, not in the near future.
Okay, so this is my purview and I would love to continue the discussion here. All CDs are outwardly commercial types like music, audio books, radio compilations, etc. They all have nice explanations and designs on the CDs. The slight problem is that some of them have been repurposed by users and we can’t tell them apart without checking.
We only accept home-made CDs or any other user-created media in specific formats like cloud links and that is handled by someone else. That person also takes in stuff like local school magazines and does a great job handling direct user contributions. That stuff also has its own criteria so it isn’t relevant here.
We only care about basic stuff like the date, who it is from and the track names. Basic metadata is indeed crucial because we wish to make all media accessible to our visitors. Otherwise people, children especially, don’t interact with older media.
This is the crux of the matter. We can’t outwardly tell unedited and edited commercial CDs apart. We’ve already seen some odd stuff here and paused digitization until we could figure out a better workflow. People who donate aren’t malicious (there can be rare exceptions that prove the rule, not my concern here) but can often forget stuff or put files on seemingly normal CDs that… they probably shouldn’t have. Some degree of review is always required.
My feeling was that if there is a danger from commercial CDs (such as a malicious file in a user-added folder or a user-modified track) that could threaten the chopstick IT infrastructure we built in the library, it might be worth it to just add the meager metadata we want manually, which is time-consuming but certainly not difficult with existing software like fre:ac. I estimate that at least a third of our CDs don’t have metadata to be retrieved anyway, and that also decreases the cost of doing it all manually. Setting aside a computer and doing it this way is trivial compared to how bad it would be if some stupid ransomware locked our archive or, worse, if we inadvertently spread malware ourselves.
Are you suggesting that we may as well digitize our commercial CD collection since the risk is so low? Is there no chance that a CD mounted in Fedora can create a problem without the user executing files from the CD?
Most of that media should have metadata on CDDB, so ripping it on a system that has internet access would help.
Also be aware that a commercially produced CD/DVD is fixed for content and usually cannot be modified in any way. If you have some that have been repurposed, then it is likely home-generated content and not an official commercially distributed disk. It may have a home-printed label that can appear to be commercial, but the content would not likely be purely commercial.
With that said, it is quite possible to make accurate copies of the original disk as well.
I can tell the difference between commercial pressed discs and home-made ones. The pressed discs have more of a plastic cover on the underside; the home-made ones are more silvery or brighter blue. Take some known discs and compare them. It is quite easy once you know what to look for.
There is practically zero chance of malware being on a commercial pressed disc. I would separate them into two piles, and definitely do online metadata for the commercial discs. It is of course impossible to edit or add files to a pressed disc.
Home made discs are either an exact copy, and you may be able to access online metadata in this case - or it is a user compiled track listing and you will not be able to get any metadata so it will be an offline manual typing project.
I was able to check today, and I could get a more consistent reflection from commercial CDs while the edited ones were more colorful. Is that what you were pointing towards?
If not, could you please refer me to some YouTube video or a blog post with pictures that explain the difference? I tried checking myself but most content is so old that links often don’t work or pictures don’t load.
Even unenthusiastic staff are big believers in showing visitors that something other than Windows and Macs exists, so really, it’s a team effort.
Note: I realized that I omitted a detail which may be coloring your optimism about metadata retrieval. I’m not from an English-speaking country, and our media is not so consistently archived. Of course, “pressed CDs” not posing a risk is still great news and we’ll gladly use existing software.
That is one way, though for me a more consistent way is to check the labels carefully. A paper label is sure-fire home labeled. A printed label on a disk that is not professional quality with often a white background is also sure-fire home grown.
I used to copy some CDs for games and such, and even with a very good scanned image of the original label it was impossible to create a perfect copy of the disk label. It either had to be printed on a paper or plastic label with a laser printer and then applied to the disk, or printed directly on the disk with an inkjet printer. In both cases it was easy to tell the label was not original quality.