LinuxQuestions.org - download and convert multiple youtube links

download and convert multiple youtube links

so i wanted to get music from youtube

this is the simple version: it gets the files one by one and only supports video urls (no playlists)
you need to have youtube-dl installed (latest version)
i got it from here https://packages.debian.org/sid/all/youtube-dl/download
just put your youtube urls in a file called "list0.txt", one per line
it downloads m4a (audio only) and then converts it to mp3

Code:

#!/bin/bash

START=$(date +%s)

echo "get links and download files one by one"
# read the urls from list0.txt; -f 140 selects the m4a audio-only format
youtube-dl -a list0.txt --no-warnings --no-check-certificate -f 140

echo "convert files one by one"
for i in *.m4a; do ffmpeg -i "$i" -acodec libmp3lame -aq 2 "${i%.*}.mp3"; done

END=$(date +%s)
DIFF=$(( $END - $START ))

echo time-is  $DIFF
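
The convert loop relies on `${i%.*}` to drop the extension before `.mp3` is added. A minimal sketch of just that expansion; `mp3_name` is a made-up helper for illustration, not part of the script:

```shell
#!/bin/bash
# mp3_name is a hypothetical helper, only here to show the expansion.
# ${1%.*} removes the shortest suffix matching ".*", i.e. the last extension.
mp3_name() {
    printf '%s\n' "${1%.*}.mp3"
}

mp3_name "some song.m4a"       # some song.mp3
mp3_name "live.2019.set.m4a"   # live.2019.set.mp3
```

Only the last extension is removed, so dots inside the title survive.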

this is the second version, it does stuff in parallel
for this you also need to install aria2 and gnu parallel

just put your youtube urls (videos/playlists/channels) in "list0.txt" one on each line

Code:

#!/bin/bash

 

echo log started - $(date) > g.log



START=$(date +%s)



youtube-dl -a list0.txt -j --flat-playlist --no-check-certificate  > list1.txt

cat list1.txt |sed 's/": "/__/g ; s/"/_/g; s/,/|/g ;'|grep -Po '_id__(.{11})_\|'|sort|uniq | sed 's/_id__//g;s/_|//g' \
 |sed -e 's/^/https:\/\/www.youtube.com\/watch?v=/'>list2.txt



END=$(date +%s)

DIFF=$(( $END - $START ))

echo get video ids from links - serial youtube-dl - $DIFF >> g.log



#=====================================



START=$(date +%s)



cat list2.txt|parallel -j8 youtube-dl "{}" --no-warnings --no-check-certificate --skip-download -q -f 140 --get-filename --get-url -o "\|\|%\(id\)s_%\(title\)s.m4a" \> \$\{RANDOM\}.ttt

cat *.ttt|sed 's/||/  out=/g'>d2.txt

rm *.ttt





END=$(date +%s)

DIFF=$(( $END - $START ))

echo get download links - parallel youtube-dl - $DIFF >> g.log



#=====================================



START=$(date +%s)



echo "download - parallel"

echo time = $(date)

aria2c --check-certificate=false -i d2.txt -j 4



END=$(date +%s)

DIFF=$(( $END - $START ))

echo download video/audio - aria2 - $DIFF >> g.log



#=====================================



START=$(date +%s)



echo "convert - parallel"

echo time = $(date)

find . -name "*.m4a" |parallel -j2  ffmpeg -i "{}" -acodec libmp3lame -aq 2 "{}.mp3"

 

END=$(date +%s)

DIFF=$(( $END - $START ))

echo convert audio/video - parallel ffmpeg - $DIFF >> g.log
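
The START/END/DIFF pattern repeats for every stage of the script. As a sketch, it could be wrapped in one function; `timed` and the `LOG` variable are names I made up, and the log line format just mirrors the `label - seconds` lines written to g.log:

```shell
#!/bin/bash
# timed is a hypothetical helper: run a command, then append
# "<label> - <seconds>" to the log file (g.log unless LOG is already set).
LOG=${LOG:-g.log}

timed() {
    local label=$1 start end status
    shift
    start=$(date +%s)
    "$@"                      # run the real command with its arguments
    status=$?
    end=$(date +%s)
    echo "$label - $(( end - start ))" >> "$LOG"
    return "$status"          # pass the command's exit status through
}

# Stand-in command here; in the script it would wrap aria2c, ffmpeg, etc.
timed "demo step" true
```

For example `timed "download video/audio - aria2" aria2c --check-certificate=false -i d2.txt -j 4` would replace one of the START/END blocks.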

the m4a/aac format is better quality and smaller in size, and most devices nowadays support it; the mp3 conversion takes some time, but mp3 is supported by pretty much any device

tell me what you think

new version 20190820

download the audio (m4a) of videos with more than 1_000_000 views from multiple channels, and convert it to mp3

Code:

#!/bin/bash





#list0.txt contains the list of channels, playlists or videos, one per line
#list1.txt contains the youtube-dl json output for those playlists/channels
#list2.txt contains all youtube ids found in the channels/playlists you put in list0.txt

#stop the script with ctrl+c (ctrl+z only suspends it)
#if you don't delete list2.txt the script will go through the same ids again, skipping those which already exist

#if list2.txt doesn't exist, fetch all channels in list0.txt and create list2.txt again

FILE=list2.txt

if [[ -f "$FILE" ]]; then

    echo "$FILE exist"

else

youtube-dl -a list0.txt -j --flat-playlist --no-check-certificate  > list1.txt

cat list1.txt |sed 's/": "/__/g ; s/"/_/g; s/,/|/g ;'|grep -Po '_id__(.{11})_'|sort|uniq | sed 's/_id__//g;s/_$//g' |grep -v \| >list2.txt

fi







#get ids of all downloaded files in the current directory

ex=$(ls -l|grep m4a|grep -Po '.{15}$'|grep -Po '^.{11}')







#check each file in list2.txt

while read -r p; do
    if [[ $ex == *"$p"* ]]; then
        #file is already downloaded
        echo "----- $p exists ----- "
    else
        #file doesn't exist: download it if the video has at least min-views views
        echo "$p not found, downloading"
        #plain 1000000 because 1_000_000 is only accepted on newer python versions
        youtube-dl "https://www.youtube.com/watch?v=$p" --min-views 1000000 -f 140
    fi
done <list2.txt







#convert every m4a file to mp3, -j4 means run 4 jobs at once, e.g. for a 4 core cpu

find . -name "*.m4a" |parallel -j4  ffmpeg -i "{}" -acodec libmp3lame -aq 2 "{}.mp3"



#mkdir mp && mv *mp3 mp/ && mkdir m4 && mv *m4a m4/
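
The `ls | grep` chain above takes the 11-character id from the end of each file name. Here is a sketch of the same idea with parameter expansion. It assumes youtube-dl's default naming, title then `-` then id then extension, and `id_from_file` and `have_id` are hypothetical names:

```shell
#!/bin/bash
# Hypothetical helpers; assume names like "Some Title-dQw4w9WgXcQ.m4a".
id_from_file() {
    local base=${1##*/}            # drop any directory part
    base=${base%.*}                # drop the extension
    printf '%s\n' "${base: -11}"   # last 11 characters = video id
}

# Succeeds if some *.m4a in the current directory ends with "-<id>.m4a".
have_id() {
    local f
    for f in ./*-"$1".m4a; do
        [[ -e $f ]] && return 0
    done
    return 1
}

id_from_file "./Some Title-dQw4w9WgXcQ.m4a"   # dQw4w9WgXcQ
```

In the loop, `have_id "$p" || youtube-dl ...` would then test one id at a time instead of matching against one long string.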

Quote:

tell me what you think

Code:

cat list1.txt |sed 's/": "/__/g ; s/"/_/g; s/,/|/g ;'|grep -Po '_id__(.{11})_'|sort|uniq | sed 's/_id__//g;s/_$//g' |grep -v \| >list2.txt

This is a lot of piping through sed. There's probably a way of shrinking that down to 1-2 commands. I don't know sed well enough, but it might be worth a post to ask about
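
One possible way to shrink it, as a rough sketch: a single sed capture plus `sort -u`. It assumes every line of list1.txt is one JSON object containing `"id": "<11 chars>"`, which is what the original pattern looks for. `extract_ids` is a made-up name.

```shell
#!/bin/bash
# extract_ids: read youtube-dl -j lines on stdin, print the unique ids.
# Assumes the id appears as "id": "XXXXXXXXXXX" (11 characters) on each line.
extract_ids() {
    sed -n 's/.*"id": "\([^"]\{11\}\)".*/\1/p' | sort -u
}

# Usage: extract_ids < list1.txt > list2.txt
```

Lines without a matching id are dropped because of `-n` together with the `p` flag.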

Code:

youtube-dl -a list0.txt -j --flat-playlist --no-check-certificate  > list1.txt

cat list1.txt |sed 's/": "/__/g ; s/"/_/g; s/,/|/g ;'|grep -Po '_id__(.{11})_'|sort|uniq | sed 's/_id__//g;s/_$//g' |grep -v \| >list2.txt

...

done <list2.txt

Have you considered process substitution? It can help get rid of files floating around

Instead of

Code:

find . > file1

while read i;

do

  echo "$i"

done < file1

you could write

Code:

while read i;

do

  echo "$i"

done < <(find .)
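
Besides saving the temp file, process substitution keeps the loop in the current shell, so variables set inside it are still there afterwards. With a plain `cmd | while ...` pipe, bash normally runs the loop in a subshell. A small sketch with stand-in input:

```shell
#!/bin/bash
# The loop body runs in this shell, so count keeps its value after "done".
count=0
while IFS= read -r line; do
    count=$(( count + 1 ))
done < <(printf '%s\n' one two three)

echo "$count"   # 3
```

`IFS= read -r` is used so leading spaces and backslashes in a line are kept as they are.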

Code:

find . -name "*.m4a" |parallel -j4 ffmpeg -i "{}" -acodec libmp3lame -aq 2 "{}.mp3"

Both find and parallel support using null separators. find with -print0 and parallel with --null or -0. This can reduce the chance for errors if a file shows up with a newline in it.
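
A sketch of the null-separator idea in plain bash, so it runs without parallel installed; `collect_m4a` and `m4a_files` are names I picked for the example:

```shell
#!/bin/bash
# Collect *.m4a paths into an array using NUL as the separator, so names
# containing spaces or newlines stay in one piece.
collect_m4a() {
    local dir=${1:-.} f
    m4a_files=()
    while IFS= read -r -d '' f; do
        m4a_files+=("$f")
    done < <(find "$dir" -name '*.m4a' -print0)
}

collect_m4a .
echo "found ${#m4a_files[@]} file(s)"
# With the tools from the thread the same idea would look roughly like:
#   find . -name '*.m4a' -print0 | parallel -0 -j4 ffmpeg -i {} -acodec libmp3lame -aq 2 {.}.mp3
```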

Code:

"{}.mp3"

Since you're using parallel, remove the extension with {.}

Code:

parallel echo {} {.} {.}.mp3 ::: test.mp4

test.mp4 test test.mp3

Code:

#mkdir mp && mv *mp3 mp/ && mkdir m4 && mv *m4a m4/

Just do the moving of files with ffmpeg / parallel

Code:

parallel ffmpeg -i {} mp3_dir/{.}.mp3 ::: *mp4

Anyways good stuff. youtube download scripts are fun

I threw this together

Code:

ytaudio() { parallel "youtube-dl -qx --audio-format 'mp3' -o '%(title)s.%(ext)s' --restrict-filenames {} && echo Processed {}" ::: "$@"; }

Usage:

Code:

ytaudio link1 [linkn] [playlistlinkn]

ytaudio https://www.youtube.com/watch?v=jItnCGRsMjw

Code:

# ytaudio() {              # We're defining a bash function here                                                            

#  parallel "                                                                  

#    youtube-dl                                                                

#    -q                    # Quiet mode                                        

#    -x                    # Extract as audio file                            

#    --audio-format 'mp3'                                                      

#    -o '%(title)s.%(ext)s'# Output Template == my_music.mp3                    

#    --restrict-filenames  # Remove spaces and special characters              

#    {}                    # Check out other replacement strings for this      

#  &&                      # If previous command succeeds, do this              

#  echo Processed {}"                                                          

#  :::                    # After this, specify files                        

#    "$@"                  # Takes command line input. ./cmd a b c

# ; }
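
Whether `$@` is quoted matters once a link contains spaces or glob characters such as `?`. A tiny sketch of the difference; `count_args`, `unquoted` and `quoted` are made-up names:

```shell
#!/bin/bash
# count_args is a hypothetical helper that prints how many arguments it got.
count_args() { echo "$#"; }

unquoted() { count_args $@; }     # word splitting and globbing apply
quoted()   { count_args "$@"; }   # each argument is passed through intact

unquoted "a b" c   # 3
quoted   "a b" c   # 2
```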

can you provide sample lists?

anyway, after a very quick look I just wanted to clean some things up

no need for cat
http://porkmail.org/era/unix/award.html

Code:

cat list1.txt |sed 's/": "/__/g ; s/"/_/g; s/,/|/g ;'|grep -Po '_id__(.{11})_'|sort|uniq | sed 's/_id__//g;s/_$//g' |grep -v \| >list2.txt

Code:

<list1.txt sed 's/": "/__/g ; s/"/_/g; s/,/|/g ;'

anyway, it looks like you want the 11 chars between _id__ and _
many ways to do this, here is one with sed

Code:

<list1.txt sed -n 's/.*__id_\([a-zA-Z0-9]\+\)_.*/\1/p'

what did I do?
well,
I substituted the whole line with what was matched inside the \( \)
you can save multiple patterns and reorder them
e.g.

Code:

# edit: this one didn't reorder
#<list1.txt sed -n 's/.*\(__id_\)\([a-zA-Z0-9]\+\)_.*/\1\2/p'
<list1.txt sed -n 's/.*\(__id_\)\([a-zA-Z0-9]\+\)_.*/\2\1/p'

so

some junk before __id_gu7d54djkl_ more junk at end
outputs
gu7d54djkl__id_

looking back at your pipe to pipe to pipe again...
it looks like you created the __id_ placeholders

are you sedding json data?
if so you may find jq useful

it has a steep learning curve, but well worth it

as an example, some output from api.tmdb.org

Code:

{"page":1,"total_results":2,"total_pages":1,"results":[{"vote_count":13,"id":45049,"video":false,"vote_average":7.6,"title":"The Code","popularity":0.731,"poster_path":"\/fvIEpbgUS45JLZg6OZpq6ke9wOI.jpg","original_language":"en","original_title":"The Code","genre_ids":[28,53,99],"backdrop_path":"\/kwG1vm97uUFTiGhTgiJYr9aB0AM.jpg","adult":false,"overview":"The Code is a Finnish-made documentary about Linux, featuring some of the most influential people of the free software movement.","release_date":"2001-09-26"},{"vote_count":0,"id":243915,"video":false,"vote_average":0,"title":"LINUX die Reise des Pinguins","popularity":0.6,"poster_path":null,"original_language":"de","original_title":"LINUX die Reise des Pinguins","genre_ids":[99],"backdrop_path":null,"adult":false,"overview":"","release_date":"2009-03-14"}]}

ugly json ;)
but with jq

Code:

<tmdb_api_output.json jq -C "."

{
  "page": 1,
  "total_results": 2,
  "total_pages": 1,
  "results": [
    {
      "vote_count": 13,
      "id": 45049,
      "video": false,
      "vote_average": 7.6,
      "title": "The Code",
      "popularity": 0.731,
      "poster_path": "/fvIEpbgUS45JLZg6OZpq6ke9wOI.jpg",
      "original_language": "en",
      "original_title": "The Code",
      "genre_ids": [
        28,
        53,
        99
      ],
      "backdrop_path": "/kwG1vm97uUFTiGhTgiJYr9aB0AM.jpg",
      "adult": false,
      "overview": "The Code is a Finnish-made documentary about Linux, featuring some of the most influential people of the free software movement.",
      "release_date": "2001-09-26"
    },
    {
      "vote_count": 0,
      "id": 243915,
      "video": false,
      "vote_average": 0,
      "title": "LINUX die Reise des Pinguins",
      "popularity": 0.6,
      "poster_path": null,
      "original_language": "de",
      "original_title": "LINUX die Reise des Pinguins",
      "genre_ids": [
        99
      ],
      "backdrop_path": null,
      "adult": false,
      "overview": "",
      "release_date": "2009-03-14"
    }
  ]
}

and this

Code:

jq -r ".results[0]|.id,.title,.release_date,.overview,.vote_count,.vote_average"

outputs this

Code:

45049
The Code
2001-09-26
The Code is a Finnish-made documentary about Linux, featuring some of the most influential people of the free software movement.
13
7.6

I use jq more and more,
I even wrote a very nasty sed script to convert pulseaudio's `pacmd list-sink-inputs` output to json because it was so much nicer to use jq to automate stuff

tl;dr use jq instead of sed
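
For the id list in this thread, that could be as small as the sketch below. It assumes youtube-dl prints one JSON object per line with an `id` field (the scripts above look for the same field) and that jq is installed; `ids_with_jq` is a made-up name.

```shell
#!/bin/bash
# ids_with_jq: read youtube-dl -j output (one JSON object per line) on
# stdin and print the unique "id" values. Requires jq.
ids_with_jq() {
    jq -r '.id' | sort -u
}

# Usage (list0.txt as in the scripts above):
#   youtube-dl -a list0.txt -j --flat-playlist | ids_with_jq > list2.txt
```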

oops, I forgot this when I got carried away with jq

Code:

....|sort|uniq...

sort can filter unique

Code:

sort -u

also have a very quick look at ytdl

play around with this

Code:

while read yt_id;do
    echo ${yt_id:2:11}
    # ${var:pos:len}
    # https://www.tldp.org/LDP/abs/html/parameter-substitution.html
    # note, bash usually starts counting at 0, so 2 is the third char.
done < <(
    youtube-dl \
    -a list0.txt \
    -j \
    --flat-playlist \
    --no-check-certificate \
    | jq -r "._filename"
)

you may notice that I did away with the need to write to storage
something else you can try

Code:

raw_ytdl_json=$(
    youtube-dl \
    -a list0.txt \
    -j \
    --flat-playlist \
    --no-check-certificate )

now that you have the json data in memory

Code:

<<<"$raw_ytdl_json" jq -r "._filename[2:13]"

/!\ note the 2:13 is not 2:11
https://stedolan.github.io/jq/manual/

Quote:

Array/String Slice: .[10:15]

The .[10:15] syntax can be used to return a subarray of an array or substring of a string. The array returned by .[10:15] will be of length 5, containing the elements from index 10 (inclusive) to index 15 (exclusive). Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).

Edit: tbh it's probably not the best idea to stick all that json data into the var, it could be huge!
the first example is better
I would probably stick the ids into a bash array and work with them later
something like

Code:

......
my_array+=("$yt_id") # instead of that echo
......
#later on
get_it(){
    id="$1"
    prefix="https://blahblah"
    suffix="some_end_bit"

    some_prog_to_dl_it "${prefix}${id}${suffix}"
}
for ((i=0;i<${#my_array[@]};i++));do
    # only fetch the ids we don't already have a file for
    [[ -e /some/dir/${my_array[i]}.mp4 ]] \
        || get_it "${my_array[i]}"
done

ok, so I got bored

This is quite dumb, and it's not secure to go piping stuff directly into ffmpeg, but for fun I came up with this

Code:

#!/bin/bash

# uncomment for optional proxy
#Proxy="--proxy a_squid_proxy:3128"

UserAgent="$(youtube-dl --dump-user-agent)"
# probably no need to use the same UA, but what the heck

InputList="$1"

Get_ids(){
    youtube-dl ${Proxy} \
        -a "$InputList" \
        -j --flat-playlist --no-check-certificate \
        | jq -j ".id,\" \",.title,\"\n\""
}

Get_url(){
    youtube-dl $Proxy \
    "https://www.youtube.com/watch?v=${id_title%% *}" \
    --no-warnings --no-check-certificate --skip-download -q -f 140 --get-url
}

Get_mp3(){
while read id_title;do
    # jq spat out "dehfuefhhf some song title"
    # ${id_title#* } // that deletes the id part
    # skip if we already have
    [[ -e "${id_title#* }.mp3" ]] && continue

    curl -s  -A "${UserAgent}" ${Proxy} \
        "$(Get_url)" \
        | ffmpeg -hide_banner \
        -i - \
        -c:a libmp3lame -aq 2 \
        "${id_title#* }.mp3"
    # exit after the first one
    exit
done< <(Get_ids)
}

[[ -e "$InputList" ]] \
    && [[ $(file -b --mime-type "$InputList" ) == "text/plain" ]] \
    && Get_mp3
exit

No real checks, just blindly does stuff
and having ffmpeg use stdin from some random internet page is asking for trouble

It would be *much* safer to dl the m4a and then use ffmpeg after testing that the m4a is actually aac data ( which is what you have already been doing ;) )
I just thought it would be fun to skip that step

see if you can come up with some checks on the ids and urls, that they are in the expected length/format.
maybe use parallel to start a chain of
Get_m4a && check_m4a && ffmpeg_to_mp3
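
As a starting point for the id check, here is a minimal sketch. It assumes a video id is exactly 11 characters from A-Z, a-z, 0-9, `_` and `-`; that is my assumption about the format, and `is_valid_id` is a made-up name.

```shell
#!/bin/bash
# is_valid_id: hypothetical check that a string looks like a video id,
# assumed here to be exactly 11 characters from A-Z a-z 0-9 _ and -.
is_valid_id() {
    [[ $1 =~ ^[A-Za-z0-9_-]{11}$ ]]
}

for id in dQw4w9WgXcQ "too short" 'bad;id$here'; do
    if is_valid_id "$id"; then
        echo "ok: $id"
    else
        echo "rejected: $id"
    fi
done
```

Anything that fails the check can be skipped before it ever ends up inside a url or a file name.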

since jq is a shiny new toy

Code:

mediainfo --Output=JSON

ffprobe -hide_banner \
        -print_format json \
        -show_format \
        -show_streams \
        -show_chapters

one day I might hack away at pulseaudio to have it output json