blog.speedstor.net -- A blog maintained by a pessimistic over-confident High-School kid.

Wednesday, April 10, 2019

Get YouTube automatic transcripts

A week ago, I finally started making an app that lets people read YouTube videos rather than spend their time watching them. Right now, a video takes about 10 minutes for one person to finish, and I want to improve on that by giving people the option to speed-read the transcript. Although I have not finished the app, I did figure out how to get the transcript automatically. And because I really do not have many things to say on my blog anymore, I will explain how I found this way of getting the transcript.

First, as with any of my other apps, I started with a Google search. I expected a straightforward answer that would let me go straight to building the app, but the truth is that it is not that simple: people do not just announce their findings when they figure something out. This blog is partly here to try to change that. With my Google search unsuccessful, I was forced to investigate youtube.com myself. In the Network tab of Google Chrome's developer tools, a whole cluster of requests shows up when you visit the site. After examining each and every one of them, I finally found that when you turn on transcription, a link starting with youtube.com/api/timedtext? is used to fetch the transcript. But among the GET parameters of the link there is one called signature, and it is different every time the previous link expires. youtube.com generates this link and lets its own website use it. As far as I could tell, there is no way to crack the signature and generate it ourselves, and reverse engineering a program whose code I do not even have access to proved to be out of my league. So I was stuck with trying a different way.
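To make the shape of that link concrete, here is a minimal Python sketch that splits a timedtext-style URL into its GET parameters. The URL is a placeholder: only the /api/timedtext path and the signature parameter come from what I saw in the Network tab, while v, lang, expire and all of the values are assumptions made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def describe_timedtext_url(url):
    """Return (path, params) for a URL, with params as a plain dict."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


# Placeholder link. Only the path and the "signature" name come from the post;
# the other parameter names and every value here are hypothetical.
example_url = (
    "https://www.youtube.com/api/timedtext"
    "?v=VIDEO_ID&lang=en&expire=1554900000&signature=PLACEHOLDER.SIGNATURE"
)

path, params = describe_timedtext_url(example_url)
print(path)
for name, value in params.items():
    print(name, "=", value)
```

The point is only that everything, including the signature, is visible in the link itself. Copying the link works until it expires, but a fresh signature cannot be built by hand.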

Continuing to investigate the Network tab, I found that the page also requests a link of youtube.com?getTranscript. I dug into it but hit a roadblock when I found that it relies on a POST parameter called session_token. If you do not know much about networking, I can tell you that, in my experience, a POST parameter is much harder to fake than a GET parameter. A GET parameter sits right in the link, while a POST parameter is sent in the body of the request, so you never see it in the address bar, and its value is usually encoded. That meant the second method, faking a session_token, looked impossible to me, so I went back to the first method.
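The difference between the two kinds of parameter can be sketched in Python with the standard library. Nothing is sent over the network here; the code only builds the two requests and shows where the parameter ends up. The example.com address and the PLACEHOLDER token value are assumptions for illustration, not YouTube's real endpoint or a real token.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(base_url, params):
    """GET: the parameters are appended to the link itself."""
    return Request(base_url + "?" + urlencode(params))


def build_post(base_url, params):
    """POST: the parameters travel in the request body, not in the link."""
    return Request(base_url, data=urlencode(params).encode("utf-8"))


# Hypothetical endpoint and token value, used only to show the mechanics.
params = {"session_token": "PLACEHOLDER"}
get_req = build_get("https://example.com/transcript", params)
post_req = build_post("https://example.com/transcript", params)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Building a POST request is not hard in itself. The hard part is that the token value has to be one the server actually issued for your session.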

At that point I had identified every way I could find of getting the automatic transcript from YouTube, and the first one still seemed the easier of the two. But I still could not figure out how to get the signature. Then, with luck on my side and after days upon days of searching and investigating, I finally found the answer, and it made me feel like the dumbest person on earth. The signature, along with the full link to get the transcript, had been sitting in the HTML of the YouTube video page all along. Although it is encoded in a simple text format, it was plainly visible and straightforward to find.
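Pulling the link out of the page can be sketched like this in Python. This is a sketch under assumptions, not the exact method: it assumes the link sits inside a quoted string in the page source with / written as \/ and & written as \u0026, and the sample HTML, the baseUrl key and the parameter values are all made up for illustration. The real page may encode the link differently.

```python
import re


def find_timedtext_url(html):
    """Return the first timedtext link found in the page source, or None."""
    # Match from "http" up to the next quote or whitespace, requiring
    # "api/timedtext" (optionally written as "api\/timedtext") along the way.
    match = re.search(r'https?:[^"\s]*?api\\?/timedtext[^"\s]*', html)
    if match is None:
        return None
    raw = match.group(0)
    # Undo the two escapes this sketch assumes: \u0026 for & and \/ for /.
    return raw.replace("\\u0026", "&").replace("\\/", "/")


# Made-up page fragment; a raw string keeps the backslashes literal.
sample_html = (
    r'<script>var data = {"baseUrl":"https:\/\/www.youtube.com\/api\/timedtext'
    r'?v=VIDEO_ID\u0026signature=PLACEHOLDER\u0026lang=en"};</script>'
)

print(find_timedtext_url(sample_html))
```

In a real run, the html string would be the downloaded source of the video page, and the returned link could then be fetched like any other URL until it expires.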

In the end, I found out how to get an automatically generated transcript from YouTube and can finally continue making an app for people to speed-read YouTube. It will be a long journey because I still need to learn Android Studio and Xcode. Although the answer I found is, in the end, simple and laughable, it still improved my understanding of networking, because I now know much more about the ins and outs of how websites and servers communicate.


