How to scrape tweets using Tweepy
Last Tuesday, the “Sanremo Music Festival” kicked off.
Sanremo is the biggest Italian music festival. During the last night's show, I heard that the first episode had been the most commented on social media ever.
This news immediately gave me an idea: why not scrape the tweets about #Sanremo2021 and analyze them?
I googled a bit and then wrote a small script to scrape tweets from Twitter using a library called Tweepy.
Create your Twitter developer account
The first thing we need is a Twitter developer account. If you already have a Twitter account, you need to go here to create a developer account.
After you have created a developer account, you have to create a Project (or an App; I created a Project, but I guess the procedure is the same for an App).
After the creation of the Project or the App, you have to generate four codes: the API Key, the API Key Secret, the Access Token, and the Access Token Secret.
Keep these codes safe because we will need them in the next step.
Install Tweepy
Tweepy is the library we will use to download tweets from Twitter. To install it, you can follow the instructions in the documentation.
At this point, we have installed Tweepy and we are ready to write some code.
First of all, we need to set up the authentication: we have to provide the secret codes from the previous step to connect to Twitter and be able to use the Twitter APIs.
I created a separate file called config.py to store the four secret strings. This way, if I change the secret codes I only have to update them in a single file, and it is easy to publish the code on GitHub while keeping config.py out of version control.
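If you want to follow the same approach, config.py might look like this sketch (the variable names and placeholder values are illustrative):

```python
# config.py - the four secret strings generated in the developer portal.
# The names and placeholder values below are illustrative; replace the
# placeholders with your own keys and never commit this file.
api_key = "YOUR_API_KEY"
api_key_secret = "YOUR_API_KEY_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```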
Now we can start writing our scraper. First of all, we import the libraries we need and define our auth function:
In this piece of code, we read the credentials we put in the config.py file and use them to connect to the Twitter APIs.
My goal, in this case, is to download all the tweets that contain the hashtag #Sanremo2021, so I defined a function called search_by_hashtag.
This function takes the following parameters:
- api: the object we use to make all the calls to the Twitter APIs; we get it through the auth() function.
- date_since: we use this parameter to scrape only the tweets posted after date_since.
- date_until: we use this parameter to scrape only the tweets posted before date_until.
- words: this parameter defines the words (in my case, the hashtag) that must be present in the tweets we scrape.
This is the code for the search_by_hashtag function:
Everything is set: we just have to call the functions and start scraping, which we can do with this piece of code:
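Assuming the auth() and search_by_hashtag() functions defined earlier, the driver code might look like this (the dates are placeholders bracketing the 2021 festival week):

```python
def main():
    # auth() and search_by_hashtag() are the functions defined earlier;
    # the dates are illustrative placeholders
    api = auth()
    search_by_hashtag(api, "2021-03-02", "2021-03-07", "#Sanremo2021")
```

Calling main(), for example under an `if __name__ == "__main__":` guard, kicks off the scrape.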
Bonus — How to use this scraper in a docker container
At the end of this article, I want to briefly explain how to dockerize the scraper.
I needed to put this script in a docker container to use it on a VPS.
First of all, put all the files you created in a folder, and in the same folder create a file called Dockerfile (without any extension) with the following lines:
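A hypothetical Dockerfile for this setup; the script name scraper.py and the Python version are assumptions, so adapt them to your own files:

```dockerfile
# Hypothetical Dockerfile: the script name (scraper.py) and the Python
# version are assumptions - adapt them to your own project
FROM python:3.9-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir tweepy
CMD ["python", "scraper.py"]
```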
Then create another file called docker-compose.yml and put the following code into it:
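A minimal docker-compose.yml sketch that builds the image from the Dockerfile in the same folder; the service name scraper is illustrative:

```yaml
# Hypothetical docker-compose.yml; the service name is illustrative
version: "3"
services:
  scraper:
    build: .
```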
To execute the script you just have to launch the container with the command:
docker-compose up -d