Web Scraping on a Schedule Using GitHub Actions
A guide to scheduled web scraping using GitHub Actions, without a database
Basic idea
The idea is to use a GitHub repository itself as storage for web-scraped data, with GitHub Actions (GitHub's workflow automation / CI/CD tool) doing the scraping.
- A GitHub Actions workflow is triggered when a push happens to the repo, or on a cron job schedule.
- It checks out the repo
- Runs the scraping script
- Pushes the repo, with the newly scraped data, back to GitHub using a publish action
```yaml
name: Scrape Twitter Trends Data
on:
  push:
  workflow_dispatch:
  schedule:
    - cron: '0 */8 * * *' # Every 8 hours. Ref https://crontab.guru/examples.html
jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: get twitter data
        env:
          ACCESS_TOKEN_SECRET: ${{ secrets.ACCESS_TOKEN_SECRET }}
          ACCESS_TOKEN: ${{ secrets.ACCESS_TOKEN }}
          CONSUMER_KEY: ${{ secrets.CONSUMER_KEY }}
          CONSUMER_SECRET: ${{ secrets.CONSUMER_SECRET }}
        run: |
          pip3 install pandas tweepy pytz
          python3 twitter_trends.py
      - uses: mikeal/publish-to-github-action@master
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GitHub sets this for you
```
If the script uses sensitive API keys, they can be stored as GitHub secrets and exposed to the workflow as environment variables, as in the `env:` block above.
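The workflow expects a `twitter_trends.py` script at the repo root. The actual script isn't shown here, but a minimal sketch might look like the following. It reads the credentials from the environment variables set by the workflow; the WOEID `1` (worldwide trends) and the `data/trends.csv` path are illustrative choices, not part of the original:

```python
import os
from datetime import datetime, timezone

import pandas as pd
import tweepy

# Credentials are injected by the workflow's `env:` block from GitHub secrets.
auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"], os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_TOKEN_SECRET"])
api = tweepy.API(auth)

# WOEID 1 = worldwide trends. Tweepy 4.x names this get_place_trends();
# on Tweepy 3.x the equivalent call is api.trends_place(1).
trends = api.get_place_trends(1)[0]["trends"]

df = pd.DataFrame(trends)
df["scraped_at"] = datetime.now(timezone.utc).isoformat()

# Append to a CSV tracked in the repo; the publish action commits it back,
# so the repo itself acts as the "database".
os.makedirs("data", exist_ok=True)
csv_path = "data/trends.csv"
df.to_csv(csv_path, mode="a", header=not os.path.exists(csv_path), index=False)
```

With something like this in place, each scheduled run appends a fresh batch of data, and the commit history doubles as a timestamped change log.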
Limitations
- GitHub limits individual file sizes (files over 100 MB are rejected); for large files GitHub provides Git LFS (`git lfs track`)
- GitHub Actions has usage limits on workflow run time, storage, and scheduled-job frequency
Example repositories
Happy data scraping/archiving!!