Web Scraping on a Schedule using GitHub Actions

A guide to scheduled web scraping using GitHub Actions, without a database.

Basic idea

Use a GitHub repository as storage for web scraping, driven by GitHub Actions (GitHub's workflow automation / CI/CD tool).

  1. GitHub Actions gets triggered when a push happens to the repo, or on a cron schedule.
  2. It checks out the repo
  3. Runs the scripts
  4. Pushes the repo, with the newly scraped data, back to GitHub using a publish action, as in the workflow below.
```yaml
name: Scrape Twitter Trends Data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron: '0 */8 * * *' # Every 8 hours. Ref https://crontab.guru/examples.html

jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: get twitter data
        env:
          ACCESS_TOKEN_SECRET: ${{ secrets.ACCESS_TOKEN_SECRET }}
          ACCESS_TOKEN: ${{ secrets.ACCESS_TOKEN }}
          CONSUMER_KEY: ${{ secrets.CONSUMER_KEY }}
          CONSUMER_SECRET: ${{ secrets.CONSUMER_SECRET }}
        run: |
          pip3 install pandas tweepy pytz
          python3 twitter_trends.py
      - uses: mikeal/publish-to-github-action@master
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GitHub sets this for you
```

If the script uses sensitive API keys, they can be declared as GitHub secrets and exposed to the job as environment variables, as in the env block above.
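The workflow runs twitter_trends.py, which isn't shown in the post. As a rough idea of what such a script could look like, here is a minimal sketch, assuming tweepy 4.x and the Twitter v1.1 trends endpoint; the WOEID and the trends.csv output file are hypothetical choices, while the env var names match the workflow above:

```python
import os
from datetime import datetime

import pandas as pd
import pytz
import tweepy

# Credentials come from the GitHub secrets exported as env vars by the workflow.
auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"], os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_TOKEN_SECRET"])
api = tweepy.API(auth)

WOEID = 1  # Yahoo! Where On Earth ID; 1 = worldwide trends
now = datetime.now(pytz.utc).isoformat()

# The v1.1 trends endpoint returns a one-element list wrapping the trend list.
# (On tweepy 3.x the call is api.trends_place(WOEID) instead.)
trends = api.get_place_trends(WOEID)[0]["trends"]

df = pd.DataFrame(
    [(now, t["name"], t["tweet_volume"]) for t in trends],
    columns=["scraped_at", "trend", "tweet_volume"],
)

# Append to a CSV tracked in the repo; the publish action commits it back.
out_file = "trends.csv"
df.to_csv(out_file, mode="a", header=not os.path.exists(out_file), index=False)
```

Because the CSV is appended to and committed back on every run, the repo itself becomes the time-series store.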

Limitations

Since every run commits the scraped data, the repository and its git history grow with each scrape, so this suits small datasets rather than a real database. GitHub's cron schedules are also best-effort: runs can be delayed during periods of high load, and scheduled workflows in public repos are disabled after 60 days of repository inactivity.

Example repositories

Happy data scraping/archiving!!

This work is inspired by sw-yx's repo.