Scraping a bank website to access account operations


For several years now, I've built and maintained an Excel worksheet for my budget. Every income and expense is logged so that I can plan future expenses, follow where my money goes, plan my savings accounts and so on. To do that, however, I need data, i.e. all the transactions recorded in my account. Entering every transaction manually takes time, and making sure I haven't forgotten one takes even more. As I am "lazy" by nature (meaning I can spend hours of my time coding away something repetitive, I love automation!), I wanted to quickly write a scraping script that could retrieve all my transactions. I contacted my bank (Desjardins, a Canadian bank) and was told that they offered no API, but that the Mint application was able to connect to their website to collect customer account information, which is exactly what I wanted to do. If they can do it, so can I!

The scraping tools

The best scripting language, in my opinion, is Python. It has such a huge community and so many helpful libraries that coding with Python is fast and "easy". Having spent some evenings having fun with website (back/front-end) development, just to learn, I already knew how data moves between web pages and servers through forms, requests and AJAX, and I am quite familiar with JavaScript. So let the fun begin!

Google Chrome

The first step in scraping my bank's website is to log in with my credentials. This can also be the tricky part, because of security of course. The best way to know what data to send to the bank server to log in is to use Google Chrome (or Firefox, or whatever browser you want) with the magic F12 key, which opens the developer tools for the current page. Lots of information can be found there:

  1. The HTML source code.
    The simplest is the HTML code. This is where you find all the content of your web page, so I focused on the form tags. Forms allow the page to send data to the server, for example your account number, your password, your personal information etc. This is also where you find the hidden form entries that are sent along to the server to prove that a request actually comes from the website.
  2. The sources
    The other important part, mostly used when you want to debug a JS script, is the Sources tab. This is where you find all the JS scripts loaded by the web page. In particular, we'll focus on the scripts written by the bank, especially the AJAX calls that send data to the server, or the code that generates parts of the web page.
  3. The network
    This is perhaps the most important tool. As you browse a web page, data goes back and forth to the server. Different methods can be used; the most common are GET and POST. The GET method is used to retrieve data (HTML, JSON, XML...) from the server, whereas the POST method is used to send data to the server. Here, we are particularly interested in the POST entries, because they show exactly what data is sent to the server, for example when you log in. To find this information, just select a POST entry and look at its form data section. This is all you need, and this is what you have to focus on when trying to log on to a website automatically! A minimal sketch of how this maps to Python code follows the list.
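To make the link with the code concrete, here is a minimal sketch of how a POST entry seen in the Network tab can be replayed from a script. The URL and field names below are made up; they simply stand in for whatever appears in the form data section of your own POST entry.

```python
import requests

# Hypothetical example: the URL and field names are placeholders, not the real
# bank endpoints. Each key/value pair mirrors one line of the "Form Data"
# section of a POST entry shown in Chrome's Network tab.
LOGIN_URL = "https://www.example-bank.com/login"

form_data = {
    "userId": "my-account-number",       # field name copied from the Network tab
    "password": "my-password",
    "token": "value-of-a-hidden-input",  # hidden field found in the HTML form
}

response = requests.post(LOGIN_URL, data=form_data)
print(response.status_code)  # 200 means the server accepted the request
```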

Python

As I already mentioned, I used the Python scripting language to write my scraper. Python is a high-level language, very efficient, quite easy and powerful. I mainly used two libraries that did all the job for me: requests and BeautifulSoup. The requests library lets you simply perform a request to a server and get its response. This is exactly what we need when we want to GET the HTML content of a page, or when we want to send data to the server and read its answer. Furthermore, requests has a session functionality that keeps the cookies between requests, which is essential on a secured website! The server response can be HTML-formatted text, JSON, XML etc. JSON content is easy to parse in Python with dictionaries, but HTML content is trickier. Fortunately, BeautifulSoup is an excellent "library for pulling data out of HTML or XML". This is exactly what I needed to parse the hidden content of a form (for example the tokens that are used to validate access to the server) and, later on, to retrieve the transaction information.
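Here is a minimal sketch of how the two libraries fit together, assuming hypothetical URLs and field names (the real ones are whatever you find with the developer tools):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URLs: placeholders for the pages identified with the dev tools.
LOGIN_PAGE = "https://www.example-bank.com/login"
OPERATIONS_URL = "https://www.example-bank.com/api/operations"

session = requests.Session()  # the session keeps the cookies set by the server

# 1. GET the login page and pull the hidden form fields with BeautifulSoup.
soup = BeautifulSoup(session.get(LOGIN_PAGE).text, "html.parser")
form_data = {
    tag["name"]: tag.get("value", "")
    for tag in soup.find_all("input", type="hidden")
    if tag.has_attr("name")
}

# 2. Add the visible credentials and POST the whole form back.
form_data.update({"userId": "my-account-number", "password": "my-password"})
session.post(LOGIN_PAGE, data=form_data)

# 3. Once logged in, a JSON response parses directly into Python dictionaries.
operations = session.get(OPERATIONS_URL).json()
for operation in operations:
    print(operation)
```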

The website scraping code

So here we are. Let the fun begin! I provide here the full code of my script. There are several sections, one for each web page (login, password, security question, welcome page, data request, data parsing...).
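As an overview, the flow of those sections looks roughly like this. This is only a sketch: the URLs, field names and function names are placeholders, not the real bank endpoints.

```python
import requests

# Sketch of the step-by-step flow only; every URL and field name below is a
# placeholder, not a real Desjardins endpoint.

def submit_account_number(session, account_number):
    """First login page: identify the customer."""
    return session.post("https://www.example-bank.com/login/identify",
                        data={"accountNumber": account_number})

def answer_security_question(session, answer):
    """Second page: answer the personal security question."""
    return session.post("https://www.example-bank.com/login/question",
                        data={"answer": answer})

def submit_password(session, password):
    """Third page: send the password to reach the welcome page."""
    return session.post("https://www.example-bank.com/login/password",
                        data={"password": password})

def request_operations(session, account_id):
    """Once logged in, request the list of operations for one account."""
    response = session.get(
        f"https://www.example-bank.com/accounts/{account_id}/operations")
    return response.json()
```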

Helper module to manage account information
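As a rough idea of what such a module can look like (the file name and keys are hypothetical), the credentials are kept in a local JSON file rather than hard-coded in the scraping script itself:

```python
import json

CREDENTIALS_FILE = "credentials.json"  # hypothetical file name

def load_account_info(path=CREDENTIALS_FILE):
    """Return a dictionary with the account number, password and the answers
    to the security questions, loaded from a local JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```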

And finally, helpful functions to build the form data to send to the server, and to write the retrieved data to a JSON file.
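A minimal sketch of what those helpers can look like (the names are mine, not the original ones):

```python
import json

def build_form_data(hidden_fields, **user_fields):
    """Merge the hidden inputs of a form with the user-supplied fields into
    one dictionary ready to be POSTed to the server."""
    data = dict(hidden_fields)
    data.update(user_fields)
    return data

def write_operations(operations, path="operations.json"):
    """Dump the retrieved operations to a JSON file that can later be imported
    into the budget spreadsheet."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(operations, f, indent=2, ensure_ascii=False)
```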

 
