Python


Several years ago, when I was doing mostly Java development, we needed a data manipulation and analysis tool, and the Python library Pandas came out as the clear winner. Pandas is a world in itself, and learning Pandas arguably requires more effort than learning Python. I have used Pandas rather extensively and made several contributions to Pandas (v0.21.0) and pandas-datareader (v0.5.0); more details about these contributions are here.

I had used Python occasionally before, but my knowledge of it was rather superficial, which didn't stop me, many years ago, from contributing to an open source project written in Python. My contribution was about interoperability with a C++ library, so it might have been more C++ than Python.

I have used various Python modules, packages, and tools, including those for multithreading and concurrency. My first multi-threaded Python program actually ran slower than the single-threaded one.

Coming from Java, this was not what I expected. As it turned out, the Python interpreter (I use CPython) uses a Global Interpreter Lock (GIL), because the interpreter's memory management is not thread-safe. Only one thread executes Python bytecode at a time, so you need an I/O-bound program to see the benefits of multithreading in Python. There are debates on the web about the GIL, with many proposed solutions, each with its pros and cons.
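A minimal sketch of the effect, assuming CPython with the GIL enabled. The function names (`count_down`, `run_sequential`, `run_threaded`) and the loop size are made up for illustration and are not from my original program. On such an interpreter the threaded run of this CPU-bound loop is typically no faster than the sequential one, and is sometimes slower; the exact timings depend on the machine and the Python version.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """Pure-Python CPU-bound loop; it needs the GIL for every iteration."""
    while n > 0:
        n -= 1
    return n


def run_sequential(n, tasks):
    """Run the same task several times, one after another."""
    return [count_down(n) for _ in range(tasks)]


def run_threaded(n, tasks):
    """Run the same tasks in a thread pool, one thread per task."""
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        return list(pool.map(count_down, [n] * tasks))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Illustrative workload size; adjust it for your machine.
    _, sequential_time = timed(run_sequential, 2_000_000, 4)
    _, threaded_time = timed(run_threaded, 2_000_000, 4)
    print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

If `count_down` is replaced with something that mostly waits, such as a network request or `time.sleep`, the threaded version should finish noticeably sooner, because a thread releases the GIL while it is blocked.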

In our case, the solution was to use multiprocessing, which also scales well to distributed processing. Another solution would be to implement computationally intensive tasks in C/C++ and use Cython. As some of our tasks turned out to be I/O-bound, we added a task execution section to the system configuration. The user can configure the task execution engine, for example the number of processes and/or threads, and run experiments to find the optimal numbers; by default, there are no threads, only processes. A custom solution using modules from the Python standard library was developed first, and a Redis- and RQ-based solution was added later. We are also considering using a distributed framework like Spark or Dask.
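The idea of a configurable execution engine can be sketched with `concurrent.futures` from the standard library. This is not our actual implementation: the configuration keys (`"processes"`, `"threads"`) and the helpers `make_executor` and `run_tasks` are hypothetical. The sketch also simplifies things by choosing either a thread pool or a process pool, rather than combining both.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical defaults that mirror "no threads, only processes".
DEFAULT_CONFIG = {"processes": os.cpu_count() or 1, "threads": 0}


def make_executor(config=None):
    """Return a thread pool if threads are requested, otherwise a process pool."""
    settings = {**DEFAULT_CONFIG, **(config or {})}
    if settings["threads"] > 0:
        return ThreadPoolExecutor(max_workers=settings["threads"])
    return ProcessPoolExecutor(max_workers=settings["processes"])


def run_tasks(func, items, config=None):
    """Apply func to every item using the configured executor, keeping order."""
    with make_executor(config) as pool:
        return list(pool.map(func, items))


def square(x):
    """Stand-in for a real task; defined at module level so it can be pickled."""
    return x * x
```

With this sketch, `run_tasks(square, range(5))` would use processes, and `run_tasks(square, range(5), {"threads": 4})` would switch to threads, so the same task can be timed under both settings. On platforms that start worker processes with spawn (Windows, macOS), the process-based call should sit under an `if __name__ == "__main__":` guard.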

We also use Python as a portable scripting language, as our scripts run on Linux and Windows.

Moving from Python 2 to Python 3 was a rather big project. The first step was to make the source code compatible with both Python 2 and 3; this eliminated any need to maintain two different code bases during the transition. It was also a good opportunity to go over the code and do some cleaning and refactoring, which was especially true for the code written when we were new to Python. The transition was helped by the fact that we had been preparing for it for some time, and all new code was written with Python 3 in mind whenever possible.

A more detailed description of our transition experience deserves its own entry. Here I just want to mention one side effect of the transition: code readability. After almost any code conversion you will probably end up with less readable code, especially when using conversion tools, and readability is one of the main reasons Python became so popular. A big part of the conversion was an intermediate step: making the code compatible with both Python 2 and Python 3. This intermediate step introduced some extra code that is no longer needed. We are Python 3 only from now on and are very happy with the results of the transition; we just need to find some extra time for the code cleanup.
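To give an idea of the kind of extra code that a compatibility step tends to leave behind, here is a generic sketch. It is not taken from our code base: the names `PY2`, `text_type`, and `to_text` are only illustrative.

```python
import sys

# Compatibility shim: only one of these branches ever runs on a given interpreter.
PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name exists only on Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return value as text on both Python 2 and Python 3."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```

Once the code is Python 3 only, the `PY2` branch and the `text_type` alias become dead weight: `text_type` is simply `str`, and a helper like `to_text` can often be replaced by a plain `str(...)` or `.decode(...)` call at the call site.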

Later I will describe some of the cool features of Python with examples from my projects, probably in a separate entry. A big part of my Python development is data processing and web development, along with various scripts. Some of my data processing projects might require a separate entry, for example extracting financial data from EDGAR regulatory filings (XBRL format, unstructured text).

First Text Analysis Python Project was my first unstructured text analysis project.

Some samples of Python code are on GitHub Gist.

The program reads data from an Excel file, which should have at least two worksheets: one for the funds' liquidity terms and the other for the tranche investments. There are three scripts to produce reports and graphs. The program focuses on the most common hedge fund withdrawal restrictions. A more detailed description is in the HedgeFundsRedemption.md file.

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid only from the present onward. Sometimes, for backtesting or pattern recognition, there is a need for NYSE trading days going back several years. The rules were modified to produce NYSE trading days and holiday observances from 1986 onward.

This website was created using Pelican, which is a Python tool.

Below are some of the Python libraries and tools that I use:

Pandas, NumPy, Beautiful Soup, Requests, Selenium, Cython, SciPy, NLTK, Matplotlib, pytest, unittest, Django, Flask, Redis, RQ, Pelican, reStructuredText, Sphinx.
