Python


From Java to Python

Several years ago, while I was primarily focused on Java development, our team needed a tool for data manipulation and analysis. After we evaluated our options, the Python library Pandas emerged as the clear winner. Pandas is practically a world of its own; learning it arguably requires more effort than learning Python itself. I ended up using Pandas extensively and even contributed to Pandas (v0.21.0) and pandas-datareader (v0.5.0). You can find more details about these contributions here.

Before this, I had used Python occasionally, but my knowledge of the language was fairly superficial. That didn’t stop me from contributing to a Python-based open source project many years ago. My contribution focused on interoperability with a C++ library, so it was arguably more C++ than Python.

Over time, my exposure to Python deepened. I’ve worked with various Python modules, packages, and tools, including multithreading and concurrency. My first multi-threaded Python program ran slower than the single-threaded version (this was before Python 3.13). Coming from Java, I did not expect this.

It turned out that the CPython interpreter uses a Global Interpreter Lock (GIL) because the interpreter itself is not thread-safe; this applies to CPython versions before 3.13 and to the default build since then. As a result, only one thread executes Python bytecode at a time, and you need an I/O-bound program to see the benefits of multithreading in Python.

In our case, the solution was to use multiprocessing, which scales well for distributed processing. Another solution would be to do computationally intensive tasks in C++/C and use Cython.
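The difference between threads and processes for CPU-bound work can be sketched with the standard library alone. This is a minimal illustration, not our production code: `count_primes` and `run_tasks` are hypothetical names, and the workload is a toy chosen only because it keeps the CPU busy.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_tasks(executor_cls, limits, workers=4):
    """Run count_primes over `limits` using the given executor class."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # The guard is required on platforms that start worker processes
    # by re-importing the main module (for example, Windows).
    limits = [20000] * 4
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        results = run_tasks(executor_cls, limits)
        elapsed = time.perf_counter() - start
        print(executor_cls.__name__, results[0], "%.2fs" % elapsed)
```

On a GIL-enabled interpreter the thread pool is not expected to beat a plain loop for this kind of task, while the process pool can use several cores; the actual timings depend on the machine and on the cost of starting processes.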

As some of our tasks turned out to be I/O-bound, we added a task execution section to the system configuration. The user can configure the task execution engine, for example the number of processes and/or threads, and run experiments to find the optimal numbers. Initially, we built a custom solution using Python’s standard library, and later added Redis and RQ. We’re also evaluating distributed frameworks like Spark and Dask.
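A configurable task execution engine of this kind might look roughly like the sketch below. The configuration keys (`task_execution`, `engine`, `workers`) and the function name `make_executor` are assumptions made for illustration; they are not the names used in our system.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical configuration section; the key names are illustrative only.
DEFAULT_CONFIG = {"task_execution": {"engine": "threads", "workers": 4}}

# Maps the configured engine name to a standard-library executor class.
ENGINES = {"threads": ThreadPoolExecutor, "processes": ProcessPoolExecutor}


def make_executor(config):
    """Build an executor from the task execution section of `config`."""
    section = config["task_execution"]
    engine = section.get("engine", "threads")
    if engine not in ENGINES:
        raise ValueError("unknown engine: %r" % engine)
    return ENGINES[engine](max_workers=section.get("workers", 1))
```

Because both executor classes share the same interface, the calling code stays the same while the user experiments with different engines and worker counts, for example `with make_executor(DEFAULT_CONFIG) as pool: pool.map(task, items)`.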

We also use Python as a portable scripting language, with scripts running on both Linux and Windows.


Transitioning from Python 2 to Python 3

The migration from Python 2 to Python 3 was a rather big project. Our first step was to ensure the codebase was compatible with both versions, eliminating the need to maintain two separate codebases. This process also gave us the opportunity to review, clean, and refactor the code, especially sections that had been written when we were new to Python.

The transition was made smoother because we had been preparing for it, and all new code was written with Python 3 compatibility in mind whenever possible.

A more detailed description of our transition experience deserves its own entry. Here I just want to mention one side effect of the transition: code readability. After a large-scale code conversion, especially one using automated tools, the resulting code is often less readable, and code readability is, after all, one of the main reasons Python has become so popular.

A significant part of the conversion was an intermediate step, making the code compatible with both Python 2 and Python 3. This step introduced some extra code that is no longer necessary. We are very happy with the results and plan to clean up the remaining transitional code when time permits.
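To give an idea of what such transitional code looks like, here is a small sketch of a text helper that runs unchanged on Python 2 and Python 3. It is an illustration only; the names `PY2`, `text_type`, and `to_text` are hypothetical and not taken from our codebase.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name exists only on Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return `value` as a text string on both Python 2 and Python 3."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # On Python 2, `bytes` is an alias for `str`.
        return value.decode(encoding)
    return text_type(value)
```

Once Python 2 support is dropped, the version check and the `text_type` alias become dead weight, which is exactly the kind of code a later cleanup removes.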


What's Next

In future posts, I’ll describe different Python features with examples from my projects. A significant portion of my Python development involves data processing, web development, and scripting. Some of my data processing projects might require a separate entry, for example, extracting financial data from EDGAR regulatory filings (XBRL format, unstructured text).

First Text Analysis Python Project was my first unstructured text analysis project.

Some samples of Python code are on GitHub Gist:

This program calculates projections for when hedge fund investors will receive their investments over time, with most calculations performed using Pandas.

The program reads data from an Excel file containing at least two worksheets: Liquidity Terms and Tranche Investments.

The program includes three scripts that generate reports and visualizations based on the data.

The emphasis is on the most common hedge fund withdrawal restrictions.

A more detailed description is in the HedgeFundsRedemption.md file.

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. However, for backtesting or pattern recognition, there is often a need to access NYSE trading days from the past several years. The rules have been modified to provide NYSE trading days and holiday observances starting from 1986.

This website was created using Pelican, a static site generator written in Python.

Here are some of the Python libraries and tools that I use:

Pandas, NumPy, Beautiful Soup, Requests, Selenium, Cython, SciPy, NLTK, Matplotlib, pytest, unittest, Django, Flask, Redis, RQ, Pelican, reStructuredText, Sphinx.


C++


Below is a brief overview of some of the C++ projects I was responsible for. The information about these projects is already in the public domain.

  • Client Infrastructure Library Wrapper
The infrastructure was quite complex, serving as a company-wide system that included a persistence layer, a shared cache, and a messaging component. The goal of the client infrastructure library wrapper was to hide most of this complexity from end users. This approach not only simplified the use of the infrastructure but also made it compatible with third-party libraries and legacy code. For more information, see Presentations.
  • Database Design and DB Library Wrapper

All database-related activities were my responsibility, covering both data management and code. We used the c-tree Plus library as an embedded db engine in the application process address space.

To simplify and standardize database access, I developed an application-specific Object-to-Datastore Mapping library, essentially a wrapper around c-tree. This wrapper was used by the entire development team for all database calls.

Later, I added SQL support to the library, enabling integration with Sybase as an SQL database. The transition from c-tree to Sybase was smooth, and developers had the flexibility to choose between an embedded database and a Sybase server, depending on the needs of each project.

Beyond these projects, I participated in various C++ discussion groups, provided C++ consultations, and taught C++ classes. I also moderated the NYC Design Patterns Study Group and posted summaries of our discussions; most of the code samples were in C++.

After reading an early edition of Scott Meyers’ Effective C++, I reached out to the author with my comments. This led to an extensive email correspondence, which continued after the publication of his next book, More Effective C++. Scott Meyers cited me in subsequent editions of both Effective C++ and More Effective C++.

For memory management, I often relied on smart pointers. Below are some of the C++ libraries and tools I frequently used:

STL, Rogue Wave libraries, Orbix, CORBA, TIBCO, c-tree Plus


Presentations


These presentations describe a wrapper around a client C++ infrastructure library. The infrastructure was quite complex, serving as a company-wide system that included a persistence layer, a shared cache, and a messaging component. The goal of the client infrastructure library wrapper was to hide most of this complexity from end users. This approach not only simplified the use of the infrastructure but also made it compatible with third-party libraries and legacy code. This design was greatly influenced by the STL adapter classes, the reference classes from the IBM collection class library, and the Orbix TIE approach. In the External Polymorphism Pattern this design is cited as an independent discovery of a variant of the pattern.

These two presentations describe the same work from different angles. The UP'97 (international conference on Using Patterns) presentation is more detail-oriented and has more code samples, while the COOTS'97 (Conference on Object-Oriented Technologies) presentation is more design-oriented. The last sentence of the COOTS'97 presentation started heated debates. Participants in the debates included Bjarne Stroustrup, Scott Meyers, Doug Lea, and Andrew Koenig.


COOTS'97 - HTML, PDF

UP'97 - HTML, PDF



Tea

Tea has always been a big part of my life.

Like many in the Western world, I started with black tea, which is known as red tea in China. After trying nearly every type of tea, I settled on Pu-erh tea, which in China is often called black tea.

Pu-erh is earthy and warm, with an aroma and taste that can vary significantly depending on where it’s grown (even the slope of the mountain matters), how it’s processed, stored, aged, and which factory produced it - pretty much like wine in that sense.

Another great thing about pu-erh is that it can be aged, and in many cases, it gets better with time — like scotch. Isn't that great? Well, most of the time it is. Occasionally you get a miss — a tea that actually tasted better when it was young.

I drink other types of tea too, just not as often. If I ever run out of pu-erh (which is unlikely — I probably have enough to last me the rest of my life), I’d reach for oolong or green tea next, followed by black and white.

Most of my tea comes from China, and I usually brew it using a slightly modified version of Gongfu cha. I’ll be posting photos of my collection soon — I’m currently in the process of retasting my teas and updating my notes.

And in case you haven’t noticed: the bootstrap theme for this site is inspired by tea.


Web Development


In the early days of web development, the server generated web pages and the browser just displayed them. Now the task of generating web pages is split between the server and the client (browser), and there is a lot of discussion on the web about what should be done on the server and what on the client; a big part of it concerns communication between the two. My experience is more on the server side, but I have done a fair share of client-side development too.

Web scraping, which I have done a fair amount of, is not exactly web development; it is more like analyzing a website and extracting data from it. But the technical skills overlap with web development enough to justify putting it here instead of creating a separate blog entry. The more complicated and dynamic the site is, the more difficult the scraping is.

Below are some of the libraries and tools that I use:

Django, Flask, Redis, RQ, Celery, RabbitMQ, Docker, Heroku, Netlify, Pelican, HTML, CSS, JavaScript, jQuery, AJAX, Bootstrap, reStructuredText, REST, Beautiful Soup, Jsoup, Selenium, XML, JSON, YAML, JBoss, Tomcat, GlassFish.


DB - SQL, NoSQL

I have used different databases, both SQL (relational) and NoSQL (non-relational), depending on the project.

In Python, I would often use Pandas to load data from a database into a dataframe, process the data, and save the results.

In Java, I have used either an API such as JDBC or an ORM such as Hibernate.

In C++, I have used the c-tree Plus database library, which allows you to design your own index structures and build a database of either SQL or NoSQL flavor. Below is a description of how c-tree Plus was used in one of my projects.

All database-related activities were my responsibility, covering both data management and code. We used the c-tree Plus library as an embedded db engine in the application process address space.

To simplify and standardize database access, I developed an application-specific Object-to-Datastore Mapping library, essentially a wrapper around c-tree. This wrapper was used by the entire development team for all database calls.

Later, I added SQL support to the library, enabling integration with Sybase as an SQL database. The transition from c-tree to Sybase was smooth, and developers had the flexibility to choose between an embedded database and a Sybase server, depending on the needs of each project.

SQL

Often the projects that I was on required different levels of SQL handling.

For one project I did a conversion from Oracle and PL/SQL to Python, Pandas, and MySQL. Pandas was used, but a lot of the work was done in SQL scripts.

For another project, which was PostgreSQL-based, I wrote a database schema conversion script and developed a synchronization module between an application cache and a database. Triggers were developed in PL/pgSQL, C, and Python. It's a very nice feature of PostgreSQL that different languages can be used for stored procedures; in my case, the Python code that replaced code originally written in PL/pgSQL was simpler.

NoSQL

My experience with NoSQL databases includes an object-oriented (OO) database, a graph database, and an in-memory key-value database.

OO db. It was used for a Quartz-based project.

Graph db. On one of the projects I was assigned a ticket to fix a time-critical task. It turned out that the task was spending nearly an hour on a single Neo4j query, while the other queries took seconds. A workaround for the Neo4j issue was found, and the query time was reduced from about an hour to several seconds. Neo4j has very good docs.

Below are some of the databases that I was using:

PostgreSQL, Oracle, MySQL, SQLite, Cassandra, Neo4j, Sybase, SQL Server, Redis.


Development Environment


Currently, most of my development is done on Linux and/or Windows.

Version control is Git. In the past I have used Mercurial, Subversion, and CVS.

Most of my scripting these days is in Python and sometimes in Ant, mainly because Python and Ant scripts can be executed on both Linux and Windows. I have also used Perl and Groovy. Portability has its limitations, so I use the system shell when I have to. On several projects I was responsible for setting up and managing the development environment.

Most of the time I use PyCharm, and sometimes Eclipse, PyDev, Visual Studio Code, or any available text editor, depending on the environment.



Open Source

GitHub

First Text Analysis Python Project was my first unstructured text analysis project.

GitHub Gist


My open-source contributions were in different languages. Tools and languages that were used are listed at the end of each contribution description.


Pandas and pandas-datareader
While using Pandas in a stock screening application, I encountered problems that appeared only for certain data values. My first solution was a workaround. It worked, but it was rather cumbersome and ugly, and it did nothing for existing code that might work with some data but not with other data. Modifying the Pandas code looked like a better solution. The changes actually affected two projects, Pandas and pandas-datareader, as pandas-datareader is now a standalone project.
Below are the changes:
Improved missing-value handling, testing, and documentation.
Added a signature-preserving decorator for Python 2.
Some of the changes were API changes, which also required updates to the what's new notes and the documentation.
Provided a workaround for one of the issues as a temporary solution while the issue was being resolved, and participated in the discussions.
Python, Cython, pytest, Sphinx, reStructuredText.
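
To illustrate what signature preservation means, here is a minimal sketch for Python 3, where `functools.wraps` together with `inspect.signature` already does the job. It is not the code contributed to Pandas: Python 2 has no `inspect.signature`, so a Python 2 version needs a different technique. The names `count_calls` and `fill_missing` are hypothetical.

```python
import functools
import inspect


def count_calls(func):
    """Decorator that counts calls while keeping func's metadata."""

    @functools.wraps(func)  # copies __name__, __doc__ and sets __wrapped__
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def fill_missing(values, fill_value=0):
    """Replace None entries in `values` with `fill_value`."""
    return [fill_value if v is None else v for v in values]


# inspect.signature follows __wrapped__, so the original parameters
# are reported instead of the wrapper's (*args, **kwargs).
print(inspect.signature(fill_missing))
```

Without `functools.wraps`, the reported signature would be `(*args, **kwargs)`, which hides the real parameters from documentation tools and from interactive help.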



pandas-datareader pull requests:
https://github.com/pydata/pandas-datareader/pull/364 (merged from the command line, so the GitHub GUI incorrectly displays it as not merged; the merge is visible in the source history: https://github.com/pydata/pandas-datareader/commit/6cce5f18d52be802c7245c8a28d534236a9e2b24 )


Arelle
Arelle pull requests


geWorkbench
At the time I was at the Center for Computational Biology and Bioinformatics (C2B2), Columbia University.
Most of my work was on geWorkbench, a Java-based open-source desktop application for integrated genomics. On geWorkbench I worked on new features, design, enhancements, and bug fixes.
There was some server-side work too, mostly on CaArray, a grid-based open-source array data management system accessible through the web and programmatically. Part of my CaArray responsibilities included installation and administration.
Some of the tasks required advanced calculus and mathematical modeling.
Java, C++/C, Swing, JBoss, Tomcat, Ant.


GitHub Gist has Python and Java code.

This program calculates projections for when hedge fund investors will receive their investments over time, with most calculations performed using Pandas.

The program reads data from an Excel file containing at least two worksheets: Liquidity Terms and Tranche Investments.

The program includes three scripts that generate reports and visualizations based on the data.

The emphasis is on the most common hedge fund withdrawal restrictions.

A more detailed description is in the HedgeFundsRedemption.md file.

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. However, for backtesting or pattern recognition, there is often a need to access NYSE trading days from the past several years. The rules have been modified to provide NYSE trading days and holiday observances starting from 1986.

This Java program performs basic operations on datasets stored in CSV (comma-separated) files. It reads the dataset into a dataframe to perform various operations.

The program can be used as a library or directly from the command line. Users can define operations using a simple language when running from the command line.

The main purpose of this project is to illustrate that in Java, the absence of a comprehensive library like Pandas makes advanced data processing quite time-consuming. In many cases, you may find it more efficient to use Python and Pandas, even if it requires learning a new language.

That said, if you are a Java developer who doesn't know Python and only needs to perform relatively simple column-based dataset operations, this tool could be a practical option.

For more details, please refer to the project’s README file.

Other contributions
Submitted a bug fix to Leo for its integration with aspell, a C++ spellchecking library. Several years ago Leo switched from aspell to PyEnchant.
Python, C++.

Submitted code patches to Apache Axis to improve AxisFault logging and provided a workaround.
Java.

Filed bug reports against Java Swing, JBoss, and several other Java tools and libraries.


Java


My Java experience is evenly split between Core Java and J2EE.

In Core Java, I have utilized various packages and tools, including collections, multithreading, and concurrency. A partial list of the Java packages and libraries I’ve used is included at the end of this entry.

Many Java applications begin as single-threaded, but as performance demands grow, they often evolve into multi-threaded applications. Java provides robust concurrency APIs.

In J2EE, my primary focus has been on server-side development, but I also have significant experience with client-side development. On the client side, I have developed both desktop-based and browser-based clients. For desktop applications, I have used Swing. My browser-based clients also included non-Java technologies such as JavaScript, HTML, and CSS. On the server side, I’ve worked with various application servers and libraries — a partial list can be found at the end of this entry.

I’ve also worked with XML processing, integrated various data feeds, created and consumed web services, and extracted data from websites (web scraping). I was also responsible for the migration from Java 8 to Java 11.

While working with Java, I’ve been involved in both new application development and maintaining legacy systems. Probably a separate entry is needed for a more detailed description of the projects.

You can find a few Java code samples on my GitHub Gist:

This Java program performs basic operations on datasets stored in CSV (comma-separated) files. It reads the dataset into a dataframe to perform various operations.

The program can be used as a library or directly from the command line. Users can define operations using a simple language when running from the command line.

The main purpose of this project is to illustrate that in Java, the absence of a comprehensive library like Pandas makes advanced data processing quite time-consuming. In many cases, you may find it more efficient to use Python and Pandas, even if it requires learning a new language.

That said, if you are a Java developer who doesn't know Python and only needs to perform relatively simple column-based dataset operations, this tool could be a practical option.

For more details, please refer to the project’s README file.

Below are some of the libraries and tools that I was using:

Core Java, Java EE (J2EE), JSP, Spring, Hibernate, EclipseLink, JPA, EJB, JMS, MQ, TIBCO, Twitter4J, Jsoup, Tomcat, JBoss, GlassFish, Log4j, Guice, JUnit, Mockito, jQuery, Apache Axis, JXPath, JAXB, Swing, Restlet, Ant, Maven.


Miscellaneous Development


This page is a temporary place for some of the topics, in no particular order, that I was working on and that don't have their blog entry yet, but hopefully will find their home soon.


Multi-threaded and distributed programming, concurrency. Java vs Python - multithreading and multiprocessing, GIL.

Messaging, distributed cache - Redis, RQ, Celery, RabbitMQ, JMS, TIBCO, IBM MQ, Oracle Coherence. On one project we used Qpid for testing and IBM MQ for production.

RSS and Twitter feeds processing. Twitter4J, JAXB.

Science and Math; data structures and algorithms - Pandas, NumPy, SciPy.

Decision support system for railroad network management. Mathematical model describing the flow of railroad cars in a railroad region.

Lexical analyzer and parser development.

Sysadmin and dbadmin tasks. Internal Maven repository setup.

Troubleshooting and resolving production issues.

Working with legacy code, maintenance, enhancement, refactoring.

Other languages - Fortran, Assembler, Pascal.

Design, Design Patterns.

Publications - decision support systems, operating systems, numerical solutions of partial differential equations.

Taught programming and math as a part-time job.

Participated in different Python, Java, and C++ forums and mailing lists. Moderated The NYC Design Patterns Study Group and posted summaries of the discussions.


Technical Writing


Technical writing is sometimes defined as simplifying the complex (techwhirl.com), which is the same goal I try to achieve in my code. There is a saying that explaining helps you better understand what you are doing, and in this sense writing documentation often helps me better understand the code.

If the code has any value, people will read and modify it. Any documentation that clarifies the code will be helpful, even when you are reading your own code many years later.

The same could be said about user guides, which help the users of a tool or a library and don't force them to read the code in order to use it. One example is a J2EE project where I was responsible for migrating the project to Maven. Part of my job was setting up the project structure, including the build and deployment, and also setting up an internal Maven repository. Writing the user guides had an extra benefit: members of the team were able to use Maven without asking me a lot of Maven-related questions.

Most of my technical writing was part of the internal documentation and isn't in the public domain. What is on this site is in the public domain, as are my open source contributions that include documentation.

Even though technical writing was a relatively small part of my job and not a full-time occupation, I took several academic writing courses at Columbia University and passed the level 10 exam; my teachers liked my writing and encouraged me. I really enjoy the creative process of writing.

Below are some of the tools that I am using:

reStructuredText, Markdown, Sphinx, wiki.


Testing


I have done functional and unit testing. There was even a period in my career when testing was my main responsibility, but this was some time ago.

Ideally, you want to test everything. But there is a lot of prototyping and experimentation, especially in Python, and writing tests for throwaway code looks like a waste of time. There are also deadlines, and testing is often their first casualty. One place where testing is rigorously enforced is open source development, and that is definitely the case with Pandas. Any Pandas contribution should have a corresponding test case, and I learned a lot about Python testing as a result of my Pandas contributions.

In C++ I wrote a client/server test bed for system testing of the infrastructure. This was the period when testing was my main responsibility.

In Java and Python, I wrote a fair number of unit tests, but they were mostly run-of-the-mill unit tests, similar to what you can find online.

Below are some of the libraries and tools that I was using:

pytest, unittest, mocking, Selenium, JUnit, Mockito, PowerMock.


Ant Script to Update/Install Eclipse


This Ant script, originally written several years ago and previously hosted on this site, is now available on GitHub.

At the time the script was created, Eclipse was still relatively new. While it offered extensibility, many essential features were left to plugin developers. As a result, Eclipse users often found themselves waiting anxiously for new builds or updated versions of key plugins.

However, updating Eclipse — or its plugins — was anything but straightforward. There were many undocumented or poorly documented rules, and plugin packaging conventions varied widely. Manual installation was common, and sometimes updates introduced compatibility issues that could render a workspace unusable. The script addressed all these issues and was easily customizable; sometimes you just need a command line tool.

It's hard to estimate how widely the script was used, but there were signs of its use: user emails with questions, feature requests, suggestions, and code contributions. The script was reviewed in several blogs and newsgroups and, for a time, it ranked at the top of Google search results for "Ant script." At the time Apache Ant itself was gaining popularity as a build tool, an alternative to the UNIX make utility, and this script extended Ant’s utility.

The following text was written alongside the script’s initial release. While some statements may now seem dated, they’ve been left untouched to preserve the context and spirit of the time:

Eclipse is a great IDE. It is relatively new and the speed with which Eclipse team introduces new features is amazing. However, my first update to a new Eclipse build was rather time consuming; and from reading news I realized that other developers were struggling with updates too. The script simplified updates and, over time, evolved to incorporate some other related activities. All the documentation is in the Readme file. One chapter in the Readme file, named What is the Right Way to Update?, is different from the other chapters as this chapter is not about how to use the script but rather why and when to use it.

You can download a zip file from this site or just browse Readme file online. Latest changes are in Release Notes. It is worth noting that Eclipse comes with the Update Manager but it seems to be for major releases and not for builds. The primary audience for this script is developers who update Eclipse quite often.

Read What Others Have Said:

http://youarenumber6.blogspot.com/2004/08/departmental-eclipse-with-ant.html

http://www.jroller.com/page/dorodok/20030106

