Open Source - shteynbuk

The First Text Analysis Python Project was my first project in unstructured text analysis.

My open-source contributions span several programming languages. The tools and languages used for each contribution are listed at the end of its description.

Pandas and pandas-datareader

While working on a stock screening application using Pandas, I encountered issues that only appeared with specific data values. Initially, I implemented a workaround in my own code; it worked correctly but was cumbersome and difficult to maintain. It also did nothing for other existing code that might behave inconsistently depending on the data. To fix the issue at its source, I contributed changes to Pandas itself as well as to pandas-datareader, which is now maintained as a standalone project.

My contributions included:

Improvements to missing value handling, unit tests, and documentation.
A signature-preserving decorator for compatibility with Python 2.
API-level changes, including updates to documentation and “What’s New” notes.
A temporary workaround for a known issue, along with active participation in related discussions.

Python, Cython, pytest, Sphinx, reStructuredText.
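
As an illustration of what "signature-preserving" means, here is a minimal sketch in modern Python 3. It is not the code that went into Pandas; the names log_calls and scale are hypothetical. On Python 3, functools.wraps records the original function in __wrapped__ and inspect.signature follows it. Python 2 had neither feature, so a plain wrapper there reported only (*args, **kwargs), which is why a dedicated decorator was needed.

```python
import functools
import inspect


def log_calls(func):
    """Hypothetical decorator that counts calls and keeps func's metadata."""

    @functools.wraps(func)  # copies __name__, __doc__ and sets __wrapped__
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@log_calls
def scale(values, factor=2.0):
    """Multiply every value by factor."""
    return [v * factor for v in values]


# The decorated function still reports the original name and signature.
print(scale.__name__)
print(inspect.signature(scale))
```

Without the functools.wraps line, the same introspection would show the wrapper's own name and a generic (*args, **kwargs) signature, which hurts generated documentation and interactive help.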

Pandas commits

https://github.com/pandas-dev/pandas/commits?author=olegshteynbuk

pandas-datareader commits

https://github.com/pydata/pandas-datareader/commits?author=OlegShteynbuk

Pandas pull requests

https://github.com/pandas-dev/pandas/pull/16589
https://github.com/pandas-dev/pandas/pull/16534

Pandas issues

https://github.com/pandas-dev/pandas/issues/16606
https://github.com/pandas-dev/pandas/issues/16471

pandas-datareader pull requests

https://github.com/pydata/pandas-datareader/pull/357
https://github.com/pydata/pandas-datareader/pull/364 (merged from the command line; the GitHub GUI incorrectly displays it as not merged, but the source code shows it as merged: https://github.com/pydata/pandas-datareader/commit/6cce5f18d52be802c7245c8a28d534236a9e2b24)

pandas-datareader issues

https://github.com/pydata/pandas-datareader/issues/363
https://github.com/pydata/pandas-datareader/issues/342

Arelle

Arelle pull requests

https://github.com/Arelle/Arelle/pull/98

geWorkbench

At the time, I was at the Center for Computational Biology and Bioinformatics (C2B2) at Columbia University. Most of my work focused on geWorkbench, a Java-based open-source desktop application for integrated genomics. My work on geWorkbench covered design, new features, enhancements, and bug fixes.

I also contributed to server-side work, primarily on a grid-based application called CaArray, an open-source array data management system that is accessible both through the web and programmatically. My responsibilities with CaArray included installation and administration.

Some tasks required advanced calculus and mathematical modeling.

Java, C++/C, Swing, JBoss, Tomcat, Ant.

GitHub Gist has Python and Java code.

Hedge Fund Investment Projection Program

This program projects when hedge fund investors can expect to receive their investments back over time. Most of the calculations are performed using Pandas.

The program reads data from an Excel file containing at least two worksheets: Liquidity Terms and Tranche Investments.

The program includes three scripts that generate reports and visualizations based on the data.

The emphasis is on the most common hedge fund withdrawal restrictions.

A more detailed description is in the HedgeFundsRedemption.md file.
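
To give a feel for this kind of calculation, here is a minimal sketch in plain Python. It is not taken from the program, which uses Pandas and its own Excel input. The text above does not list the restrictions, so the three used here (a lock-up period, a notice period, and redemptions only at calendar quarter ends) are assumptions, and all function and parameter names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Return d shifted forward by whole months, clamping the day of month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def quarter_end(d):
    """Return the last day of the calendar quarter that contains d."""
    last_month = ((d.month - 1) // 3 + 1) * 3
    last_day = calendar.monthrange(d.year, last_month)[1]
    return date(d.year, last_month, last_day)


def earliest_redemption(invested, as_of, lockup_months, notice_days):
    """First quarter end on which a tranche could be redeemed (assumed rules)."""
    lockup_end = add_months(invested, lockup_months)    # assumed lock-up rule
    notice_end = as_of + timedelta(days=notice_days)    # assumed notice rule
    return quarter_end(max(lockup_end, notice_end))     # assumed quarterly dates


# Hypothetical tranche: invested 2020-01-15, 12-month lock-up, 90-day notice.
print(earliest_redemption(date(2020, 1, 15), date(2020, 6, 1), 12, 90))
```

Applying a function like this to every tranche and grouping the results by date would give a simple projection of when capital becomes available.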

NYSE trading days for backtesting

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. However, backtesting and pattern recognition often require NYSE trading days going back many years. The rules have been modified to provide NYSE trading days and holiday observances starting from 1986.
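
The idea of a holiday observance rule can be sketched with the standard library alone. This is not the fork's rule set: it uses datetime rather than dateutil, it covers only two fixed-date holidays (Independence Day and Christmas) as an illustration, and the function names are hypothetical. The real calendar has more holidays and historical exceptions.

```python
from datetime import date, timedelta


def observed(holiday):
    """Shift a holiday that falls on a weekend to the nearest weekday."""
    if holiday.weekday() == 5:  # Saturday: observed on the preceding Friday
        return holiday - timedelta(days=1)
    if holiday.weekday() == 6:  # Sunday: observed on the following Monday
        return holiday + timedelta(days=1)
    return holiday


def trading_days(start, end, holidays):
    """Return the weekdays in [start, end] that are not observed holidays."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5 and current not in holidays:
            days.append(current)
        current += timedelta(days=1)
    return days


# Illustration only: two fixed-date holidays for 2020.
holidays_2020 = {observed(date(2020, 7, 4)), observed(date(2020, 12, 25))}
week = trading_days(date(2020, 6, 29), date(2020, 7, 3), holidays_2020)
print(week)
```

July 4, 2020 fell on a Saturday, so the sketch treats Friday, July 3 as the observed holiday and returns only the four preceding weekdays.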

Java program to perform several dataset operations

This Java program performs basic operations on datasets stored in CSV (comma-separated values) files. It reads a dataset into a dataframe and then operates on it.

The program can be used as a library or directly from the command line. Users can define operations using a simple language when running from the command line.

The main purpose of this project is to illustrate that in Java, the absence of a comprehensive library like Pandas makes advanced data processing quite time-consuming. In many cases, you may find it more efficient to use Python and Pandas, even if it requires learning a new language.

That said, if you are a Java developer who doesn't know Python and only needs to perform relatively simple column-based dataset operations, this tool could be a practical option.

For more details, please refer to the project’s README file.

Other contributions

Leo outlining editor

Submitted a bug fix for integration with Aspell, a spell-checking library for C++. Several years ago, Leo switched from Aspell to PyEnchant.

Python, C++.

Submitted code patches to Apache Axis to improve AxisFault logging and provided a temporary workaround.

Java.

Filed bug reports against Swing, JBoss, and several other Java tools and libraries.
