I've been given a dataset and I need to assess its quality.

Use Pandas Profiling to quickly generate a document that will provide you with a first overview of the data.
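
For example, here is a minimal sketch of generating the report, assuming the data lives in a CSV file (the file name is a placeholder):

import pandas as pd
from pandas_profiling import ProfileReport

# Load the data and produce a standalone HTML report to review in a browser.
df = pd.read_csv("data.csv")
profile = ProfileReport(df, title="Dataset overview")
profile.to_file("report.html")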

Your first step should be to look at the warnings and messages at the top of the document. Entries about missing values will point you to variables that may need attention during the data cleaning and data imputation phases of your machine learning problem. Since you are doing an assessment, simply note that data is missing in these variables, then load the data in a pandas dataframe and look at a few examples to see if you can determine why.

Are there a lot of duplicated rows? Depending on the data you've been provided, this may help you identify whether something is wrong with it. If all entries are supposed to be unique because each one represents a single (entity, timestamp, target) tuple, then you should ask yourself why that isn't the case. Is it possible that the dataset was created by appending a collection of other documents, leading to duplicated lines? If so, you may have to do some preprocessing to get rid of the duplicate rows.
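
As a minimal sketch, you can count and drop duplicates with pandas; the entity and timestamp column names are placeholders for your own key columns:

# Count rows that are exact copies of another row.
n_duplicates = df.duplicated().sum()
print(f"{n_duplicates} duplicated rows out of {len(df)}")

# Count rows that share the columns that are supposed to make an entry unique.
n_key_duplicates = df.duplicated(subset=["entity", "timestamp"]).sum()
print(f"{n_key_duplicates} rows share a key with another row")

# Drop exact duplicates, keeping the first occurrence.
df = df.drop_duplicates()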

Look for variables that are flagged as highly correlated with other variables. High correlation means that one variable may have exactly (or almost) the same values as another, in which case it provides little additional information to a machine learning model. It also means that keeping only one of the two correlated variables avoids the cost of storing both.
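
A common sketch for spotting such pairs is to look at the absolute correlation matrix; the 0.95 threshold below is an arbitrary choice, not a rule:

import numpy as np

# Absolute correlation between numerical variables.
corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Variables that are highly correlated with at least one other variable.
highly_correlated = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(highly_correlated)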

Look at each variable in turn and view its details.

Look at the distribution of values. Are they uniformly distributed, normally distributed, binomially distributed, etc.?
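
Outside of the profiling report, a quick way to eyeball a distribution is the following sketch, where some_column is a placeholder name and the histogram assumes matplotlib is installed:

# Summary statistics and a histogram of a numerical column.
print(df["some_column"].describe())
df["some_column"].hist(bins=50)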

If there are only two possible values for a variable, are those values approximately equally represented, or is one value dominant over the other? Were you to try to predict this variable, you would have to deal with class imbalance.
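
A minimal sketch for checking the balance of such a variable, assuming a placeholder column named target:

# Relative frequency of each class; e.g. 0.95 vs 0.05 indicates a strong imbalance.
print(df["target"].value_counts(normalize=True))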

Do the values of the variables look sensible to you? Are some variables composed of multiple pieces of information, such as a value and the unit used for the measurement? You will generally prefer composite values to be split into separate variables, as they will be easier to process with machine learning models.
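
For example, a composite "value + unit" column such as "12.5 kg" can be split with a sketch like the following, where the column name and format are assumptions:

# Split "12.5 kg" into a numerical value and a unit.
parts = df["weight"].str.extract(r"(?P<weight_value>[\d.]+)\s*(?P<weight_unit>\w+)")
df["weight_value"] = parts["weight_value"].astype(float)
df["weight_unit"] = parts["weight_unit"]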

When looking at the distribution of a numerical variable, are there outliers (values that are either a lot smaller or a lot larger than the rest)? It is sometimes important to ask those who provided you with the data whether they can explain those outliers. In general you will want to exclude outliers during training, as they may skew your model toward them, resulting in less than ideal results for all the other data points.
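
A minimal sketch for flagging outliers is the interquartile range (IQR) rule; the 1.5 multiplier is the usual convention and some_column is a placeholder:

# Flag values far outside the interquartile range.
q1 = df["some_column"].quantile(0.25)
q3 = df["some_column"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["some_column"] < lower) | (df["some_column"] > upper)]
print(f"{len(outliers)} potential outliers to discuss with the data provider")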

The quality of a dataset is inversely proportional to the number of operations you need to apply to it to make it a clean dataset. That is to say, if you don't need to do anything to the data provided to you, then it is a good dataset.

There are no tests on the project I joined. How do I get started?

When joining a new project without tests, here is the progression of value you can provide by adding tests:

  1. the application works and doesn't crash
  2. the application works and supports a few input cases
  3. the application works and supports a variety of input cases
  4. the application works and is robust to most input cases

Start by finding the main entrypoint of the program and call it in a test. Your test doesn't have to do much, other than ensuring that you can start the program and possibly exit it. Your goal should not be to assert anything yet, but to exercise the code. Create a few tests that do little more than start and terminate the program. Once you've covered a few use cases, you can use those tests to ensure that the application can start, do a few things, then terminate without crashing.
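
Here is a minimal smoke test sketch using pytest, assuming the program exposes a main() function in an app module; the names and signature are placeholders for your own entrypoint:

from app import main

def test_program_starts_and_exits():
    # No assertions yet: the goal is only to exercise the code path from
    # startup to termination without crashing.
    main([])  # assuming main accepts a list of CLI arguments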

Then start unit testing the parts of the code that are critical. To determine what is critical, you'll have to dig into the code. The tests you initially wrote will already give you a sense of the "critical" pieces, simply because they are executed whenever the program starts and stops, two things you always want to keep working.

When writing unit tests, always aim to write a test case that covers the happy path first. You want to demonstrate first and foremost that the functionality a class or function is supposed to provide is actually there. Then you want to test its robustness and its ability to handle a variety of input cases. In a large codebase, start by covering most of the code with happy path tests before you dig into the special cases.
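
As an illustration, here is a sketch of that ordering with pytest; parse_price is a hypothetical function used only for the example, not something from an actual codebase:

import pytest

from app.pricing import parse_price  # hypothetical module and function

def test_parse_price_happy_path():
    # Demonstrate the functionality the function is supposed to provide.
    assert parse_price("12.50 USD") == 12.5

def test_parse_price_rejects_garbage():
    # Only once the happy path is covered, dig into the special cases.
    with pytest.raises(ValueError):
        parse_price("not a price")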

04 Mar 2020

List outdated packages using poetry


I use poetry as my Python package manager and I'd like to know which of the packages I depend on are currently outdated.

An easy way to get this list is to run poetry show --outdated. This will return a list of all the outdated packages, with their currently installed version, the latest available version, and a description of each package.

There are in my opinion three missing features here:

  • having the command respect the semantic versioning constraints and only show the latest version allowed by those constraints
  • having a flag to switch between showing the latest version available regardless of semantic versioning constraints and the latest version allowed by them
  • having a flag to list only the packages that are direct dependencies (listed in the pyproject.toml file)

03 Mar 2020

Mypy and implicit namespaces


I use mypy but it doesn't seem to scan all my files. Why?

You might be using implicit namespaces in your code (see PEP 420). Support for implicit namespaces in mypy is rather flaky as of 2020-03-03.

One solution for the moment is to add __init__.py files to make all your namespaces explicit.

Another solution is to replace your calls to mypy some-path with mypy $(find some-path -name "*.py").

I want to quickly take notes in the same file throughout the day using VS Code. How do I do that?

My approach has been to use my VS Code extension Run Me, which lets me bind a keyboard shortcut to one of the commands I created. In my particular case, in any VS Code window I can press CTRL+Numpad 2 and it will open a file under the following path: buffer/YYYY/MM/DD.md, where YYYY/MM/DD is replaced with the current year/month/day. In this file I record all my notes for the day.

I use the following snippet, which I trigger by typing dt and then pressing TAB. This replaces the dt string with a string of the form YYYY-MM-DD HH:MM:SS, i.e. the current year-month-day hour:minute:second.

{
    "Datetime": {
        "scope": "",
        "prefix": "dt",
        "body": [
            "$CURRENT_YEAR-$CURRENT_MONTH-$CURRENT_DATE $CURRENT_HOUR:$CURRENT_MINUTE:$CURRENT_SECOND"
        ],
        "description": "Date time"
    }
}

I also use the Script Commands extension to do something slightly more complicated: create strings of the form 2020-03-02 21:19:05 [nid://952], where nid://952 represents a unique note id (nid). The generated number is unique and is tracked by storing the last generated number in a text file that is read and written on each call to this command. A cheaper approach would have been to simply use the timestamp as the unique note id. One downside of that approach is that you don't know how many notes you've recorded so far, other than by searching your notes and counting the number of unique instances.
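
The idea behind the id generation can be sketched in a few lines of Python; this is a hypothetical standalone illustration, not the actual extension code:

from datetime import datetime
from pathlib import Path

COUNTER_FILE = Path("nid_counter.txt")  # assumed location of the last generated id

def next_note_header() -> str:
    # Read the last id, increment it, persist it, and build the note header.
    last = int(COUNTER_FILE.read_text()) if COUNTER_FILE.exists() else 0
    nid = last + 1
    COUNTER_FILE.write_text(str(nid))
    return f"{datetime.now():%Y-%m-%d %H:%M:%S} [nid://{nid}]"

print(next_note_header())  # e.g. 2020-03-02 21:19:05 [nid://952]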