Ramblings of a Coder's Mind

MLOps: Building a healthy data platform

2021-08-02T00:00:00+05:30

Spoiler: MLOps is to ML Platforms what DevOps is to most tech products. If you think this means MLOps is automating your deployments, this article is for you.

What is DevOps and how is it so much bigger than automating deployments?

You know that a term you coined has made it mainstream when people use it regularly in conversations and rarely understand what you meant.

— Martin Fowler (paraphrased from an in-person conversation)

Rouan summarises DevOps culture well in his post on Martin’s bliki. It is easy for developers to get disinterested with operational concerns. “It works on my machine” used to be a common phrase between developers in yesteryears. Some operations folks can also be less concerned by development challenges. Increased collaboration can help build a bridge in the gap between Developers and Operations team members and thus make your product better.

This increased collaboration has made observed requirements like system and resource utilisation monitoring, (centralised) logging, automated and repeatable deployments, no slow-flake servers etc. key parts of our products. Each of these improve the quality of life of your product either by directly benefiting the end user or making the system more maintainable for Developers and Operations users thus reducing the time to fix issues for end user issues. Developers and Operations folks are also first class users of your system. Their happiness (ease of debugging issues, deploying etc.) is a key part of your product’s success. It allows them to spend more time improving your product for paying end users.

What is MLOps?

MLOps is a culture that increases collaboration between folks building ML models (developers, data scientists etc.) and people who monitor these models and ensure everything is working as intended (operations). The observed requirements in your system will have some overlaps with what we have already talked about like system and resource monitoring, (centralised) logging, automated and repeatable deployments, automated creation of repeatable (non-snowflake) infrastructure etc. It will also include a few Data Platform specific observed requirements such as model and data versioning, data lineage, monitoring effectiveness of your model over an extended period of time, monitoring data drift etc.

Some tools/techniques to build a robust data platform

The need of every data platform is slightly different based on the challenges you are solving and the scale at which you operate. One of the platforms I’ve been working on produces 2TB of data every week. It didn’t take too much time for data storage costs to be the number 1 line item on our bill and we invested some time in optimising our storage and retention strategy. Other teammates have lowered data volumes and focus on reducing the cycle time for model creation. Your mileage may vary.

Based on our experience building data platforms over the past few years, here are a few tools we have used and things we have watched out for.

Data Storage

Choose a storage mechanism that provides cheap and reliable access to your data while meeting all legal requirements for your dataset. If you are in a heavily regulated environment (finance, medicine etc.), you might not be able to use the cloud for customer data. The techniques still remain similar. Partition your data based on access requirements and retention times. Archive data when you do not need it. Use features like push down predicates to efficiently read your data.

We recently wrote about data storage, versioning and partitioning which goes into great depth into this topic.

Job Scheduler/Workflow orchestrator

Your data pipelines will get complex over a period of time. Much like infrastructure as code, we would like our data pipelines in code. Apache Airflow is one of the tools that allows us to do this fairly easily. Sayan Biswas wrote about our airflow usage in 2019. Over the last few years, we have made dozens of improvements to the way we use Airflow. In a subsequent post in this series, we will talk through these improvements.

Monitoring and managing data processing costs

We spawn EMR clusters on demand and terminate them when jobs complete. A cluster runs only 1 spark job (and a few extra tasks for cleanups and reporting). If a job fails due to resource constraints, this helps isolate if another hungry job consumed too many resources before a scaling policy kicked in.

Each EMR cluster has an orchestrator node (AWS and Hadoop call them “master nodes”) and a group of core nodes (Hadoop calls them “worker nodes”). We request for on demand nodes for orchestrators and reserve the instances to reduce cost. We bid for spot instances for cores using a dynamic pricing strategy that is dependent on the current price. We have considered building a system that automatically switches instance types based on availability, price and stability in AWS but failures in spot bids are currently rare enough that it does not justify the cost of developing this feature.

We also monitor the resource utilisation of our spark jobs using Ganglia on AWS EMR. This tells us our CPU, memory, disk and network utilisation for our clusters. Since the information on Ganglia is lost when clusters are terminated, we run an EMR step to export a snapshot of Ganglia before the cluster terminates. This in conjunction with persisted spark history server data on AWS allows us to tune underperforming spark jobs. In a subsequent post, we will go into details of how to monitor your jobs effectively and tune them.

Monitoring the status of data pipeline jobs

Airflow creates EMR clusters and monitors each of the jobs. If a job fails, Airflow notifies us on a specific slack channel with links to the Airflow logs and AWS cluster.

Complex spark applications produce hundreds of megabytes of logs. These logs are distributed across the cluster and will be lost when the cluster is shut down. AWS EMR has an option to automatically copy the logs to S3 with a 2 minute delay.

We have tried using CloudWatch to index and analyse our spark logs but it was far too expensive. We also tried using a self hosted ELK stack but the cost of scaling it up for the volume of logs sent was too high. Dumping it on S3 and analysing it offline gave us the best cost to performance ratio.

To help reduce the time to fix an issue, when an issue is detected, the EMR cluster analyses its logs from YARN and publishes an extract onto slack as an attachment. Any further detailed analysis can be done on the logs in S3.

Monitoring data quality and data drift

Every time we write code, we run tests to ensure the code is safe to be deployed. Why don’t we do the same thing with data every time we access it?

When you first look at the data and build the model, you ensure the quality of the data used for training the model meets acceptable standards for your solution. Data Quality is measured by looking at the qualitative and quantitative pieces of your dataset. Over a period of time, these qualitative and quantitative attributes might drift causing adverse effects on your model. Thus, it is important to monitor your data quality and data drift. Data drift might be large enough that your model does not produce the right results any more or might be small enough to introduce a bias in your results. Monitoring these characteristics is key to producing accurate insights for your business.

Tools like Great Expectations and Deequ will ensure that your data is sound structurally and volumetrically. Deequ also has operators to look at rate of change of data which is a better expectation than having static thresholds on large volumes of data.

For example, given an employee salary database where the salary is nullable, a check to ensure no more than 100 employees out of the 1000 you currently have data for have no reported salary is bound to fail when the data volume increases significantly. If this check was to ensure no more than 10% of employees have no reported salary will work as the data scales as long as it scales evenly. Moving to a check that looks at rate of change of ratio of users not reporting a salary will be more robust. If the number changes significantly (up or down), it might mean that it’s time to tune your model since the source data is drifting away from when it was trained.

There are more complex examples on how we watch for data drift that will have to wait for a dedicated post.

The MLOps mindset

When our end users feel pain, we add new features to make their experience better. The same should be true for developers/operations experience (DevEx/OpsEx).

When it takes us longer to debug a problem or understand why a model did what it did, we improve our tooling and observability into our system. When it ran slower or was more expensive, we improved our observability to investigate inefficiencies quicker.

This has allowed us to grow our data platform 10x in terms of features and data volumes while reducing the time taken to produce insights for our end users by 98.75%, the cost to do so by 35% and not to mention a significant improvement in developer and customer experience.

Thanks to Jayant, Priyank, Anay and Trishna for reviewing drafts and providing early feedback. As always, Niki’s artwork wizardry is key!

Data storage patterns, versioning and partitions

2021-05-09T00:00:00+05:30

When you have large volumes of data, storing it logically helps users discover information and makes understanding the information easier. In this post, we talk about some of the techniques we use to do so in our application.

In this post, we are going to use the terminology of AWS S3 buckets to store information. The same techniques can be applied on other cloud, non cloud providers and bare metal servers. Most setups will include a high bandwidth low latency network attached storage with proximity to the processing cluster or disks on HDFS if the entire platform uses HDFS. Your mileage may vary based on your team’s setup and use case. We are also going to talk about techniques which have allowed us to efficiently process this information using Apache Spark as our processing engine. Similar techniques are available for other data processing engines.

Managing storage on disk

When you have large volumes of data, we have found it useful to separate data that comes in from the upstream providers (if any) from any insights we process and produce. This allows us to segregate access (different parts have different PII classifications) and apply different retention policies.

We would separate each of these datasets so it’s clear where each came from. When setting up the location to store your data, refer to local laws (like GDPR) for details on data residency requirements.

Provider buckets

Providers tend to make their own directories to send us data. This allows them to have access over how long they want to retain data or if they need to modify information. Data is rarely modified but when it is, a heads up is given to re-process information.

If this was an event driven system, we would have different event types suggesting that the data from an earlier date was modified. Since the volume of data is large and the batch nature of data transfer on our platform, verbal/written communication is preferred by our data providers which allows us to re-trigger our data pipelines for the affected days.

Landing bucket

Most data platforms either procure data or produce it internally. The usual mechanism is for a provider to write data into its own bucket and give its consumers (our platform) access. We copy the data into a landing bucket. This data is a full replica of what the provider gives us without any processing. Keeping data we received from the provider separate from data we process and insights we derive allows us to

Ensure that we don’t accidentally share raw data with others (we are contractually obligated not to share source data)
Apply different access policies to raw data when it contains any PII
Preserve an untouched copy of the source if we ever have to re-process the data (providers delete data from their bucket within a month or so)

Core bucket

The data in the landing bucket might be in a format sub optimal for processing (like CSV). The data might also be dirty. We take this opportunity to clean up the data and change the format to something more suitable for processing. For our use case, a downstream pipeline usually consumes a part of what the upstream pipeline produces. Since only a subset of the data is read downstream by a single job, using a file format that allows optimized columnar reads helped us boost performance and thus we use formats like ORC and parquet in our system. The output after this cleanup and transformation is written to the core bucket (since this data is clean input that’s optimised for further processing and thus core to the functioning of the platform).

While landing has an exact replica of what the data provider gave us, core’s raw data just transforms it to a more appropriate format (parquet/ORC for our use case) and processing applies some data cleanup strategies, adds meta-data and a few processed columns.

Derived bucket

Your data platform probably has multiple models running on top of the core data that produce multiple insights. We write the output for each of these into its own directory.

Advantages of data segregation

Separating the data makes it easier to find the data. When you have terabytes or petabytes of information across your organization with multiple teams working on this data platform, it becomes easy to lose track of the information that is already available and it can be hard to find it if they are stored in different places. Having some way to find information is helpful. For us, separating the data by whether we get it from an upstream system, we produce it or we send it out to a downstream system helps teams find information easily.
Different rules apply to different datasets. You might be obligated to delete data from raw information you have purchased under certain conditions (like when they have PII). Rules for retaining derived data are different if it does not contain any PII.
Most platforms allow archiving of data. Separating the dataset makes it easier to archive different datasets. (we’ll talk about other aspects of archiving during data partitioning)

Data partitioning

Partitioning is a technique that allows your processing engine (like Spark) to read data more efficiently thus making the program more efficient. The most optimal way to partition data is based on the way it is read, written and/or processed. Since most data is written once and read multiple times, optimising a dataset for reads makes sense.

We create a core bucket for each region we operate in (based on data residency laws of the area). For example, since the EU data cannot leave the EU, we create a derived-bucket in one of the regions in the EU. Under this bucket, we separate the data based on the country, the model that’s producing the data, a version of the data (based on its schema) and the date partition based on which the data was created.

Reading data from a path like derived-bucket/country=uk/model=alpha/version=1.0 will give you a data set with columns year, month and day. This is useful when you are looking for data across different dates. When filtering the data based on a certain month, frameworks like spark allow the use of push down predicates making reads more efficient.

Data versioning

We change the version of the data every time there is a breaking change. Our versioning strategy is similar to the one talked about in the book for Database Refactoring with a few changes for scale. The book talks about many types of refactoring and the column rename is a common and interesting use case.

Since the data volume is comparatively low in databases (megabytes to gigabytes), migrating everything to the latest schema is (comparatively) inexpensive. It is important to make sure the application is usable at all points and that there is no point at which the application is not usable.

Versioning on large data sets

When the data volume is high (think terabytes to petabytes), running migrations like this is a very expensive process in terms of the time and resources taken. Also, the application downtime during the migration is large or there’s 2 copies of the dataset created (which makes storage more expensive).

Non breaking schema changes

Let’s say you have a dataset that maps the real names to superhero names that you have written to model=superhero-identities/year=2021/month=05/day=01.

+--------------+-----------------+
|  real_name   | superhero_name  |
+--------------+-----------------+
| Tony Stark   | Iron Man        |
| Steve Rogers | Captain America |
+--------------+-----------------+

The next day, if you would like to add their home location, you can write the following data set to the directory day=02.

+------------------+----------------+--------------------------+
|    real_name     | superhero_name |      home_location       |
+------------------+----------------+--------------------------+
| Bruce Banner     | Hulk           | Dayton, Ohio             |
| Natasha Romanoff | Black Widow    | Stalingrad, Soviet Union |
+------------------+----------------+--------------------------+

Soon after, you realize that storing the real name is too risky. The data you have already published was public knowledge but moving forward, you would like to stop publishing real names. Thus on day=03, you remove the real_name column.

+----------------+---------------------------+
| superhero_name |       home_location       |
+----------------+---------------------------+
| Spider-Man     | Queens, New York          |
| Ant-Man        | San Francisco, California |
+----------------+---------------------------+

When you read derived-bucket/country=uk/model=superhero-identities/ using spark, the framework will read the first schema and use it to read the entire dataset. As a result, you do not see the new home_location column.

scala> spark.read.
  parquet("model=superhero-identities").
  show()
+----------------+---------------+----+-----+---+
|       real_name| superhero_name|year|month|day|
+----------------+---------------+----+-----+---+
|Natasha Romanoff|    Black Widow|2021|    5|  2|
|    Bruce Banner|           Hulk|2021|    5|  2|
|            null|        Ant-Man|2021|    5|  3|
|            null|     Spider-Man|2021|    5|  3|
|    Steve Rogers|Captain America|2021|    5|  1|
|      Tony Stark|       Iron Man|2021|    5|  1|
+----------------+---------------+----+-----+---+

Asking Spark to merge the schema for you shows all columns (with missing values shown as null)

scala> spark.read.option("mergeSchema", "true").
  parquet("model=superhero-identities").
  show()
+----------------+---------------+--------------------+----+-----+---+
|       real_name| superhero_name|       home_location|year|month|day|
+----------------+---------------+--------------------+----+-----+---+
|Natasha Romanoff|    Black Widow|Stalingrad, Sovie...|2021|    5|  2|
|    Bruce Banner|           Hulk|        Dayton, Ohio|2021|    5|  2|
|            null|        Ant-Man|San Francisco, Ca...|2021|    5|  3|
|            null|     Spider-Man|    Queens, New York|2021|    5|  3|
|    Steve Rogers|Captain America|                null|2021|    5|  1|
|      Tony Stark|       Iron Man|                null|2021|    5|  1|
+----------------+---------------+--------------------+----+-----+---+

As your model’s schema evolves, using features like merge schema allows you to read the available data across various partitions and then process it. While we have showcased spark’s abilities to merge schemas for parquet files, such capabilities are also available with other file formats.

Breaking changes or parallel runs

Sometimes, you evolve and improve your model. It is useful to do parallel runs and compare the result to verify that it is indeed better before the business switches to use the newer version.

In such cases we bump up the version of the solution. Let’s assume job alpha v1.0.36 writes to the directory derived-bucket/country=uk/model=alpha/version=1.0. When we have a newer version of the model (that either has a very different schema or has to be run in parallel), we bump the version of the job (and the location it writes to) to 2.0 making the job alpha v2.0.0 and it’s output directory derived-bucket/country=uk/model=alpha/version=2.0.

If this change was made and deployed on 1st of Feb and this job runs daily, the latest date partition under model=alpha/version=1.0 will be year=2020/month=01/day=31. From the 1st of Feb, all data will be written to the model=alpha/version=2.0 directory. If the data in version 2.0 is not sufficient for the business on 1st Feb, we either run backfill jobs to get more data under this partition or we run both version 1 and 2 until version 2’s data is ready to be used by the business.

The version on disk represents the version of the schema and can be matched up with the versioning of the artifact when using Semantic Versioning.

Advantages

Each version partition on disk has the same schema (making reads easier)
Downstream systems can choose when to migrate from one version to another
A new version can be tested out without affecting the existing data pipeline chain

Summary

Applications, system architecture and your data always evolve. Your decisions in how you store and access your data affect your system’s ability to evolve. Using techniques like versioning and partitioning helps your system continue to evolve with minimal overhead cost. Thus, we recommend integrating these techniques into your product at its inception so the team has a strong foundation to build upon.

Thanks to Sanjoy, Anay Sathish, Jayant and Priyank for their draft reviews and early feedback. Thanks to Niki for using her artwork wizardry skills.

Version controlled configuration and secrets management for Terraform

2019-08-26T00:00:00+05:30

Terraform is a tool to build your infrastructure as code. We’ve been having a few challenges while trying to figure out how to how to manage configuration and secrets when integrating terraform with our CD pipeline.

Life before version control

Before we can do that, it’s important to understand build process before we began on this journey.

Our build model for this project was branch based. Each environment maps to a branch (main -> dev, uat -> uat and production -> production). All other (feature) branches only ran the plan stage against the dev environment.

As you can notice, the configurations, secrets and keys are all maintained on the build agent. This means, every developer wanting to run plan and test their changes needs to replicate the terraform_variables directory. Any mistakes in doing so masks actual issues that your pipeline might face leading to delayed feedback.

Next, let’s look at what our codebase looked like

terraform
├── module-1
│   ├── backend.tf
│   ├── data.tf
│   ├── resources.tf
│   ├── provider.tf
│   └── variables.tf
├── module-2
│   ├── backend.tf
│   ├── data.tf
│   ├── resources.tf
│   ├── provider.tf
│   └── variables.tf
└── scripts
    └── provision
        ├── apply.sh
        ├── init.sh
        └── plan.sh

The provisioning scripts help us consistently run different stages across modules. Each module is an independent area of our infrastructure (such as core networking, HTTP services etc.)

Each of the provisioning scripts accepted a WORKSPACE_NAME (branch for execution that maps to the environment terraform is running for) and MODULE_NAME (module being executed).

init.sh ran the terraform init stage of the pipeline downloading the necessary plugins and initializing the backend

#!/bin/bash set -e cd $MODULE_NAME echo "init default.tfstate" terraform init -backend-config="key=default.tfstate" echo "select or create new workspace $WORKSPACE_NAME" terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME echo "init $MODULE_NAME/terraform.tfstate" terraform init -backend-config="key=$MODULE_NAME/terraform.tfstate" -force-copy -reconfigure

plan.sh ran the terraform plan stage allowing users to review their changes before applying them.

#!/bin/bash set -e cd $MODULE_NAME echo "select or create new workspace $WORKSPACE_NAME" terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME echo "plan with var file ~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars" terraform plan -var-file=~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars -out=$MODULE_NAME.tfplan -input=false

apply.sh applied the changes onto an environment. Developers do not run this command from local to ensure consistency on the environment

#!/bin/bash set -e cd $MODULE_NAME echo "select or create new workspace $WORKSPACE_NAME" terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME echo "apply with var file ~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars" terraform apply -var-file=~/terraform_variables/$WORKSPACE_NAME/$MODULE_NAME.tfvars -auto-approve

Version controlling configuration

We moved the variables into the config directory by making a directory for every branch for each of the 3 environments we had.

terraform
├── config
│   ├── main
│   │   ├── module-1.tfvars
│   │   └── module-2.tfvars
│   ├── production
│   │   ├── module-1.tfvars
│   │   └── module-2.tfvars
│   ├── uat
│   │   ├── module-1.tfvars
│   │   └── module-2.tfvars
├── module-1
│   └── ...
├── module-2
|   └── ...
└── scripts
    ├── provision
    │   ├── apply.sh
    │   ├── functions.sh
    │   ├── init.sh
    │   └── plan.sh
    └── test_variable_names.sh

According to terraform’s documentation, you can export a variable that your terraform codes need with a prefix of TF_VAR.

functions.sh provides convenience functions to read the configuration and secrets.

#!/bin/bash function fetch_variables() { workspace_name=$1 module_name=$2 echo $(cat ../config/$workspace_name/$module_name.tfvars | sed '/^$/D' | sed 's/.*/TF_VAR_& /' | tr -d '\n') }

fetch_variables read the tfvars file, removes empty lines (that were added for readability), prefixed the name with TF_VAR and joined all entries into a single line. The string this method returns can be used as a prefix to the terraform command while running plan and apply making them environment variables.

Updated plan and apply scripts are placed in the secrets management section for brevity

Testing configuration files

The only limitation is that none of these variables can have a hyphen in the name because of shell variable naming rules. As with any potential mistake, a test providing feedback helps protect you from run time failures. test_variable_names.sh does this check for us.

#!/bin/bash function parse_and_test_properties_entries() { prop=$1 if [[ "$prop" == "" || $prop = \#* ]]; then continue fi key="$(cut -d'=' -f1 <<<"$prop")" if [[ $key =~ "-" ]]; then echo "$filename contains \"$key\" which contains a hyphen" exit 1 fi } function parse_file() { filename=$1 OLD_IFS=$IFS props=$(cat $filename) IFS=$'\n' for prop in ${props[@]}; do parse_and_test_properties_entries $prop done IFS=$OLD_IFS } base_dir="config" for sub_dir in $(find $base_dir -mindepth 1 -maxdepth 1 -type d); do workspace_name=${sub_dir#"$base_dir/"} for input_file in config/$workspace_name/*.tfvars; do parse_file $input_file done echo "All variables are named correctly in config/$workspace_name" done

Version controlling secrets

Secrets like passwords can be version controlled in a similar way though they require encryption to keep them safe. We’re using OpenSSL with a symmetric key to encrypt our secrets. Each secret is put into a tfsecrets file (internally a property file just like tfvars files for configuration). When encrypted, the file will have an extension of .tfsecrets.enc. When the plan or apply stages are executed, files are decrypted in memory (and not on disk, for security reasons) and used the same way.

functions.sh gets a new addition to support reading all secrets

function fetch_secrets() { workspace_name=$1 module_name=$2 secret_key_for_workspace=$(eval "echo \$SECRET_KEY_$workspace_name") echo $(openssl enc -aes-256-cbc -d -in ../config/$workspace_name/$module_name.tfsecrets.enc -pass pass:$secret_key_for_workspace | sed '/^$/D' | sed 's/.*/TF_VAR_& /' | tr -d '\n') }

The astute amongst you probably noticed that we’re using OpenSSL v1.0.2s because v1.1.x changes the syntax on encryption/decryption of files. Also, you might have noticed the use of environment variables like SECRET_KEY_main, SECRET_KEY_uat and SECRET_KEY_production as the encryption keys. These values are stored on our CI server (in our case GitLab) which makes these values available to our CI agent during execution.

For local development, we have scripts to encrypt and decrypt configuration files either one at a time or in bulk per environment. It’s worth noting that re-encryption of the same file will show up on your git diff since the encrypted file’s metadata changes. Only check in encrypted files when their contents have changed (helping you debug future issues)

encrypt.sh takes SECRET_KEY as an environment variable for making local usage easier.

#!/bin/bash set -e if [ -z "$SECRET_KEY" ]; then echo "Set a SECRET_KEY for \"$WORKSPACE_NAME\" encryption" exit 1 fi function encrypt_file() { input_file=$1 target_file="$input_file.enc" echo "Encrypting $input_file to $target_file" openssl enc -aes-256-cbc -salt -in $input_file -out $target_file -pass pass:$SECRET_KEY rm -f $input_file } if [ -z $1 ]; then echo "Usage:" echo " ./scripts/encrypt.sh " echo " ./scripts/encrypt.sh all" exit 2 elif [ "$1" == "all" ]; then for input_file in config/$WORKSPACE_NAME/*.tfsecrets; do encrypt_file $input_file done else encrypt_file $1 fi

decrypt.sh also takes the same SECRET_KEY as an environment variable for making local usage easier.

#!/bin/bash set -e if [ -z "$SECRET_KEY" ]; then echo "Set a SECRET_KEY for \"$WORKSPACE_NAME\" decryption" exit 1 fi function decrypt_file() { input_file=$1 target_file=${input_file%".enc"} echo "Decrypting $input_file to $target_file" openssl enc -aes-256-cbc -d -in $input_file -out $target_file -pass pass:$SECRET_KEY rm -f $input_file } if [ -z $1 ]; then echo "Usage:" echo " ./scripts/decrypt.sh " echo " ./scripts/decrypt.sh all" exit 2 elif [ "$1" == "all" ]; then for input_file in config/$WORKSPACE_NAME/*.tfsecrets.enc do decrypt_file $input_file done else decrypt_file $1 fi

Testing secret files

If all files for an environment aren’t checked with the same key, you’ll face a runtime error. Since files can be encrypted individually, you must test if all files have been encrypted correctly. This test is also useful when you’re rotating the SECRET_KEY for an environment.

test_encryption.sh needs SECRET_KEY_ values set so it can be executed locally.

#!/bin/bash base_dir="config" for sub_dir in $(find $base_dir -mindepth 1 -maxdepth 1 -type d); do workspace_name=${sub_dir#"$base_dir/"} password_var_name="\$SECRET_KEY_$workspace_name" secret_key_for_workspace=$(eval "echo $password_var_name") if [ -z "$secret_key_for_workspace" ]; then echo "Variable $password_var_name has not been set. Unable to test" exit 1 fi for input_file in config/$workspace_name/*.tfsecrets.enc do openssl enc -aes-256-cbc -d -in $input_file -pass pass:$secret_key_for_workspace &> /dev/null if [ $? != 0 ]; then echo "Unable to decrypt $input_file with $password_var_name" exit 1 fi done echo "Successfully decrypted all secrets in config/$workspace_name" done

End result

Our final project structure contains the following files

terraform
├── config
│   ├── main
│   │   ├── module-1.tfvars
│   │   ├── module-1.tfsecrets.enc
│   │   ├── module-2.tfvars
│   │   └── module-2.tfsecrets.enc
│   ├── production
│   │   ├── module-1.tfvars
│   │   ├── module-1.tfsecrets.enc
│   │   ├── module-2.tfvars
│   │   └── module-2.tfsecrets.enc
│   ├── uat
│   │   ├── module-1.tfvars
│   │   ├── module-1.tfsecrets.enc
│   │   ├── module-2.tfvars
│   │   └── module-2.tfsecrets.enc
├── module-1
│   └── ...
├── module-2
|   └── ...
└── scripts
    ├── decrypt.sh
    ├── encrypt.sh
    ├── provision
    │   ├── apply.sh
    │   ├── functions.sh
    │   ├── init.sh
    │   └── plan.sh
    ├── test_encryption.sh
    └── test_variable_names.sh

plan.sh uses functions.sh to load configuration and secrets

#!/bin/bash set -e source $(dirname "$0")/functions.sh cd $MODULE_NAME echo "select or create new workspace $WORKSPACE_NAME" terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME echo "plan with var file config/$WORKSPACE_NAME/$MODULE_NAME.tfvars" config=$(fetch_variables $WORKSPACE_NAME $MODULE_NAME) secrets=$(fetch_secrets $WORKSPACE_NAME $MODULE_NAME) eval "$secrets $config terraform plan -out=$MODULE_NAME.tfplan -input=false"

apply.sh uses functions.sh in a similar fashion

#!/bin/bash set -e source $(dirname "$0")/functions.sh cd $MODULE_NAME echo "select or create new workspace $WORKSPACE_NAME" terraform workspace select $WORKSPACE_NAME || terraform workspace new $WORKSPACE_NAME echo "apply with var file config/$WORKSPACE_NAME/$MODULE_NAME.tfvars" config=$(fetch_variables $WORKSPACE_NAME $MODULE_NAME) secrets=$(fetch_secrets $WORKSPACE_NAME $MODULE_NAME) eval "$secrets $config terraform apply -auto-approve"

And thus, our terraform project requires no data from the CI agent and can be executed perfectly from any box as long as it has the latest code checked out and the correct version of terraform.

Managing multiple signatures for git repositories

2019-06-11T00:00:00+05:30

Github explains pretty well how to sign commits. You can make it automatic by globally setting commit.gpgsign = true by using

git config --global commit.gpgsign true

What if you have different signatures for your personal ID and your work ID?

First, you create multiple signatures. It is important that the email address in the signature is the same as the one for the user who has authored the commit. Run gpg -K --keyid-format SHORT to see all available keys. The output looks like

/Users/karun/.gnupg/pubring.kbx
-------------------------------
sec   rsa4096/11111111 2019-06-11 [SC]
      1234567890123456789012345678901211111111
uid         [ultimate] Karun Japhet 
ssb   rsa4096/22222222 2019-06-11 [E]

sec   rsa4096/33333333 2019-06-11 [SC]
      0987654321098765432109876543210933333333
uid         [ultimate] Karun Japhet 
ssb   rsa4096/44444444 2019-06-11 [E]

Fetch the ID for each of the signatures. The ID for the personal signature is 11111111 and that for the work signature is 33333333. To assign a signature to the repo, execute git config user.signingkey .

Personally, I have aliases for personal and work signatures and every time I checkout a project, run the alias once.

alias signpersonal= "git config user.signingkey 11111111 && git config user.email \"karun@personal.com\""
alias signwork    = "git config user.signingkey 33333333 && git config user.email \"karun@work.com\""

Run git log --show-signature to verify if a commit used the right signature. Happy commit-signing.

Fixing broken Social logins on your browser

2019-04-16T00:00:00+05:30

Privacy vs Convienience is a constant battle. Personally, I prefer dialing up my privacy up to 11 to avoid being tracked. Every once in a while, social logins are important because it’s the only way to use a service. If this service is an internal company login that only uses social login via the company’s Google ID, you don’t have much of a chance.

If your login just won’t work, try changing the following settings

Privacy Badger

Allow calls to accounts.google.com & apis.google.com

Firefox settings

Allow Third party trackers in Firefox through Settings > Privacy & Security > Cookies > Third-party trackers

The untold guide to troubleshoot Phillips Hue and Google Assistant Integration

2018-11-10T00:00:00+05:30

Recently, I moved into a new home and was setting up my Phillips Hue lights with my Google Home assistants around my house for convenience. I noticed a couple of hick-ups since the last time I did this.

Logging into Phillips Hue app

If the app does not ask you to hit the button on your bridge, your account already has a bridge associated with it.

You can see what bridge is associated with your MeetHue account on the bridges page.

Remove any older bridges you might have on your account and try logging into the Phillips Hue app again. Once complete, you should be able to link your Google Home assistant to your Phillips Hue app.

Other House Keeping for security reasons

You can cleanup how many apps have access to your account and how many other users have access to your bridge. If you see anything that your don’t recognize, remove it. After all, these apps and Hue account users can control the lights in your house. If you don’t know them, remove their access.

In my case, the only users on my bridge are the family members in my house and the only apps I have are the Phillips Hue Android app (for mobile access remotely) and Google (assistant integration).

Efficient logback logging on JVM

2018-11-01T00:00:00+05:30

Efficient logging that doesn’t bring your application down is simple to setup but is often overlooked. Here are some quick tips on how to achieve exactly that

Async Logging

Most applications these days should have a single (console) appender. This can be linked up with your log aggregator of choice. If your application cannot aggregate logs off the console stream, file is your next best alternative.

Wrap each of your appenders with an async appender and add the async appender to your root logger.

Every call to the logger creates a log event. In synchronous logging, that log event was processed and writes were made to all appender streams before the application continued. Since most stream writes involve I/O, this meant the application would wait for I/O before continuining thereby slowing it down. With async logging, the event gets pushed to a log level specific in memory queue. These events are processed and consumed by the appenders asynchronously. Since the application can continue after a log event has been published to the queue, asynchronous logging works quicker (as long as I/O is the long pole in the tent that is publishing log messages)

Here’s a sample configuration:


  
    myapp.log
    
      %logger{35} - %msg%n
    
  

  
    
    1024
    false

Every queue has a configurable depth. The depth of the queue is based on how much memory you have and expected ratio in rates of messages coming in through the application and the messages being published through the I/O bottleneck.

If you hit max queue depth on either the WARN or ERROR queues, further statements for those levels become synchronous.

If you hit more than 80% of the max queue depth on any other level, the system will start dropping log statements (due to discardingThreshold=20 by default and neverBlock=true). Therefore, under high load, you can lose INFO, DEBUG and TRACE log messages. This behaviour is acceptable for most cases except specific critical statements (like audit logs). For such cases, you can add asynchronous appenders that are allowed to block.

The percentage of depth after which messages are dropped is configurable. You can make info/debug logs synchronous at 100% too if needed by changing the neverBlock=false (which is the default behaviour).

All of this information is available on logback’s documentation.

Writing log statements

Async logs only work more efficiently because the production of events is synchronous (and hopefully a quick task) and the processing of events (which requires IO) is a slow task.

However if production of log messages takes long time, async logging will not make things better. When you’re printing a large amount of data or if the creation of the log message is an expensive operation, use the following kind of log statement

// style 1: java string interpolation; inefficient and hard to read :P
logger.info("Large object value was " + largeObject1 + " and long operation printed " + largeObject2.longOperation())
// style 2: scala string interpolation; inefficient but easy to read
logger.info(s"Large object value was $largeObject1 and long operation printed ${largeObject2.longOperation()}")
// style 3: logback based string interpolation; efficient but inconvenient to read
logger.info("Large object value was {} and long operation printed {}", largeObject1, largeObject2.longOperation())

While the scala interpolation (style 2) is the easiest to read, we should only do it when the objects being printed are small (small-ish strings or primitives).

Rule of thumb:

For quick statements, use style 2.
For large statements, use style 3 (sacrifices readability for efficiency)
Never use style 1 :P

Using LazyLogging as opposed to creating loggers yourself

Use lazy logging. It internally uses loggers that wraps yours code (during compile time) with if checks to not process log statements if the specific log level doesn’t need to be printed (using macros). Worried about performance due to extra if conditions? You shouldn’t. Modern processors contain black magic called branch prediction that reduce the effect of statements such as this to be effectively nothing.

IMO, every scala project should use lazy logging. It’s light on dependencies and has a nice implementation that makes your logging more efficient run faster for fractionally slower compilation.

The Science in the Art of the Showcase (for distributed teams)

2018-07-03T00:00:00+05:30

Showcases are a key part of our agile ceremonies. We showcase our work to our stakeholders for feedback at the end of every iteration. And as with every presentation, I believe there is a Science in the Art of the Showcase (for distributed teams).

On one of our recent teams, our showcases had challenges. Each of these challenges is a piece of feedback. We added structure to our showcases by running it like a theatre recording TV shows.

This isn’t revolutionary stuff. This is an attempt at defining a structure that should make it easier to organize showcases based off a check-list.

Role

The Master of Ceremonies

The MC is the face of the operations. They are responsible to dessiminate information and keep the crowd engaged. This means that the person should have context about what goes on and how to handle the different failures around client infra (skype issues, VDI issues etc).

Best practices for MCs

Running commentary: Always keep speaking. Is there an issue? Keep the show rolling. Be transparent. Your support (folks below) will keep feeding you information when necessary.

The Stagehand

This is the magician that controls the lighting on stage. This person actually runs the slides and the demos ensuring everything is smooth

Best practices for Stagehands

Practice your demos repeatedly till it’s muscle memory
Ensure the demo windows are already prepared with data entry. Avoid copy pasting unless it cannot be avoided.
Ensure the content on the screen is visible on the media the stakeholders consume it on. If the stakeholders get together in a room and look at a screen projected on the screen or a big TV, please ensure that the font size is readable.

The Conductor

This is the person who runs the show. This person is responsible to stay on the demo co-ordination chat and spot issues and handle them before they become a thing. This person is also responsible to give instant feedback to people running the showcase when needed.

Best practices for The Conductor

Ensure you have an eye on the demo co-ordination chat. Delegate replying to another window if required
Ensure the MC is providing running commentary
Step in only if it is absolutely required
Keep an eye out for schedule

The Theatre Tech

The person who watches the logs and statuses for the services involved in the demo. If there is anything going wrong, talk to the conductor immediately.

Best practices for the Theatre Techs

Have appropriate windows ready to perform the tasks you might need to in a hurry (bouncing services)
Have windows showing instance health
Have log window opens

The Timekeeper

This person is in the room (with clients) and is responsible to keep time. If the discussion goes off, it is your responsibility to cut the discussion off and setup a followup discussion.

If the clients are in multiple locations, have a timekeeper per location. Might be the conductor when available in a location.

The Scribes

Multiple people taking notes and sharing them after the demo. They are responsible to pick up body queues from the people around them and take notes on follow up discussions that we need to have.

Best practices for The Scribe

Be active on a demo co-ordination chat channel and provide instantaneous feedback from different locations. This helps the conductor get more information and is key to their effectiveness.

The Playwright

This person is primarily responsible for the content of the showcase.

The content of a showcase should be like a TV show. A major milestone/deliverable is like a season and should have an overarching story (aka narrative arc). Each showcase is like an episode and should have a subsection of the narrative arc.

The way Presentation Patterns book describes narrative arcs in presentations is true about showcases

~~Presentations~~ Showcases are a form of storytelling; don’t ignore a few thousand years of oratory history. A Narrative Arc is a common trope; organizing your ~~presentation~~ showcase in a similar way leverages your audience’s lifetime of story listening experience.

Execution

Know the people on your team. Identify which team members can do what roles. Invest in and groom people for roles based on their interest, it’s a growth opportunity.

Prep work for the venue

If your client site requires you to book rooms, do so as far out in advance as possible.
If your team is distributed, make sure the room has a good VC with a computer you can use to run the demo. Ensure your laptop can easily connect to the VC equipment in the room.
Know your venue and plan your seating. Presenters closer to the screen. Stakeholders in clean view of the screen and the presenters.

Prep on the day

Sign up for roles based on your skills
Do multiple dry runs
Show up to the room 20 minutes before the start of the meeting. Set it up.

Upgrade everything in brew

2017-10-19T00:00:00+05:30

Homebrew is a the missing package manager for Mac OS. Brew cask extends Homebrew and brings its elegance, simplicity, and speed to Mac OS applications and large binaries alike.

If you’re using these tools and would like to upgrade all of the applications you have, run the following command.

brew update && brew upgrade && (brew cask outdated | cut -f 1 -d " " | xargs brew cask reinstall) && brew cleanup

Breaking it down

Update brew with information from the latest taps: brew update
Upgrade apps in brew: brew upgrade
Update brew cask apps: brew cask outdated | cut -f 1 -d " " | xargs brew cask reinstall
Find outdated cask apps: brew cask outdated
Cut out the app names: cut -f 1 -d " "
Upgrade brew cask apps: xargs brew cask reinstall
Remove installers for brew apps (to release disk space): brew cleanup

Note: brew cask cleanup is now deprecated.

Lombok usage in large enterprises

2017-08-13T00:00:00+05:30

Verbosity of Java

Java is a verbose language. No one disputes it.

Despite the clunky nature of the language syntax, it still is the language of choice in most enterprises. If you work in the services industry or are a technology consultant, chances are that you have to work with Java on a regular basis.

If you’re also a fan of functional programming language and have worked any modern programming language, you’ll recognize that Java’s syntax hinders your productivity because of the large amounts of boilerplate the language will generate. While newer JVM based lanaguages like Kotlin solve these problems in different ways, the open source community created Project Lombok to provide similar syntactic sugar in the world’s most popular enterprise programming language.

What is Lombok?

Lombok is a Java dependency that uses Java annotations to generate byte code straight into the class files during the compilation phase there by allowing the boilerplate code from your codebase to be significantly reduced.

An example from the Software Engineering Trends post from Jan 2010 shows

@Data(staticConstructor="of")
public class Company {
  private final Person founder;
  private String name;
  private List employees;
}

would generate the same code as

public class Company {
  private final Person founder;
  private String name;
  private List employees;

  private Company(final Person founder) {
    this.founder = founder;
  }

  public static Company of(final Person founder) {
    return new Company(founder);
  }

  public Person getFounder() {
    return founder;
  }

  public String getName() {
    return name;
  }

  public void setName(final String name) {
    this.name = name;
  }

  public List getEmployees() {
    return employees;
  }

  public void setEmployees(final List employees) {
    this.employees = employees;
  }

  @java.lang.Override
  public boolean equals(final java.lang.Object o) {
    if (o == this) return true;
    if (o == null) return false;
    if (o.getClass() != this.getClass()) return false;
    final Company other = (Company)o;
    if (this.founder == null ? other.founder != null : !this.founder.equals(other.founder)) return false;
    if (this.name == null ? other.name != null : !this.name.equals(other.name)) return false;
    if (this.employees == null ? other.employees != null : !this.employees.equals(other.employees)) return false;
    return true;
  }

  @java.lang.Override
  public int hashCode() {
    final int PRIME = 31;
    int result = 1;
    result = result * PRIME + (this.founder == null ? 0 : this.founder.hashCode());
    result = result * PRIME + (this.name == null ? 0 : this.name.hashCode());
    result = result * PRIME + (this.employees == null ? 0 : this.employees.hashCode());
    return result;
  }

  @java.lang.Override
  public java.lang.String toString() {
      return "Company(founder=" + founder + ", name=" + name + ", employees=" + employees + ")";
  }
}

The good

You shouldn’t have to write code that can be generated automatically. Of course, modern IDEs will do this for you with a few clicks of the keyboard

We’re trying to optimize more than a few clicks though. Have a look at the equals method below:

@java.lang.Override
  public boolean equals(final java.lang.Object o) {
    if (o == this) return true;
    if (o == null) return false;
    if (o.getClass() != this.getClass()) return false;
    final Company other = (Company)o;
    if (this.founder == null ? other.founder != null : !this.founder.equals(other.founder)) return false;
    if (this.name == null ? other.name != null : !this.name.equals(other.name)) return false;
    if (this.employees == null ? other.employees != null : !this.employees.equals(other.employees)) return false;
    return true;
  }

Is this a standard equals method (one where every field in the class is checked for equality)? Did we skip a field? Did we do a non standard check on one of the fields? Unless you go through the method line by line, there is no way to know.

Generating code saves you the hassle of checking. If there is an annotation, you know the what the implementation will be (assuming you know how the framework works). If there’s code, chances are that it’s a non-standard implementation (or someone made a mistake).

The bad

If you wish to check the generated code, you need an IDE that decompiles byte code or a tool that does the same.

If something’s wonky, debugging the issue might not be straight forward.

The downright ugly

Modern IDEs like IntelliJ are built for refactoring. One of the most common refactoring options is the option to Change Signature. It’s an extremely useful option that allows you to reorder method (or constructor) parameters and the IDE takes care of the appropriate changes throughout the codebase.

The order of the constructor parameters in a lombok-fied class is the order in which the parameters are declared. Changing this order changes the constructor signature.

For a class with different parameter types, this is not a problem. Refactoring the following class

@Data
public class Company {
  private final Person founder;
  private String name;
  private List employees;
}

to the following signature

@Data
public class Company {
  private final Person founder;
  private List employees;
  private String name;
}

is not a problem. The usage of the constructor will fail to compile and provide feedback.

If you have primitive types in your lombok-fied class, you have a problem. Refactoring the following class

@Data
public class Person {
  private final String employeeId;
  private final String firstName;
  private final String lastName;
}

to the following signature

@Data
public class Person {
  private final String firstName;
  private final String lastName;
  private final String employeeId;
}

will provide no feedback. The code will compile and set employeeIds to firstNames, firstNames to lastNames and lastNames to employeeIds. If you don’t have tests on the behavior of the Person class, you won’t notice this issue until it’s too late. Hopefully, you don’t have tests for a data container with no behavior.

Where is Lombok appropriate?

Do you have a project where you have a strict set of contributors?
- because you’ll have to walk them through the rules of appropriate usage of lombok
Do your contributors understand Lombok well and how it works?
- because you will have unexpected defects due to refactoring if they don’t
Do your contributors understand how to properly unit test and do they understand the automation test pyramid?
- appropriate high level testing could catch functional defects. you don’t want unit tests checking constructors and getters
Do you have strict code quality control?
- without a way to check for inappropriate usage of lombok, defects can very easily creep in
Is the team willing to invest time and effort into training new team members about Lombok and potential downsides?
- your learnings have to be passed to every future member of the codebase
Do most of your models use value objects and avoid primitives?
- because reordering non-primitive fields will lead to compile exceptions providing feedback

If your team can answer yes to all of the above, you should use Lombok.

I must admit, most large teams can’t answer yes to all of the questions. Have you considered using Kotlin instead? :)

Hosting blogs for 1¢ a month

2016-12-05T02:04:56+05:30

If you’re a dev and you self host your blog, I’d love to hear why. Why do you self host blogs? For most simple blogs in this day and age, migration to a static site like Jekyll or Octopress is pretty easy. I did this a while back. This can be followed up by asking Amazon S3 to host your website. You can even get cloudflare to front the SSL for free.

Why? S3 is free for the first year. Even post that period, my bills have been <$0.02/month which is a 99.951% reduction in cost.

Continuous delivery

Snap CI is will integrate with your publically accessible GitHub repositories for free and trigger builds on commit! Connect to your github repository and get it to compile your markdown into html. Deploying to S3 is a piece of cake. Congratulations on having continuous delivery for your blog!

Cloudflare provides the free SSL and Amazon S3 provides the near free hosting. A few cents a month to host your entire website is a good deal!

I’ve been on S3 for a year now and I couldn’t be happier!

** Goodbye Servers, Welcome S3 **

Commonly made mistakes in Unit Testing

2016-02-28T09:24:50+05:30

What is Unit Testing?

Unit testing is all about focusing on one element of the software at a time. This unit is called the often called the ‘System Under Test’ (refer Mocks Aren’t Stubbs). In order to test only one unit at a time, all other units need to not be test at the same time. As obvious as that sounds, it’s easy to miss.

Classes do not exist independent of one another. They usually have dependencies. Such dependencies are called the ‘Collaborators’. There are multiple ways to manage collaborators that have been talked about by Martin in his article.

Pre-requisites to the post before going forward

Before we go on, please ensure you’ve read through Mocks Aren’t Stubbs by Martin Fowler. This post assumes that you’ve gone through the article before continuing on to commonly made mistakes in Unit Testing

Mocks vs Actual Implementations

Consider a board game where the Board class runs the game with the help of it’s collaborators Player and Dice.

public class Board {
  private final List players;
  private final Dice dice;
  private int currentPlayerId;

  public Board(final Dice dice, final Player... players) {
    this.currentPlayerId = 0;
    this.dice = dice;
    this.players = Arrays.asList(players);
  }

  public void playMove() {
    players.get(currentPlayerId).move(dice.roll());
    currentPlayerId = evaluateNextPlayerId();
  }

  public Player getCurrentPlayer() {
    return players.get(currentPlayerId);
  }

  private int evaluateNextPlayerId() {
    return currentPlayerId + 1 < players.size() ? currentPlayerId + 1 : 0;
  }
}

class Dice {
  public int roll() {
    return (int) Math.round(Math.random() * 6);
  }
}

class Player {
  @lombok.Getter
  private int position;

  public int move(final int moveCount) {
    position += moveCount;
  }
}

If we consider the Board to be the System Under Test, the most tempting trap to fall into is start testing the Board directly.

public class BoardTest {
  @Test
  public void shouldMovePlayerForCorrectPlayer() {
    final Dice dice = new Dice();
    final Player player1 = new Player();
    final Player player2 = new Player();
    final Board board = new Board(dice, player1, player2);

    board.playMove();

    assertThat(player1.getPosition(), greaterThan(0));
    assertThat(player2.getPosition(), is(0));
    assertThat(board.getCurrentPlayer(), is(player2));
  }
}

This is not the greatest example but it does attempt to show you the coupling between the different components. Player1’s current position isn’t predictable since it’s coupling with dice. The dependency also means that if the dice has defects, the board can’t be tested appropriately.

By swapping out player and dice instances with mocks, we have the ability to only test the board independent of potential issues with the dependencies.

The above test can be refactored to look like

public class BoardTest {
  @Test
  public void shouldMovePlayerForCorrectPlayer() {
    final Dice dice = mock(Dice.class);
    final Player player1 = mock(Player.class);
    final Player player2 = mock(Player.class);
    final Board board = new Board(dice, player1, player2);

    when(dice.roll()).thenReturn(3);
    when(player1.move(3)).thenReturn(3);

    board.playMove();

    verify(player1).move(3);
    verify(player2, never()).move(anyInt());
    assertThat(board.getCurrentPlayer(), is(player2));
  }
}

The test now allows you to check if player1 was moved 3 places since the response provided by the dice is in your control. Mocks also allow you to test that player2 was not called.

This becomes even more important in an example where the response from the mock affects the system under test. Controlling the mock allows you to control predict the end state of the system under test with the assumption that your mock setup is correct. These assumptions can be validated with the spec for the individual mocks. The unit test for dice mock can confirm that the dice only returns values between 1 and 6 (inclusive).

Testing inside the boundaries

Every functionality should be tested within it’s boundaries. Let’s take the Dice class as an example and talk about what this means.

Typically a dice produces values between 1 and 6.

It’s corresponding test has to prove that rolling a dice always results in a value between 1-6.

@Test
public void shouldRollValidNumberOnDice() {
  assertThat(new Dice().roll(), isOneOf(1, 2, 3, 4, 5, 6));
}

This test proves that the value is inside the range but does not prove that it will always be in that range. Since the implementation contains a PRNG, the end result cannot be predicted.

Most readers wouldn’t have noticed the defect in the implementation.

class Dice {
  public int roll() {
    return (int) Math.round(Math.random() * 6);
  }
}

The implementation can produce values 0-6. The fact that your test passed proves that it is a flaky unit test. The test has a 1/7 chance of failing. The fact that it didn’t fail when you ran it is not surprising :)

DI, your new best friend

The anti-pattern to take away from the previous example is that the Dice class relies on a library and that the library is contained in the class. The fact that it can’t be injected means that you can’t control it.

Dependency Injection is your friend!

class Dice {
  private final Random random;
  private final int numberOfFaces;

  Dice(final Random random, final int numberOfFaces) {
    this.random = random;
    this.numberOfFaces = numberOfFaces;
  }

  public int roll() {
    return random.nextInt(numberOfFaces - 1) + 1;
  }
}

Now, your test can work with a mocked Random instance for more accurate results.

@Test
public void shouldRollValidNumberOnDice() {
  final Random random = mock(Random.class);
  when(random.nextInt(5)).thenReturn(0, 1, 2, 3, 4, 5);

  final Dice dice = new Dice(random, 6);

  assertThat(dice.roll(), is(1));
  assertThat(dice.roll(), is(2));
  assertThat(dice.roll(), is(3));
  assertThat(dice.roll(), is(4));
  assertThat(dice.roll(), is(5));
  assertThat(dice.roll(), is(6));
}

We’re currently making 2 assumptions on the collaborator.

random.nextInt is always called with parameter 5
random.nextInt(5) always returns values between 0 and 5

The first assumption is in part validated by the mocking library. If Dice called by any other parameter, the results wouldn’t be what we want. But if you want to be extra sure, you could always make the test fail using an argument captor

@Test
public void shouldRollValidNumberOnDice() {
  final Random random = mock(Random.class);
  final ArgumentCaptor argumentCaptor = ArgumentCaptor.forClass(Integer.class);
  when(random.nextInt(argumentCaptor.capture())).thenReturn(0, 1, 2, 3, 4, 5);

  final Dice dice = new Dice(random, 6);

  assertThat(dice.roll(), is(1));
  assertThat(dice.roll(), is(2));
  assertThat(dice.roll(), is(3));
  assertThat(dice.roll(), is(4));
  assertThat(dice.roll(), is(5));
  assertThat(dice.roll(), is(6));
  assertThat(argumentCaptor.getAllValues(), is(asList(5, 5, 5, 5, 5, 5)));
}

The second assumption should not be validated by you. If you look at the documentation for random.nextInt() you will notice

/**
 * Returns a pseudorandom, uniformly distributed {@code int} value
 * between 0 (inclusive) and the specified value (exclusive), drawn from
 * this random number generator's sequence.
 ...
 */
public int nextInt(int bound) {...}

It is the responsibility of the library (java.util.Random in this case) to test itself.

How do I know Random will not misbehave? I don’t. The Dice component could be integration tested. It is an absolute necessity if you deem the component to be an untrusted collaborator. If this was a database connection or a REST call, you’d want that. For a Java util or a well tested open source library, you could be forgiven for not writing an integration test.

In this case, I won’t be writing one for sure! ☺

Movement to Cybershark

2015-11-28T11:09:58+05:30

I had been procrastinating movement to a dev-ops style Chef deployment for my servers to ease it’s management because of the age old “If it ain’t broke..”. Well, upgrades on Bumblebee were getting more expensive so I finally decided to take the leap. I introduce the trion cookbook that I’m using to setup my servers.

I also realized that Kimsufi came up with cheaper servers (now as low as €4.99). Without automation to setup my servers, the thought of migration and building another snowflake server scares me purely because of it’s frailty. No more.

I’ve now officially moved to my new server, Cybershark!

I have made it a point not to have PHP on this server. It’s time to stop my dependency on it. Good bye Wordpress. Welcome Octopress. Good bye Soccer Scraper. Google Now will do fine :)

This means some of the internal dependencies I had (such as PHP mailers) needs to now be moved to another technology. Hello Node.JS. I think we might be good friends :)

Modern Operating Systems Phoning Home

2015-10-03T19:52:00+05:30

It seriously irks me when general public operating systems build in default features that send data to their servers without clearly indicating so. Both Microsoft (with Windows 10) and Apple (with Yosemite) have done so. Disabling these features doesn’t take long so here’s what you need to do.

Windows 10

There’s a well written reddit page by user hazehk which lists all the setting changes required. Takes around 10-15 minutes to run through them.

If you’re a bit lazier, you could use tools to do it for you if you aren’t paranoid about the tools themselves :)

OS X Yosemite

Fix my Mac OS X talks about the simple changes required to alleviate your pain. You could either follow their steps or use their python script. Your choice!

It has only 2 steps so it should take you less than a minute to do both. I actually went as far as to disable Spotlight all together (remove spotlight shortcut; disable spotlight indexing causes problems with Alfred because it needs spotlight’s cache for application data) and move to Alfred (not the least of which was motivated by the privacy issues..)

Reducing Maven Package times due to resource copying

2015-07-11T12:36:33+05:30

I once worked on a web application with a 250+MB code base. This consisted of 200,000 images. For every development cycle, you had to compile and deploy the code on the server which was painful to say the least. The size wasn’t the problem as much as the number of resources. The code took less than 20 seconds to compile.

We figured that compilations overwriting class files were OK but having to edit any resource just took too long. In such cases, you can use maven’s process-resources plugin to ask maven to only copy the new resources to your target directory.

This is significantly faster. Package times went down from 6 minutes to 18 seconds. Of course, a SSD would have helped but looking at the difference, it’s well worth the effort :)

mvn process-resources

Go try it out!

Maven Compilation in Ram Drive

2015-07-11T12:16:42+05:30

If you’re working on huge maven projects and have a slow disk, compilation, packaging and install times can sore quite high. If getting faster hard disks isn’t possible, why not try moving the compilation to a ram drive?

A code base which used to take 22 minutes to compile went down to 3 minutes. This just goes to show the effect that disk IO bottlenecks can have on your system.

What is a RAM Drive?

RAM is significantly faster than spinning platter HDDs. Until the commercialisation of static state devices, it wasn’t even a competition. It’s meant to be like that.

If you need higher writes speeds to disk which don’t need to be persisted over a long period of time, why not use some of your spare RAM as a hard disk? This can be achieved using software.

How do I get a RAM drive on my machine?

RAM drive softwares are available for all major operating systems.

SoftPerfect’s RAM Disk is a good free tool for non-commercial processes for Windows. Linux has tempfs and ramfs.

Compiling your Maven project on to a RAM drive

Most big projects follow a multi-module POM structure. You can move your target directory in 2 ways.

Change build directory

You can change the output directory to the desired path ensuring all compiled files go to your RAM drive. Just make sure you qualify your path well using the group and artifact IDs to ensure different projects don’t overwrite each other’s compiled code.


    
      /mnt/ramdrive/compile/${project.groupId}/${project.artifactId}

The issue with this approach is that all users on the team are now bound by

Your compilation directory path (terrible idea)
Using a ram drive (not everyone might need it)

The first issue can be fixed by creating a property for the base path. This parameter can be passed as a compile time parameter with a default set in the POM


    
      ${target.baseDir}/${project.groupId}/${project.artifactId}

The second issue can’t be fixed using this approach. This can be fixed with the second approach

Compilation profiles

Maven allows you to have profiles for applying configs in specific scenarios.


    
      ramDrive
      
        
          ${target.baseDir}/${project.groupId}/${project.artifactId}

Now you can compile your code with a profile and it will use the specified directory for compilation

mvn clean install -PramDrive -Dtarget.baseDir=/mnt/ramdrive/compile

Congratulations, your code is compiled on RAM drive. Is it still not fast enough? Is the installation process slow? Well, you could move your M2 directory to the ram drive too

Local Maven repository to RAM drive

You can change your maven configuration to ask it to move your local maven repository home (which is the place where all the artifacts you build and/or download are stored).

Update your settings.xml with the location of your local repository

/mnt/ramdrive/mvn-repo

Forced HTTPs on your website with CloudFlare

2015-02-01T21:04:49+05:30

I’m a supporter of the HTTPS everywhere movement by the EFF. They advocate users use (all) websites with HTTPS for extra security. This means everyone should probably fork out a few dollars to get their own certificates. Unless you’re buying a domain at NameCheap (in which case they tend to throw in a SSL certificate for the first year), you’d have to shell out $8-$12 to get one.

Side note, I recommend every user have HTTPS everywhere installed on every browser.

Though it’s not perfect, you can get a SSL for your website for free.

Why do I need SSL?

The communication between you and a website looks something like this without SSL.

It’s prone to a Man in the middle attack. This could be done by your neighbour tapping into the line, your ISP or someone half way across the world listening in on the server you request data from.

What would you have me do?

Move your domain to CloudFlare

CloudFlare is a CDN that provides you with additional security by analysing requests using crowd sourced data over hundreds and thousands of websites. Add your domain to CloudFlare and ask your DNS provider to send all requests to the CloudFlare servers. Once the setup is complete, your data will be sent to your server via CloudFlare. It ensures your IP is not exposed outside there by providing it with some amount of Denial of Service attack prevention since your IP is not directly exposed and it is CloudFlare’s job to handle in coming Denial of Service attacks (once set up).

Users must note that the move to CloudFlare means you can’t SSH to your machine anymore because they do not forward the port. You can create a subdomain that doesn’t route traffic through CloudFlare. That makes it easier to SSH/FTP into the box but provides a way for attackers to access your machine bypassing CloudFlare’s security. Alternately you can add a vhost entry on your machine ensuring you can connect with ease but this won’t help you if you would like to connect from some other machine. You could just remember the IP if you’re a pro :P The choice is yours!

Enable CloudFlare’s Universal SSL

CloudFlare provides a universal SSL for all domains routed through their service. As long as you trust CloudFlare’s SSL keys not to be leaked (if they do, bigger businesses would have a problem way before your website does).

This feature is available for free to all users. I recommend using at least the Flexible SSL which requires no further setup. Turn it on and you can hit your website using https. For now, choose Flexible SSL.

Force HTTPS

Ideally you want HTTPS to be an option your users choose. You’d want to at least make it a default. If people don’t know, they wouldn’t move to using it. I see no reason why browsers these days won’t be showing your content correctly if they are in HTTPS so I recommend having it on as a default.

If you run a blog or website where there are no “users” but readers, it’s hard to let them choose their HTTPS settings. If you haven’t noticed already, you are accessing this website using HTTPS. Welcome to the dark side.

No need to write a htaccess or similar config on your server. Go to CloudFlare and create a HTTP rule for your base domain asking for a redirect to HTTPS.

Are we there yet?

Not quite.

The communication between you and CloudFlare is secure. The communication between CloudFlare and your server isn’t.

Adding SSL for CloudFlare-Server communication

You can create a self signed certificate on your server. For any Unix based server you can use OpenSSL to generate it. The internet is filled with tutorials galore. Most of them will have information on how to install with your web server of choice (for example apache or nginx).

Once you’re done, go back to the CloudFlare settings for your domain and change the option to Full SSL.

Potential Flaws?

You’re pretty secure if you ask me. The vulnerable points in the system are the CloudFlare server and your server. Ideally, CloudFlare protects the identity of the server ensuring all requests go through their servers which are meant to protect you from attacks. Lets assume CloudFlare’s private key isn’t compromised. This means the only way to decrypt requests to your server is by getting the private key off your machine and listening into requests to your server.

Another potential area of concern is the fact that the CloudFlare to server communication is prone to a Man in the middle attack since CloudFlare doesn’t verify the signature from from the server for free accounts. For this, you would need to move to a CloudFlare Pro account which at $20/month (as of the writing of this post) is significantly more expensive than purchasing your own SSL certificate end to end. Of course, we don’t use CloudFlare just for the free SSL but the CDN and crowd sourced security it provides. This is hands down the most serious vulnerability in the system I can spot.

You’re better off than where you started and all this for free! Want more? Shell out money to a CA to get your very own certificate :)

*[CA]: Certificate Authority

Signs of a troubled startup

2015-01-31T22:36:08+05:30

India has been developing the startup culture quite seriously over the past half a decade. In this time I’ve seen dozens of ventures pop up and quite a few of them fizzle out. If you’ve gone through a startup bootcamp or an equivalent program, you’ve probably heard the stats. Failure rate of startups is pretty high. I admit, I am concerned. Not about the high failure rate. It’s about why quite a few Indian startups are failing these days.

I am concerned because they seem to not be getting their basics right. Most startup schools and accelerators tell you things that usually go wrong. Repeatedly. Yet there are other things they don’t tell you. Things they’d think are common sense.

Sometimes it is hard to spot these from the outside but if you do see these signs, you better have taken your ERT training seriously because the building is on fire and you better be prepared to evacuate people.

Acting as the ‘next big thing’

Everyone wants to be the “next big thing”. Everyone thinks they are the next Microsoft, Google, Facebook or Flipkart. Once in a while they are right. Usually they are wrong. Wanting to be great isn’t a problem. Walking around thinking you are is. If you’re Tony Stark and can pull it off, that’s good for you. Most can’t.

Wanting to be too big too quickly

Some people want to do good work every day. It makes them happy. Success, for them, is a by product of doing great work.

Others want to be successful. The journey, to them, is meaningless and the destination is all that counts. Such people do not want to take the time to hone their craft. Usually such folks are impatient and won’t give the time it takes for the business to grow.

Not shipping your product till “it’s perfect”

Take it from everyone who’s ever built anything. It’s not perfect. It’s never going to be. Unless you’re quite loose with your definition of the word, perfection is near impossible to attain. Delaying your launch is going to be your company’s death sentence.

Expecting to build your product on your client’s cost

This is the exact opposite of the previous point.

If you have a Minimum Viable Product and your client want’s to pay you for it, it’s great! You can use the money to support the cost of additional features. This is the recommended way to go to market.

Some entrepreneurs however would like to build the product entirely on the client’s cost. Believe it or not, this happens more often than one would think. How is this possible you ask?

Ethics or lack there of

Ethics are important. It is important how organisations treat_ _people.

When it comes to their clients, I believe they deserve to know what they are actually buying. This means being truthful about what features your product has. Claiming your product does things which you’ve not even started developing is a clear no-no.

When it comes to their own people, clear communication is important. Respecting them is key.

Early hires are ‘employees’ not stakeholders

Hiring is key. You can have the greatest product the world has ever seen but if you hire incorrectly, you’ll have problems. This is a well known fact.

What most people don’t think of is that if you hire people and treat them as employees and not as stakeholders, they are going to treat work as a “job” and not as “their baby”. If I were you, I’d like my team to have that sense of ownership. It’s not easy but it’s not impossible either. Responsibility is to be given (stop trying to control everything; stop being a control freak) and responsibility is to be accepted. It’s a two way street. But it starts with your show of faith. It starts with you wanting to let people into the bubble you’ve built around yourself.

Not taking (internal) feedback

Not taking client feedback is almost always a silly move to play. Most people don’t do that.

What a few small organizations will miss out on is not listening to feedback from people they’ve hired build their product. First off, your team needs to be convinced about the product they are building. If they have feedback about the way things are being done, you should take it seriously. You’ve brought them in because you trust them and what they can do. So when they have things to say you better listen to do even if it doesn’t agree with your view of the world.

Surrounding yourself with Yes Men

If the team always agree with one another, I’d consider it a orange flag. If they ostracise anyone who disagrees with them, that’s a definite red flag. Any team who discourages healthy discussions instead of understanding different view points and coming to a consensus is one to stay away from.

If you’re in discussion with such a team (as an insider or an outsider) and your points are something they disagree with but aren’t able to refute, you might see them not conceding and admitting you’re correct. Which brings us to the next point.

Drinking your own Kool Aid

If you’re building something, you keep bottles of Kool Aid handy to give out to your users/clients. You must not dip into your own stash and take a sip or worse, gulp down entire bottles daily. Keeping your horizon indicator in check is key and large quantities of Kool Aid don’t help. They provide a self affirming view of the world which could be your startup’s death sentence. When it comes down to it, you should be ready to pivot for which you shouldn’t have blinders on.

Never learning from your failures or admitting them

Fail Fast. Learn. Move on. If you can’t do that, you’re doing it wrong.

I realise it sounds cliche but people like showing their world their perfectness. If you haven’t already, I recommend watching The Myth of the Genius Programmer from Google I/O 2009. It’s a real eye opener. Especially the part where they mention why failed spike branches shouldn’t be deleted (from 35:51 to 36:35).

Never considering they might not make it big but be good and that’s enough

For most people these days making it big is important. This could either be in terms of huge revenues or a 100+M dollar buy out. For most, having a steady NOI is just not good enough. It’s not good enough because it’s not what they had dreamt of. That Ferrari in the drive way and early retirement at 28.

Sounds outlandish but the Hollywood romanticisation of how businesses become worth billions has lead to people wanting to live the dream. Well son, this is reality. It’s hard work.

What’s my way out?

These are just some of the things which I believe everyone should think about. They seem obvious but heck, hindsight is always 20/20. It is quite easy to make some of these mistakes. Chances are you’ve seen someone around you exhibit these traits.

I highly recommend regular retrospectives with your team. More importantly, I recommend regular self retrospectives. Glass houses and all.

Net Neutrality: Why should I care?

2014-12-29T22:04:17+05:30

Yes, that’s one of the most often asked questions right after “What is that?”. I don’t often provide this explanation to techies because doing that is as simple as giving them a few lines of explanation and pointing them to a good website with information. The real challenge, if you ask me, is explaining this problem to the everyday Joe out there. Especially since, in a place like India you’re likely to hear “we have bigger problems than __". This to me sounds like a blanket excuse for not wanting to deal with an issue at hand and here's how I tend to answer such queries.

What is net neutrality?

Net Neutrality is principle that all data on the internet needs to be treated equally by service providers and governments. This means for example your YouTube videos should stream at the same speed your Vimeo videos*all other conditions being equal.

Would you want your internet service provider to slow down YouTube so as to make you move to one of its competitors?

This is a very brief description. At the end of the post are some resources to help you understand what Net Neutrality is if you’re new to this discussion.

Net Neutrality isn’t a 1% problem.

Do you love your access to the internet? In this day and age, I view the Internet as a medium to impart information to those who didn’t have access to resources earlier. Over the past 15 years the introduction and mass penetration of smartphones in India is probably one of the greatest things that to have happened to our country. Have you ever gone driving between states with a map in your bag, gotten lost and spent time on the side of the road getting conflicting information about where you should go? I have. It sucks. With easy access to online maps, that’s rarely a concern these days :) This isn’t the only use case I can quote. I’m sure everyone has one. Yet, we undermine the significant contribution the internet has had in our lives (and I’m not talking about an endless stream of cat photographs on Facebook :))

Would you let this happen in any other industry?

You walk over to your local vegetable market and ask for 1kg of tomatoes. The vendor says it’s Rs. 10/kg if you’re making a salad and Rs. 30/kg if you’re making a curry. Would you be fine with this? Why does your vendor care what you’re doing with it?

Does your electricity company ask you for different rates for charging cell phones/laptops vs using TVs vs using ACs? It’s electricity. A unit is charged the same way irrespective of what you do with it.

Why should ISPs charge you different rates?

Claim: ”But I heard they want to charge us more to provide us faster connections for ‘special’ data”

No. They don’t.

With Net Neutrality

Regular data - normal speed
‘Special’ data - normal speed

Without Net Neutrality when you pay them extra

Regular data - slower speed
‘Special’ data - normal speed

Without Net Neutrality when you don’t pay them extra

Regular data - slower speed
‘Special’ data - slower speed

This is, as John Oliver puts it, a classical mob shake down. Video is at the end of the post

Note: ‘Special’ Data refers to whatever they want to charge you extra for. Might be something like video streaming or VOIP.

Claim: “They want to charge extra because using the internet for causes them to lose business"

Just so we’re clear, this is something I’ve heard in the Indian context. The services in question are VOIP services (such as Skype and Viber) and messaging services (mainly WhatsApp).

Lets take the example of WhatsApp. Telecos seem to be losing out on revenue for SMS services. At 30p/SMS, SMS is the most expensive (if not one of the most) form of communication used by the masses I can think of. The extremely high cost is precisely the reason why services like WhatsApp and Hike were able to come into the market.

Not being able to compete in a sector is no reason to stifle it using your monopoly as a mobile internet provider.

Lets say all dairies in your country sold cheese at insane rates. What if you found a way to produce (arguably?) a better quality cheese product for a lower price than the one dairies sell their’s at despite having to buy milk from them because you don’t have a source. Would I be fine with them increasing milk rates so you couldn’t afford to make the cheese cheaper any more? I wouldn’t. That’s why I don’t think charging extra for online messaging or VOIP services is acceptable.

Lets just boycott that telecom provider! That’ll teach ‘em.

No. It won’t. For two reasons mainly.

1) They have far too many users

Each telecom provider has tens if not hundreds of mobile users around the country. If your aim is to make them feel the loss of revenue, you’d have to have their revenue to go down at least a few percentage points. This means making 100,000 to a million (10,000,000) users switching service. If you had that kind of support, you probably have a better chance getting the government to take notice and do something.

2) Your provider isn’t the only one violating net neutrality

Sure, if there was an provider which stood by Net Neutrality, chances are that most of us would move to use their service. Not that it would help, but it’s a matter of principle. Still, that doesn’t fix the problem at hand.

Multiple major players in the market have considered violating neutrality in some form over the past few years. Even ones who haven’t would probably do so once they see their competitors increasing their earnings.

Airtel rolled back their decision. We’re saved!

We’re not. Not to sound gloomy and all but the battle is far from won. There is affirmative action still required to ensure Internet in India remains a level playing field. We need to ensure people with vested interests do not take advantage of the system and this can only happen through awareness. There are multiple issues plaguing internet access in India not the least of which are privacy and (lawful/unlawful; a debate in itself) censorship.

What can I do? I am another person.

Sure you are. You should never underestimate the power of one person in a democracy. Learn more about this issue. Become an advocate. Make others around you aware why this is a concern. Awareness and will to stand up against this is the only way to go forward.

Mac: Camera not available

2014-12-10T01:06:22+05:30

My Macbook Pro sometimes doesn’t detect it’s web camera when I’m trying to join a video call and it’s painful to have to reboot the machine to fix it. A simpler way (especially if you have root access to your machine) is to kill VDCAssistant from the command line.

sudo killall VDCAssistant

Once you’re done, restart the application that was attempting to use your web camera :)