33 posts tagged with "sre"

· 2 min read
John Lotoski

High level summary

The SRE team continues work on CI and Cardano environment improvements. Notable recent improvements include expanding the Darwin CI cluster and providing new aarch64 builder support, and adding bare-metal Bitte cluster capability with a network overlay for high-IOPS workloads such as explorer.

Lower level summary

Bitte

  • Equinix bare metal capability was added to bitte: bitte-pull-194
  • Update bitte nixpkgs, nix version, nomad driver, equinix lifecycle, misc bug fixes: bitte-pull-201

Bitte-cells

Cardano-graphql

Cardano-node

Cardano-ops

Cardano-world

Ci-ops

  • Update legacy darwin builders and buildkite agent for ci-world network overlay and monitoring: ci-ops-pull-108

Ci-world

Cicero

  • Implement a cicero webhook backoff with exponential decay plus jitter: cicero-pull-79
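The details of the backoff in cicero-pull-79 are not given here. As a rough illustration of the general idea, the following is a minimal sketch of exponential backoff with jitter. The function name `backoff_delays` and all of its parameters are hypothetical and are not taken from Cicero.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, rng=None):
    """Yield one randomized delay (in seconds) per retry attempt.

    Hypothetical helper, not Cicero's implementation. The upper bound
    grows exponentially (base * factor**attempt) until it reaches `cap`;
    the actual delay is drawn uniformly between 0 and that bound
    ("full jitter") so that many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        yield rng.uniform(0, ceiling)


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(cap=10.0)):
        print(f"attempt {attempt}: wait up to {delay:.2f}s before retrying")
```

A caller would sleep for each yielded delay between webhook retries and give up once the generator is exhausted.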

Iohk-nix

Openziti

· 2 min read
Michael Fellinger

High level summary

The SRE team continues work on Cicero, Tullia, and Bitte, as well as providing support for cardano-world.

Lower level summary

Cicero

  • Fixed various race conditions around transformers.
  • Brought our CI up to date.
  • Migrated to the Nomad exec driver with Nix support for many actions.
  • Moved Nix builds to the Nomad clients for much better cache locality.
  • Ongoing work on vastly improving the action matching and evaluation speed.

Tullia

  • Made it easier to support cloning from a PR's fork
  • Update to latest std
  • Add workaround for cgroup issue: nomad#12877
  • github preset: add github.ci.remote and (read|get)Repository functions
  • Fix various issues around CUE handling

Bitte

  • Upgrade to NixOS 22.11
  • Prototype usage of Colmena for deploys instead of deploy-rs
  • Finalized work on Equinix Metal support
  • Prototype better secrets management with ragenix instead of sops-nix
  • Improve CI and bring it up to date

cardano-world

  • Fixed various OOM issues on preview and preprod
  • Rotated KES keys on preview and preprod
  • Optimize mainnet db-sync to cope with higher load
  • Fix an issue where PostgreSQL would fail after a reboot

bitte-world

  • Updated to NixOS 22.11

ci-world

  • Updated to NixOS 22.11
  • Added Equinix cluster
  • Improve caching of Nix builds

· 4 min read
Michael Fellinger

High level summary

The SRE team is working heavily on the Equinix Metal migration, replacing Hydra with Cicero, and a new version of Spongix.

Lower level summary

OpenZiti

  • Work is ongoing on our OpenZiti integration into Bitte in [bitte-zt].
  • CI-World deployment of Darwin CI Ziti service in [ci-world-commit-d40f4d].
  • Multiple issues filed and a lot of discussion with the OpenZiti developers; we're making pretty rapid progress thanks to them.
  • Work on integrating Equinix bare-metal machines into AWS World Bitte clusters, using a Ziti ZTNA network overlay to bridge the networking of the two environments and to extend IAM to the Equinix machines for Nomad client onboarding.
  • A Nix Flake for most of our OpenZiti dependencies including the Console, Controller, Edge Tunnel, and Router is now at [openziti-bins].
  • The flake also includes WIP NixOS modules for these components.
  • Tested the official Ziti Desktop Edge app for Darwin x86_64 with GUI; no issues seen so far.
  • Moved the console to the Traefik routing service (zac.$DOMAIN); the controller and edge router stay at zt.$DOMAIN but have registered Consul services.

Cicero & Tullia Integrations

Cicero & Tullia Features

  • Improvements to Tullia task aggregation to make [cardano-addresses] build correctly.
  • Better tullia CUE lib default for tags [tullia-commit-4df3c5d].
  • Put cache.nixos.org back in cache.iog.io's upstreams. This is now considered a public cache again, and without it some Cicero evaluations had to build huge packages.
  • Started working on a flake-parts module for Tullia.
  • Started working on cutting down Tullia task build time by putting facts in JSON files.
  • Fixed running into the kernel argument size limit by reading Tullia's DAG from a file.
  • Merged [tullia-pull-9], which fixes several issues related to error reporting and escaping.
  • Added Mac builders in Cicero on CI-World.
  • Started work on Tullia invocation caching.
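The kernel argument limit fix mentioned above relies on a general technique: instead of passing a large data structure inline on the command line, write it to a file and pass only the path. How Tullia does this is not shown here; the sketch below illustrates the pattern under that assumption. `run_with_dag_file`, `CHILD`, and the example DAG are all hypothetical.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_with_dag_file(dag, child_code):
    """Serialize `dag` to a temp file and pass only its path as an argument.

    Hypothetical helper: the child process receives a short path in argv
    instead of the whole DAG, so the command line stays far below the
    kernel's limit on argument size regardless of how large the DAG is.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(dag, f)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code, path],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    finally:
        os.unlink(path)  # clean up the temp file even if the child fails


# Hypothetical child: loads the DAG from the path in argv[1], prints its size.
CHILD = "import json, sys; print(len(json.load(open(sys.argv[1]))))"

# An example DAG: each task depends on the previous one.
dag = {f"task-{i}": ([f"task-{i - 1}"] if i else []) for i in range(50_000)}

if __name__ == "__main__":
    print(run_with_dag_file(dag, CHILD))
```

Inlining the same JSON as a single argument would be over a megabyte long, which is the kind of size that can make process creation fail with an "argument list too long" error.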

Spongix

  • A lot of progress on an SQLite-backed version of Spongix. It already supports the full HTTP binary cache protocol but still lacks comprehensive testing, some tuning, and recursive lookups.
  • First steps in the implementation of the nix-daemon ssh-ng protocol so Spongix can be used via SSH and we can get rid of basic auth.
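To give a feel for what an SQLite-backed binary cache involves, here is a minimal sketch covering only the narinfo lookup part of the HTTP binary cache protocol, where a client requests `/<hash>.narinfo` and gets the metadata back, or a 404 if the path is not cached. The table layout and the function names `open_cache`, `put_narinfo`, and `handle_request` are assumptions for illustration and are not Spongix's actual schema or API. Serving the NAR files themselves, signatures, and upstream lookups are left out.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache database. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS narinfo ("
        " hash TEXT PRIMARY KEY,"
        " body TEXT NOT NULL)"
    )
    return db


def put_narinfo(db, store_hash, body):
    """Store the narinfo text for a store path hash, replacing any old entry."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO narinfo (hash, body) VALUES (?, ?)",
            (store_hash, body),
        )


def handle_request(db, path):
    """Map a request path to (status, body) for narinfo lookups only."""
    name = path.lstrip("/")
    if not name.endswith(".narinfo"):
        return 404, ""
    store_hash = name[: -len(".narinfo")]
    row = db.execute(
        "SELECT body FROM narinfo WHERE hash = ?", (store_hash,)
    ).fetchone()
    return (200, row[0]) if row else (404, "")


if __name__ == "__main__":
    db = open_cache()
    placeholder_hash = "a" * 32  # placeholder, not a real store path hash
    put_narinfo(db, placeholder_hash, "StorePath: /nix/store/" + placeholder_hash + "-example\n")
    print(handle_request(db, "/" + placeholder_hash + ".narinfo"))
```

Keeping the index in a single SQLite file makes lookups a primary-key query; a real implementation would also need eviction, signature handling, and the upstream lookups the bullet above describes as still missing.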

Bugs

  • Discovered a Cicero bug where Nomad reschedules cause the GitHub commit status to get stuck in pending.
  • Discovered Cicero race condition bug around concurrent transactions for codependent actions.
  • Fixed tullia task order bug in [cardano-addresses]
  • Diagnosed a Cicero action not being triggered in [abcirdc]
  • Fixed meta/description of the Tullia package in [tullia-pull-7]
  • Add Vault token loop alerts in [bitte-cells-pull-40]
  • Ongoing investigation on recurring Patroni and nomad-follower issues related to token rotation.