Service Fabric Handbook

Welcome!

This post is a handbook where you will (probably) find an answer to the question “How do different things work in Azure Service Fabric?”. Almost all of the information (around 90%) comes from docs.microsoft.com, GitHub, etc.; the rest consists of personal findings made during development of CoherentSolutions.Extensions.ServiceFabric.Hosting.

Where possible, the information is backed by a reliable source (docs, issues or posts).

Interlude


Azure Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage scalable and reliable microservices and containers. SF operates on a so-called cluster. A cluster is a network-connected set of physical or virtual machines into which microservices are deployed and managed. A physical or virtual machine that is part of a cluster is called a node. The workload is represented in the form of applications, where each application consists of services, and each service is represented as one or more replicas deployed to nodes [*].

Object Model


All conceptual objects in the SF model are represented by a type and an instance. The type represents a uniquely named basic definition of something, i.e. the type can define attributes and properties that all instances of this type will share. The instance in turn represents an independent, uniquely named projection of the type initialized with a specific set of parameters.

The closest analogy to type and instance is a class and an instance of that class in an object-oriented language like C# or Java.

Author’s note

Besides type and instance, SF introduces one more concept – the replica. The replica represents a service’s code executing on a node.

NodeType & Node


NodeType is a uniquely identifiable definition of the hosting environment. This definition contains abstract information about hardware and software capabilities in the form of properties, as well as SF-specific information such as communication ports.

Azure Hosting Services

When using Azure Hosting Services, only a subset of the NodeType configuration can be modified at creation time or at runtime.

Standalone Cluster

When configuring a standalone cluster, NodeType is defined in ClusterConfig.json. Only a subset of the configuration can be updated at runtime.

"nodeTypes": [{
  "name": "NodeType0",
  "clientConnectionEndpointPort": "19000",
  "clusterConnectionEndpointPort": "19001",
  "leaseDriverEndpointPort": "19002"
  "serviceConnectionEndpointPort": "19003",
  "httpGatewayEndpointPort": "19080",
  "reverseProxyEndpointPort": "19081",
  "applicationPorts": {
    "startPort": "20575",
    "endPort": "20605"
  },
  "ephemeralPorts": {
    "startPort": "20606",
    "endPort": "20861"
  },
  "isPrimary": true
}]

Example: Configuring NodeType in ClusterConfig.json

The complete description of all properties of nodeType in ClusterConfig.json can be found in the Microsoft.ServiceFabric.json schema (the referenced version is 2018.02.01).

Node is a uniquely named instance of NodeType that represents a dedicated hosting environment where replicas are executed. It is usually represented by a physical or virtual machine (or an operating system process in the case of a local development cluster).

The relationship between NodeType and Node is shown in the picture below:

Illustration of relationships between Node and NodeType

Azure Hosting Services

When using Azure Hosting Services each NodeType is mapped to a VMSS (Virtual Machine Scale Set) [*]. Azure allows you to scale both vertically, by increasing the power of nodes, and horizontally, by adding additional nodes (up to 100 per NodeType [*]).

Standalone Cluster

When configuring a standalone cluster, nodes are either defined in ClusterConfig.json

"nodes": [{
  "nodeName": "vm0",
  "iPAddress": "localhost",
  "nodeTypeRef": "NodeType0",
  "faultDomain": "fd:/dc1/r0",
  "upgradeDomain": "UD0"
}]

Example: Configuring Node in ClusterConfig.json

… or added dynamically using PowerShell.

.\AddNode.ps1 `
  -NodeName vm0 `
  -NodeIPAddressorFQDN localhost `
  -NodeType NodeType0 `
  -FaultDomain fd:/dc1/r0 `
  -UpgradeDomain UD0 `
  -AcceptEULA

Example: Dynamically adding Node using PowerShell

ServiceType & Service


ServiceType is a uniquely identifiable, versioned definition of a service as a separate unit of application logic. The type is declared in ServiceManifest.xml, which includes: the list of Code, Config and Data packages, resources and diagnostics configuration.

The complete list can be found in the ServiceManifest.xml schema and documentation.

<ServiceManifest
  Name="ApiServicePkg"
  Version="1.0.0"
  xmlns="http://schemas.microsoft.com/2011/01/fabric"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  
  <ServiceTypes>
    <StatefulServiceType
      ServiceTypeName="ApiServiceType"
      HasPersistedState="true" />
  </ServiceTypes>

  <CodePackage
    Name="Code"
    Version="1.0.0">
    <EntryPoint>
      <ExeHost IsExternalExecutable="true">
        <Program>dotnet</Program>
        <Arguments>Api.dll</Arguments>
        <WorkingFolder>CodePackage</WorkingFolder>
      </ExeHost>
    </EntryPoint>
  </CodePackage>

  <ConfigPackage
    Name="Config"
    Version="1.0.0" />

  <DataPackage
    Name="Data"
    Version="1.0.0" />

  <Resources>
    <Endpoints>
      <Endpoint
        Name="ServiceEndpoint"
        Protocol="http"
        UriScheme="http" />
      <Endpoint Name="ReplicatorEndpoint" />
    </Endpoints>
  </Resources>
</ServiceManifest>

Example of ServiceManifest.xml

Each package defined in the ServiceManifest.xml has its own purpose [*]:

CodePackage

Contains binaries with the implementation of one or more service types’ logic. There can be multiple code packages defined in the same ServiceManifest.xml, so when the replica is instantiated all code packages are activated by hitting their entry points. It is expected, though, that code packages will register the service types declared in the ServiceManifest.xml, as sketched below.
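For illustration, here is a minimal sketch of a code package entry point registering the service type from the manifest above; the ApiService class name is an assumption standing in for the StatefulService implementation shipped in the code package.

using System.Threading;
using Microsoft.ServiceFabric.Services.Runtime;

internal static class Program
{
    private static void Main()
    {
        // Bind the "ApiServiceType" declared in ServiceManifest.xml to the
        // class that implements it (ApiService is a hypothetical name).
        ServiceRuntime.RegisterServiceAsync(
            "ApiServiceType",
            context => new ApiService(context)).GetAwaiter().GetResult();

        // Keep the code package process alive; replicas run inside it.
        Thread.Sleep(Timeout.Infinite);
    }
}

Example: Registering a service type from a code package entry point (sketch)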

DataPackage

Declares a folder, named by the @Name attribute, that contains arbitrary static data to be consumed by the process at run time.

ConfigPackage

Declares a folder, named by the @Name attribute, that contains a Settings.xml file. The settings file contains sections of user-defined, key-value pair settings that the process reads back at run time.

<Settings
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://schemas.microsoft.com/2011/01/fabric">

  <Section Name="Section">
    <Parameter Name="KeyOne" Value="ValueOne" />
    <Parameter Name="KeyTwo" Value="ValueTwo" />
  </Section>
</Settings>

Example of Settings.xml
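As a rough sketch (using the Config package and Settings.xml shown above), a replica can read these values back at run time through its activation context:

using System.Fabric;

// The activation context gives access to all packages of the service package.
ICodePackageActivationContext activationContext =
    FabricRuntime.GetActivationContext();

// "Config" is the @Name of the ConfigPackage in ServiceManifest.xml.
ConfigurationPackage configPackage =
    activationContext.GetConfigurationPackageObject("Config");

string keyOne = configPackage.Settings
    .Sections["Section"]
    .Parameters["KeyOne"]
    .Value; // "ValueOne"

// A DataPackage is exposed in a similar way, as a folder path.
string dataPath = activationContext.GetDataPackageObject("Data").Path;

Example: Reading Config and Data packages at run time (sketch)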

In short, the service type is declared in the ServiceManifest.xml but its implementation is included in one of the code packages.

Illustration of relationships between ServiceType and Code, Data and Config packages

There can be multiple active versions of the same service type in the cluster.

Author’s note

Service is a uniquely identifiable, versioned instance of a service type initialized with a custom set of parameters. The service defines all replica-related aspects: placement in the cluster (constraints, policies, packing), partitioning scheme, and replica count.

All services can be managed independently using the SF management API, e.g. using PowerShell we can modify the replica count.

Author’s note
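For illustration, here is a minimal sketch of creating a service instance from a registered service type through FabricClient; the application, service and type names are assumptions:

using System;
using System.Fabric;
using System.Fabric.Description;

var fabricClient = new FabricClient();

var description = new StatefulServiceDescription
{
    ApplicationName = new Uri("fabric:/app"),
    ServiceName = new Uri("fabric:/app/service"),
    ServiceTypeName = "ApiServiceType",
    HasPersistedState = true,
    MinReplicaSetSize = 3,
    TargetReplicaSetSize = 3,
    // Ranged partitioning: keys 0..29 split across 3 partitions.
    PartitionSchemeDescription = new UniformInt64RangePartitionSchemeDescription
    {
        LowKey = 0,
        HighKey = 29,
        PartitionCount = 3
    }
};

await fabricClient.ServiceManager.CreateServiceAsync(description);

Example: Creating a service instance using C# (sketch)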

ApplicationType & Application


ApplicationType is a uniquely identifiable, versioned definition of a collection of service types.

Illustration of relationships between ApplicationType and ServiceType

The type is declared in the ApplicationManifest.xml and includes: the list of application services (as references to ServiceManifest.xml), security and policy information, default configuration, etc.

The complete list can be found in the ApplicationManifest.xml schema and documentation.

<ApplicationManifest
  ApplicationTypeName="AppType"
  ApplicationTypeVersion="1.0.0">
 
  <Parameters>
    <Parameter
      Name="MinReplicaSetSize"
      DefaultValue="3" />
    <Parameter
      Name="PartitionCount"
      DefaultValue="1" />
    <Parameter
      Name="TargetReplicaSetSize"
      DefaultValue="3" />
  </Parameters>
 
  <ServiceManifestImport>
    <ServiceManifestRef
      ServiceManifestName="ApiServicePkg"
      ServiceManifestVersion="1.0.0" />
 
    <ConfigOverrides />
    <EnvironmentOverrides CodePackageRef="Code">
      <EnvironmentVariable
        Name="ASPNETCORE_ENVIRONMENT"
        Value="Development" />
    </EnvironmentOverrides>
  </ServiceManifestImport>
  <DefaultServices>
    <Service
      Name="ApiService"
      ServicePackageActivationMode="ExclusiveProcess">
 
      <StatefulService ServiceTypeName="ApiServiceType"
        TargetReplicaSetSize="[TargetReplicaSetSize]"
        MinReplicaSetSize="[MinReplicaSetSize]">
 
        <NamedPartition>
          <Partition Name="Left" />
          <Partition Name="Right" />
        </NamedPartition>
      </StatefulService>
    </Service>
  </DefaultServices>
</ApplicationManifest>

Example of ApplicationManifest.xml

There can be multiple active versions of the same application type in the cluster.

Author’s note

Application is a uniquely identifiable, versioned instance of an application type initialized with a custom set of parameters. The application unites services into a single management unit and allows configuring security, policies, diagnostics and service templates for quick service instantiation.

All applications can be managed independently using the SF management API, e.g. we can create or delete any service of the application without affecting the others.

Author’s note

Application & Service & Nodes


The picture below illustrates the relationships between types, instances and the cluster.

Illustration of relationships between types, instances and cluster

Nodes

There are three nodes: /left-node, /middle-node and /right-node of the same NodeType, which basically means they have the same hardware and software characteristics.

Applications & Services

There is a single application (fabric:/app-colors) with a single service (fabric:/app-colors/srv-blue) of the ApplicationType and ServiceType respectively. The application was initialized with the app-params parameters and the service was initialized with the blue-params parameters.

Replicas

Replicas of the service are activated on all of the nodes with code from the same ServiceType and have access to the DataPackage and ConfigPackage packages.

Cluster


A cluster is a set of interconnected nodes used to host applications and services. SF manages nodes and orchestrates applications in a way that ensures their durability, availability and scalability.

Orchestration logic is concentrated in the Cluster Manager Service (CM). CM utilizes different layout concepts (fault domains, upgrade domains), scalability strategies (partitioning) and load metrics in order to satisfy the above requirements.

Fault Domains


Fault Domains represent a layout concept of segregating resources by their physical implementation. Fault domains are naturally represented as a hierarchy of components that are likely to fail together. An example of this is two servers connected to the same power supply: while each server can fail for different reasons, both would fail in case of a power supply outage, so both are rooted in the same fault domain: /power-supply/server-1, /power-supply/server-2.

Illustration of Fault Domains

CM uses knowledge of fault domains when placing replicas to make sure they are correctly distributed across nodes from different fault domains. This reduces the effect of unexpected outages.

Azure Hosting Services

Fault Domains are defined automatically when using Azure Hosting Services. This information is picked up from Azure automatically.

Standalone Cluster

When configuring a standalone cluster, fault domains are explicitly defined for each node in the ClusterConfig.json.

"nodes": [
  {
    "nodeName": "Node1",
    "faultDomain": "fd:/datacenter1/rack1/blade1/v1", 
    // 'Node1' is in "datacenter #1 in rack #1 on blade #1"
  },
  {
    "nodeName": "Node2",
    "faultDomain": "fd:/datacenter1/rack2/blade1/v6", 
    // 'Node2' is in "datacenter #1 in rack #2 on blade #1"
  }
]

Example: Defining Fault Domain in ClusterConfig.json

Upgrade Domains


Upgrade Domains represent a layout concept of splitting nodes into logical groups (domains) in a way where all services on nodes within the same domain are upgraded simultaneously, while the domains themselves are processed sequentially. An upgrade domain can consist of one or more nodes.

Illustration of Upgrade Domains

Azure Hosting Services

Upgrade Domains are defined automatically when using Azure Hosting Services. This information is picked up from Azure automatically.

Standalone Cluster

When configuring a standalone cluster, upgrade domains are explicitly defined for each node in the ClusterConfig.json.

"nodes": [
  {
    "nodeName": "Node1",
    "upgradeDomain": "up:/green", 
    // 'Node1' belongs to 'green' upgrade domain
  },
  {
    "nodeName": "Node2",
    "upgradeDomain": "ud:/orange", 
    // 'Node2' belongs to 'orange' upgrade domain
  }
]

Example: Defining Upgrade Domain in ClusterConfig.json

The main justification for defining upgrade domains is quite obvious: we need to make sure that not all replicas are shut down during an upgrade. There is also a second justification: we need to make sure that the remaining nodes have enough capacity to handle the additional load.

It is important to understand that while fault domains are dictated by the underlying hardware infrastructure, upgrade domains are more like ‘tags’.

Author’s note

Reliable Services


Stateful Service


Stateful Service represents a service that preserves state between requests. Its replicas are divided into two types: primary and secondary. The primary replica has the right to read and manipulate the state, while secondary replicas have read-only access to it.

The state is preserved within the service without involving a third-party agent. This is done by keeping a full copy of the state within each replica (Picture A).

Picture A. Stateful Service state persistence diagram

State consistency between replicas is achieved by replicating all state modifications made by the primary replica to all secondary replicas (Picture B).

Picture B. Stateful Service state replication diagram

From a developer’s standpoint the state represents a sort of hash table where you can put and retrieve state objects. All interactions with the state are done using IReliableStateManager, while the values themselves are usually stored in special Reliable Collections.

There is also a way to introduce custom reliable state objects by implementing the IReliableState interface.

Author’s note

There are three available reliable collections: Reliable Dictionary, Reliable Queue and Reliable Concurrent Queue.

Reliable Dictionary

Represents a replicated, transactional, and asynchronous collection of key/value pairs.

Reliable Queue

Represents a replicated, transactional, and asynchronous strict first-in, first-out (FIFO) queue.

Reliable Concurrent Queue

Represents a replicated, transactional, and asynchronous best effort ordering queue for high throughput.

All reliable collections are: replicated, persisted, asynchronous and transactional.

Replicated

State changes are replicated for high availability.

Persisted

Data is persisted to disk for durability against large-scale outages (for example, a datacenter power outage).

Asynchronous

APIs are asynchronous to ensure that threads are not blocked when incurring IO.

Transactional

APIs utilize the abstraction of transactions so you can manage multiple Reliable Collections within a service easily.

All replicas keep the complete state both in memory and persisted to disk [*]. This is done to ensure the best performance when accessing the state. When the state is persisted to disk, all state values are serialized [*].
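A minimal sketch of the typical pattern, assuming the code lives inside a StatefulService (so this.StateManager is the IReliableStateManager) and using an illustrative dictionary name:

using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;

private async Task IncrementVisitsAsync()
{
    // Reliable collections are obtained (or created) by name.
    var counters = await this.StateManager
        .GetOrAddAsync<IReliableDictionary<string, long>>("counters");

    using (var tx = this.StateManager.CreateTransaction())
    {
        // Changes become visible, replicated and persisted only after CommitAsync.
        await counters.AddOrUpdateAsync(tx, "visits", 1, (key, value) => value + 1);
        await tx.CommitAsync();
    }
}

Example: Working with a Reliable Dictionary through IReliableStateManager (sketch)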

Life Cycle


A Stateful Service implementation isn’t just an instance of a class derived from StatefulServiceBase. When a replica is built, its implementation object has to be created, initialized and correctly registered in the SF runtime.

There are four life-cycle routines: Startup, Promotion and Demotion, Shutdown, and Abort.

Note that the described life cycles slightly contradict the documentation. Please see the issue on GitHub for details (why) and the gist for a code sample.

The implementation code is available on GitHub.

Author’s note
Startup

The startup routine is executed when a replica is created by SF.

When a replica is initialized as primary, the following sequence of events is performed:

  1. Service’s implementation object is created.
  2. Service’s OpenAsync method is called and awaited.
  3. Service’s CreateServiceReplicaListeners method is called and awaited.
  4. All ServiceReplicaListeners returned from CreateServiceReplicaListeners are used to create instances of ICommunicationListener and have their OpenAsync methods called. The methods are called and awaited in sequence.
  5. Service’s ChangeRoleAsync method is called (with newRole = Primary) and awaited.
  6. Service’s RunAsync method is called.

When a replica is initialized as secondary, the following sequence of events is performed:

  1. Service’s implementation object is created.
  2. Service’s OpenAsync method is called and awaited.
  3. Service’s ChangeRoleAsync method is called (with newRole = IdleSecondary) and awaited.
  4. Service’s CreateServiceReplicaListeners method is called and awaited.
  5. All ServiceReplicaListeners where ServiceReplicaListener.ListenOnSecondary == true returned from CreateServiceReplicaListeners are used to create instances of ICommunicationListener and have their OpenAsync methods called. The methods are called and awaited in sequence.
  6. Service’s ChangeRoleAsync method is called (with newRole = ActiveSecondary) and awaited.
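A skeleton of the StatefulService members participating in the sequences above may help map the steps to code; the class, listener and endpoint names here are assumptions:

using System;
using System.Collections.Generic;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Communication.Runtime;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class ApiService : StatefulService
{
    public ApiService(StatefulServiceContext context)
        : base(context) { }

    // Called once when the replica is opened (step 2).
    protected override Task OnOpenAsync(ReplicaOpenMode openMode, CancellationToken cancellationToken)
        => Task.CompletedTask;

    // Listeners are created here (steps 3-4); listenOnSecondary controls
    // whether a listener is also opened on secondary replicas.
    protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
    {
        yield return new ServiceReplicaListener(
            context => CreateListener(context),
            name: "ServiceEndpoint",
            listenOnSecondary: true);
    }

    // Role transitions: Primary, IdleSecondary, ActiveSecondary, None.
    protected override Task OnChangeRoleAsync(ReplicaRole newRole, CancellationToken cancellationToken)
        => Task.CompletedTask;

    // Background work; invoked only on the primary replica.
    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }

    protected override Task OnCloseAsync(CancellationToken cancellationToken)
        => Task.CompletedTask;

    // Placeholder: any ICommunicationListener implementation can be plugged
    // in here (e.g. an HTTP or remoting listener); omitted for brevity.
    private static ICommunicationListener CreateListener(StatefulServiceContext context)
        => throw new NotImplementedException();
}

Example: StatefulService life-cycle overrides (sketch)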

The differences in the startup sequences can be explained by the replica roles and method purposes.

RunAsync isn’t invoked on secondary replica

The RunAsync method is designed to allow a replica to perform some sort of background job. Because in a Stateful Service only the primary replica has write access to the reliable state, it doesn’t make sense to invoke this method on secondary replicas: primary and secondary replicas share the same implementation, which would expect write access.

ChangeRoleAsync is invoked twice on secondary replica

A new secondary replica is always created in the IdleSecondary role because it has to receive a copy of the reliable state from the primary replica before it can be of any use. When the copy of the reliable state is received, the replica continues the same startup sequence as the primary replica, with the exceptions that the final replica role is set to ActiveSecondary and the RunAsync method isn’t invoked.

The important moment to understand is when the startup routines are executed – when a primary or secondary replica is initialized from scratch. A secondary replica can be initialized from scratch at any time when CM decides. In contrast, a primary replica is initialized from scratch only when the partition is being initialized – in all other cases, when CM needs to move the primary replica, either an existing secondary replica is promoted to primary or a new secondary replica is initialized from scratch and then promoted to primary.

The partition initialization sequence is illustrated in the picture below:

Illustration of Stateful Service partition initialization sequence.

It is important to note that the primary replica doesn’t call the RunAsync method until the secondary replicas are built (i.e. have a copy of the state). The number of secondary replicas to await before running RunAsync on the primary replica is determined by MinReplicaSetSize.

Promotion and Demotion

Promotion and demotion are a natural part of the Stateful Service partition life cycle. These processes can happen for various reasons: primary replica movement (manual or automatic during resource balancing), primary replica failure, etc.

When a secondary replica is promoted, the following sequence of events is performed:

  1. All previously created ICommunicationListener have their CloseAsync method called and awaited. The methods are called and awaited in sequence.
  2. All ServiceReplicaListeners returned from CreateServiceReplicaListeners (during secondary replica initialization) are used to create instances of ICommunicationListener and have their OpenAsync methods called and awaited. The methods are called and awaited in sequence.
  3. Service’s ChangeRoleAsync method is called (with newRole = Primary) and awaited.
  4. Service’s RunAsync method is called.

When a primary replica is demoted, the following sequence of events is performed:

  1. All previously created ICommunicationListener have their CloseAsync method called and awaited. The methods are called and awaited in sequence.
  2. CancellationToken passed to RunAsync method is canceled.
  3. RunAsync method is awaited.
  4. All ServiceReplicaListeners returned from CreateServiceReplicaListeners (during primary replica initialization) where ServiceReplicaListener.ListenOnSecondary == true are used to create instances of ICommunicationListener and have their OpenAsync methods called and awaited. The methods are called and awaited in sequence.
  5. Service’s ChangeRoleAsync method is called (with newRole = ActiveSecondary) and awaited.

Note that the CreateServiceReplicaListeners method isn’t called during the promotion-demotion routines. The code uses the same ServiceReplicaListeners returned from the first initialization.

Author’s note

If the promotion-demotion cycle is graceful (i.e. it was triggered by CM) then the primary replica is demoted first and the secondary replica is promoted afterwards. In case of a primary replica failure SF tries (if possible) to perform a graceful shutdown of the primary replica and in parallel executes the promotion sequence on a secondary replica.

Shutdown

The shutdown routine can be triggered for various reasons, including replica restart or replica failure. The shutdown routine describes the so-called graceful shutdown (i.e. the process isn’t crashed and SF initiates the shutdown).

When a primary replica is shut down, the following sequence of events is performed:

  1. All previously created ICommunicationListener have their CloseAsync method called and awaited. The methods are called and awaited in sequence.
  2. CancellationToken passed to Service’s RunAsync method is canceled.
  3. Service’s RunAsync method is awaited.
  4. Service’s OnCloseAsync method is called and awaited.
  5. Service’s implementation object is destroyed.

When a secondary replica is shut down, the following sequence of events is performed:

  1. All previously created ICommunicationListener have their CloseAsync method called and awaited. The methods are called and awaited in sequence.
  2. Service’s OnCloseAsync method is called and awaited.
  3. Service’s implementation object is destroyed.

Abort

The abort routine is executed in cases when a replica has to be terminated very quickly, e.g. there is an internal failure on the node, the replica is being killed programmatically using Remove-ServiceFabricReplica, etc.

When a primary replica is aborted, the following sequence of events is performed:

  1. All previously created ICommunicationListener have their CloseAsync method called and awaited. The methods are called and awaited in sequence.
  2. CancellationToken passed to Service’s RunAsync method is canceled.
  3. Service’s RunAsync method is awaited.
  4. Service’s ChangeRoleAsync method is called (with newRole = None) and awaited.
  5. Service’s Abort method is executed.
  6. Service’s implementation object is destroyed.

When a secondary replica is aborted, the following sequence of events is performed:

  1. All previously created ICommunicationListener have their CloseAsync method called and awaited. The methods are called and awaited in sequence.
  2. Service’s ChangeRoleAsync method is called (with newRole = None) and awaited.
  3. Service’s Abort method is executed.
  4. Service’s implementation object is destroyed.

Stateless Service


Stateless Service represents a service that doesn’t preserve state between requests, or uses external storage to store and restore state on each request. All replicas are equal and it makes no difference which replica processes a request.

Life Cycle


A Stateless Service implementation isn’t just an instance of a class derived from StatelessService. When a replica is built, its implementation object is created and has to be initialized and correctly registered in the SF runtime.

Stateless Service has three life-cycle routines: Startup, Shutdown and Abort.

Note that the described life cycles slightly contradict the documentation. Please see the issue on GitHub for details (why) and the gist for a code sample.

The implementation code is also available on GitHub.

Author’s note
Startup

The startup routine is executed when a replica is created by SF.

Here is the sequence of events:

  1. Service’s implementation object is created.
  2. Service’s CreateServiceInstanceListeners method is called and awaited.
    On error:
    ⮚ Service implementation object is recreated.
  3. All ServiceInstanceListeners returned from CreateServiceInstanceListeners are used to create instances of ICommunicationListener and have their OpenAsync methods called. The methods are called and awaited in sequence.
    On error:
    ⮚ All previously opened listeners have their Abort method called.
    ⮚ Service implementation object is recreated.
  4. Then in parallel:
    • Service’s RunAsync method is called.
      On error:
      If exception is OperationCanceledException and RunAsync was canceled:
      ⮚ Execution of RunAsync is treated as cancelled.
      ⮚ No additional actions performed.
      If exception is FabricException:
      ⮚ Partition Health status is temporary changed to Warning.
      ⮚ Service implementation object is recreated.
      If the exception is OperationCanceledException and RunAsync wasn’t canceled, or the exception is of an unexpected type, then:
      ⮚ Partition Health status is temporary changed to Warning.
      ⮚ Executing process is terminated using Environment.FailFast method.
    • Service’s OpenAsync method is called.
  5. Service’s OpenAsync method is awaited.

It is important to understand that while RunAsync and OpenAsync are called in parallel, from the SF perspective the replica isn’t initialized until the OpenAsync method has finished.
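A small sketch of a RunAsync implementation that plays well with the error handling above – cancellation is honored via the token, everything else is allowed to bubble up (the WorkerService and DoWorkAsync names are illustrative):

using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class WorkerService : StatelessService
{
    public WorkerService(StatelessServiceContext context)
        : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (true)
        {
            // Throws OperationCanceledException when SF cancels the token;
            // the runtime then treats RunAsync as properly cancelled.
            cancellationToken.ThrowIfCancellationRequested();

            await DoWorkAsync(cancellationToken);
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }

    // Illustrative unit of work.
    private static Task DoWorkAsync(CancellationToken cancellationToken)
        => Task.CompletedTask;
}

Example: A cancellation-aware RunAsync in a Stateless Service (sketch)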

Shutdown

The shutdown routine can be triggered for various reasons, including application upgrade, service scaling or replica failure. The shutdown routine describes the so-called graceful shutdown (i.e. the process isn’t crashed and SF initiates the shutdown).

Here is the sequence of events:

  1. All previously created ICommunicationListener have their CloseAsync method called and awaited. The methods are called and awaited in sequence.
  2. CancellationToken passed to Service’s RunAsync method is canceled.
  3. Service’s RunAsync method is awaited.
  4. Service’s OnCloseAsync method is called and awaited.
  5. Service’s implementation object is destroyed.

Abort

The abort routine is executed in cases when a replica has to be terminated very quickly, e.g. there is an internal failure on the node, the replica is being killed programmatically using Remove-ServiceFabricReplica, etc.

Here is the sequence of events:

  1. CancellationToken passed to Service’s RunAsync method is canceled.
  2. All previously created ICommunicationListener have their Abort method executed. The methods are executed in sequence.
  3. Service’s Abort method is executed.
  4. Service’s implementation object is destroyed.

Because the service’s RunAsync method isn’t awaited, depending on the service hosting model the abort routine puts slightly different constraints on how much time RunAsync has to finish cancellation:

ExclusiveProcess

The time is very limited and equals the time required to execute all ICommunicationListener.Abort methods and the service’s Abort method. Right after that the hosting process is terminated.

SharedProcess

In normal circumstances, when there is no critical reason to terminate the host process, there is no time limit.

From the above description it can sound like the SharedProcess hosting model is better than ExclusiveProcess because it doesn’t limit RunAsync in cancellation time. This behavior is just a side effect of how SharedProcess hosting is currently implemented and should not be relied on when writing RunAsync code.

Author’s note

Partitioning


Partitioning is a cluster-agnostic concept used to achieve scalability by grouping service replicas based on domain requirements. Partitioning isn’t related to the cluster layout, i.e. when choosing a partitioning scheme it shouldn’t matter how many nodes are in the cluster and what configuration they have.

There are three supported partitioning schemes: singleton, ranged and named.

Singleton


When using singleton partitioning all replicas are assigned to a single group (no grouping).

Singleton Partitioning Three Nodes
Singleton Partitioning Six Nodes

Distribution of replicas across three-node and six-node clusters when using the Singleton partitioning scheme.

Ranged (UniformInt64)


When using uniform partitioning, the specified range of values (e.g. LowKey = 0, HighKey = 29) is divided into the desired partition count (PartitionCount = 3) in a way that makes each partition responsible for a particular subrange of values (Partition 1: 0-9, Partition 2: 10-19, Partition 3: 20-29).

Ranged Partitioning Three Nodes
Ranged Partitioning Six Nodes

Distribution of replicas across three-node and six-node clusters when using the Ranged (UniformInt64) partitioning scheme.
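As a sketch of how clients typically address such a service, a domain value is mapped onto the configured key range; the hashing below is only illustrative (note that string.GetHashCode is not stable across processes on .NET Core, so a real implementation should use a stable hash):

using System;
using Microsoft.ServiceFabric.Services.Client;

static ServicePartitionKey GetPartitionKey(string customerId)
{
    // LowKey/HighKey must match the values configured for the service.
    const long lowKey = 0;
    const long highKey = 29;

    long hash = Math.Abs((long)customerId.GetHashCode());
    return new ServicePartitionKey(lowKey + hash % (highKey - lowKey + 1));
}

Example: Mapping a domain key onto a ranged partition key (sketch)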

Named


When using named partitioning each partition is addressable by a unique string name.

Named Partitioning Three Nodes
Named Partitioning Six Nodes

Distribution of replicas across three-node and six-node clusters when using the Named partitioning scheme.

Stateful Service


Partitioning has a special meaning for a Stateful Service – each partition has its own isolated instance of the reliable state and therefore its own primary replica and multiple secondary replicas.

When configuring the replica count, a Stateful Service uses two values – MinReplicaSetSize and TargetReplicaSetSize.

MinReplicaSetSize

MinReplicaSetSize is the minimum number of replicas required to preserve state and continue partition progress [*]. This value is used to calculate the minimum quorum size [*]:

S_{\text{quorum}} = N \div 2 + 1

In case of quorum loss all further writes are disallowed.
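For example, for a replica set of N = 5 replicas the quorum is

S_{\text{quorum}} = 5 \div 2 + 1 = 3

so at least three replicas have to acknowledge a write before it is committed.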

TargetReplicaSetSize

TargetReplicaSetSize is the desired number of replicas in the replica set. Service Fabric will try to keep the replica count as close as possible to this value.

When the reliable state is updated by the primary replica, the change request MUST be accepted by a quorum of replicas, otherwise the change request is rejected and the operation fails.

Author’s note

Both MinReplicaSetSize and TargetReplicaSetSize can be changed at runtime with C# …

var updateDescription =
  new StatefulServiceUpdateDescription();

updateDescription.MinReplicaSetSize = 5;
updateDescription.TargetReplicaSetSize = 7;

await fabricClient.ServiceManager
  .UpdateServiceAsync(
    new Uri("fabric:/app/service"),
    updateDescription);

Example: Updating MinReplicaSetSize and TargetReplicaSetSize at runtime using C#

… or PowerShell

Update-ServiceFabricService `
  -Stateful fabric:/app/service `
  -MinReplicaSetSize 5 `
  -TargetReplicaSetSize 7

Example: Updating MinReplicaSetSize and TargetReplicaSetSize at runtime using PowerShell

Stateless Service


Partitioning has no special meaning for a Stateless Service; however, it still can be used to achieve scalability by domain decomposition, i.e. each partition can have its own database instance and each replica will use the partition information to determine the appropriate database to use.

When configuring the replica count, a Stateless Service uses a single value – InstanceCount. InstanceCount represents the absolute number of replicas required; additionally, SF allows specifying InstanceCount = -1 in order to place one replica on each node.

The InstanceCount can be changed at runtime with C# …

var updateDescription =
  new StatelessServiceUpdateDescription();

updateDescription.InstanceCount = 5;

await fabricClient.ServiceManager
  .UpdateServiceAsync(
    new Uri("fabric:/app/service"),
    updateDescription);

Example: Updating InstanceCount at runtime using C#

… or PowerShell

Update-ServiceFabricService `
  -Stateless fabric:/app/service `
  -InstanceCount 5

Example: Updating InstanceCount at runtime using PowerShell

Communication


Cross-service communication is a very important and complex topic in a rapidly changing environment. SF has rich support for implementing cross-service communication, including custom, HTTP or Remoting protocols. Because replicas can be created / deleted / moved between nodes based on decisions made by CM, there is no way to statically preconfigure services with any kind of IP address or port to contact.

Replica address resolution is done using one of the SF infrastructure services – the Naming Service. A high-level overview of cross-service communication is illustrated in Picture A.

Picture A. High level illustration of communication between primary replicas of ‘Green’ and ‘Blues’ services using ‘Naming Service’

Contacting the Naming Service doesn’t end up with the address of a particular replica. Instead, the Naming Service returns a special Address structure that contains all the endpoints (declared in ServiceManifest.xml) registered in the partition and fills each endpoint value with the physical address of one of the replicas. The replica whose address is returned is chosen based on the internal load balancing algorithms of the Naming Service and the passed TargetReplicaSelector.

Default (PrimaryReplica, RandomInstance)

If the service partition is stateful then communication is established with the primary replica (PrimaryReplica), otherwise the RandomInstance selector is used.

RandomReplica

This selector indicates that communication can be established with any replica chosen at random, i.e. primary or secondary (Stateful Service).

RandomSecondaryReplica

This selector indicates that communication can be established with any secondary replica chosen at random (Stateful Service).

Picture B presents a detailed diagram of communication between two services.

Picture B. Detailed illustration of communication between primary replicas of ‘Green’ and ‘Blues’ services using ‘Naming Service’

The Address structure contains a value for each Endpoint declared in the ServiceManifest.xml. Depending on the type of the endpoint, the value consists of an IP and port plus custom metadata.
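A sketch of resolving a partition through the Naming Service using ServicePartitionResolver (the service name and partition key below are assumptions):

using System;
using System.Fabric;
using System.Threading;
using Microsoft.ServiceFabric.Services.Client;

ServicePartitionResolver resolver = ServicePartitionResolver.GetDefault();

// Ask the Naming Service for the current endpoints of the partition that
// owns key 42 (ranged partitioning is assumed for illustration).
ResolvedServicePartition partition = await resolver.ResolveAsync(
    new Uri("fabric:/app-colors/srv-blue"),
    new ServicePartitionKey(42),
    CancellationToken.None);

// For Reliable Services the Address value is a small JSON document that
// lists the endpoints declared in ServiceManifest.xml by name.
string address = partition.GetEndpoint().Address;

Example: Resolving a replica address using ServicePartitionResolver (sketch)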

There are a few built-in ways to implement the communication (custom protocols, HTTP, and Remoting).

Replica Placement


Strategies


Fault and upgrade domain information is dictated by the hardware infrastructure and node capacity, and it significantly affects the way replicas are distributed across the cluster. While there are multiple factors that affect how replicas are distributed, SF supports three major strategies: Maximum Difference, Quorum Safe and Adaptive.

Maximum Difference


The main Maximum Difference strategy constraint can be formulated as follows:

For a given service partition there should never be a difference greater than one in the number of service objects (stateless service instances or stateful service replicas) between two domains.

docs.microsoft.com

Here is a quick example:

Imagine a six-node cluster with nodes distributed across five fault domains and five upgrade domains (Picture A).

Picture A. Maximum Difference Strategy Cluster Layout

When using the Maximum Difference strategy, a service instance with five replicas in total would have them distributed as shown in Picture B.

Picture B. Maximum Difference Strategy Replica Distribution

Here CM distributed the replicas between fault domains (to ensure maximum availability). The trick here is that if we tried to distribute the replicas differently, we would always end up with two fault domains whose difference in the number of assigned replicas is greater than or equal to two, which violates the constraint of the Maximum Difference strategy.

This example demonstrates the rule of thumb when planning your cluster: try to keep the fault domain hierarchy tree balanced. An unbalanced fault domain tree can result in harder availability management and, as a result, can lead to worse usage of the available resources.

Author’s note

Quorum Safe


The main Quorum Safe strategy constraint can be formulated as follows:

For a given service partition, replica distribution across domains should ensure that the partition does not suffer a quorum loss.

docs.microsoft.com

Here is a quick example:

Imagine a six-node cluster with nodes distributed across five fault domains and five upgrade domains (Picture A).

Picture A. Quorum Safe Strategy Cluster Layout

When using the Quorum Safe strategy, a service instance with five replicas in total would have them distributed as shown in Picture B.

Picture B. Quorum Safe Strategy Replica Distribution

In contrast to the Maximum Difference strategy this distribution is possible because the main constraint of Quorum Safe is satisfied, i.e. in case of an FD0 outage the majority of replicas – the quorum – will remain safe.

The Quorum Safe strategy is more flexible than Maximum Difference and allows more variants of replica distribution across the cluster nodes by sacrificing some fault tolerance characteristics.

The term “quorum” exists only for Stateful Services. That is why, when the Quorum Safe strategy is used for Stateless Services, it basically allows a more relaxed way of distributing replicas.

Author’s note

Adaptive


The Maximum Difference and Quorum Safe placement strategies have their advantages and disadvantages; that is why there is a third strategy – the Adaptive strategy.

The main Adaptive strategy constraint can be formulated as follows:

If the TargetReplicaSetSize is evenly divisible by the number of Fault Domains and the number of Upgrade Domains and the number of nodes is less than or equal to the (number of Fault Domains) x (the number of Upgrade Domains), the Cluster Resource Manager should utilize the “quorum based” logic for that service.

docs.microsoft.com

Generally speaking, the Adaptive strategy results in the usage of either the Maximum Difference or the Quorum Safe strategy based on the service instance configuration. The Maximum Difference strategy is used by default, but if the above conditions are met then the Quorum Safe strategy is used.

Starting from SF v6.2 [*] the Adaptive strategy is the default strategy used by CM.

docs.microsoft.com

Here is a quick example:

Imagine an eight-node cluster distributed across five fault domains and five upgrade domains (Picture A).

Picture A. Adaptive Strategy Cluster Layout

When using the Adaptive strategy, a service instance with TargetReplicaSetSize = 5 would have its replicas distributed as shown in Picture B.

Picture B. Adaptive Strategy Quorum Safe Replica Distribution

The Quorum Safe approach is utilized here because the service instance meets all of the conditions:

Condition #1

TargetReplicaSetSize is evenly divisible by the number of fault and upgrade domains:

\begin{array}{lcl}N_{\text{TargetReplicaSetSize}} \div N_{\text{FD}} \rightarrow 5 \div 5 = 1 \\ N_{\text{TargetReplicaSetSize}} \div N_{\text{UD}} \rightarrow 5 \div 5 = 1\end{array}

Condition #2

The number of nodes is less than or equal to the number of fault domains multiplied by the number of upgrade domains:

N_{\text{Nodes}} \leq N_{\text{FD}} \times N_{\text{UD}} \rightarrow 8 \leq 5 \times 5 = 25

If we reduce TargetReplicaSetSize to 4 then CM will start to utilize the Maximum Difference strategy, as shown in Picture C.

Picture C. Adaptive Strategy Maximum Difference Replica Distribution

The Maximum Difference approach is utilized here because the service instance doesn’t meet all of the conditions:

Condition #1

TargetReplicaSetSize isn’t evenly divisible by the number of fault and upgrade domains:

\begin{array}{lcl}N_{\text{TargetReplicaSetSize}} \div N_{\text{FD}} \rightarrow 4 \div 5 = 0.8 \\ N_{\text{TargetReplicaSetSize}} \div N_{\text{UD}} \rightarrow 4 \div 5 = 0.8\end{array}

Node Placement Properties


Placement Properties are a way of describing a node’s capabilities in terms of abstract properties. These properties can be anything – they can reflect the business purpose of the node, or its hardware or software specification. This information is required in order to satisfy the need of multi-tiered applications to host different services on different hardware, software or environments.

By default all nodes have two predefined placement properties: NodeName and NodeType. These placement properties cannot be edited or changed. All other placement properties are defined as part of the NodeType definition and can be updated by updating the cluster configuration.

"nodeTypes": [
  {
    "name": "HostBackendPrimary",
    "placementProperties": {
      "IsBackend": true
    }
  }
]

Example of placement properties definition in ClusterConfig.json

Service Placement Constraints


Placement Constraints are a way of describing a service instance’s requirements for the node’s hardware, software or business purpose in terms of abstract properties. The combination of Placement Properties and Placement Constraints makes it possible to ensure that replicas of the service instance are hosted in the desired environment.

Placement Constraints are represented as a conditional expression that supports various types (string, int, enumeration) and operations like equality and comparison.

Service instance placement constraints can be modified at runtime. This could possibly result in massive replica movement around the cluster.

var updateDescription =
  new StatelessServiceUpdateDescription();

updateDescription
  .PlacementConstraints = "IsBackend == True";

await fabricClient.ServiceManager
  .UpdateServiceAsync(
    new Uri("fabric:/app/service"),
    updateDescription);

Example: Updating service instance placement constraints

There is no naming convention for placement properties and placement constraints.

For example, the following definition of NodeType is absolutely correct:

"nodeTypes": [
  {
    "name": "host-backend",
    "placementProperties": {
      "is-backend": true
    }
  }
]

The above definition of the is-backend placement property is correct and passes the validation done by SF, but if we specify it as a placement constraint…

var updateDescription =
  new StatelessServiceUpdateDescription();

updateDescription
  .PlacementConstraints = "is-backend == True";

await fabricClient.ServiceManager
  .UpdateServiceAsync(
    new Uri("fabric:/app/service"),
    updateDescription);

… then no service replicas will be created. There is an issue on GitHub with a detailed investigation of this point.

Service Placement Policy


Restricting the nodes on which replicas can be placed is managed using placement properties and placement constraints, but there are cases when this isn’t enough. This can be related to the physical cluster layout (e.g. the nodes in a cluster span geographic / geo-political boundaries) or to performance considerations (e.g. some workloads should always be collocated to reduce latency).

SF supports four placement policies driven by knowledge of the cluster layout: Invalid Domain, Required Domain, Preferred Primary Domain and Replica Packing.

Invalid Domain


The Invalid Domain placement policy is used to specify a fault domain to ignore during replica placement. Replicas won’t be placed on nodes from the specified domain even if there are no other nodes available.

Illustration of Invalid Domain placement policy

There can be multiple invalid domain policies set on the same service in order to restrict replica placement in more than one fault domain – this can be configured at runtime, as sketched below.

Summary
✓ Can be used with Stateless Service?
✓ Can be used with Stateful Service?
✓ Can be configured multiple times?
✓ Can be updated at runtime?
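As a sketch – assuming an SDK version where ServiceUpdateDescription exposes a settable PlacementPolicies list – the policy can be applied to a running service like this:

using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Description;

var fabricClient = new FabricClient();

var updateDescription = new StatefulServiceUpdateDescription
{
    PlacementPolicies = new List<ServicePlacementPolicyDescription>
    {
        // Never place replicas of this service in fault domain fd:/dc2.
        new ServicePlacementInvalidDomainPolicyDescription
        {
            DomainName = "fd:/dc2"
        }
    }
};

await fabricClient.ServiceManager.UpdateServiceAsync(
    new Uri("fabric:/app/service"),
    updateDescription);

Example: Setting an Invalid Domain placement policy at runtime (sketch)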

Required Domain


The Required Domain placement policy is used to specify the fault domain to consider during replica placement. Replicas will be placed only on nodes from the specified domain.

Illustration of Required Domain placement policy

Summary
✓ Can be used with Stateless Service?
✓ Can be used with Stateful Service?
✓ Can be configured multiple times?
✓ Can be updated at runtime?

Preferred Domain


The Preferred Domain policy is used to specify the fault domain to use when placing the stateful service’s primary replica. The primary replica ends up in this domain when everything is healthy. In case the domain isn’t healthy, the Cluster Resource Manager moves the primary replica to another location and returns it back to the preferred domain as soon as possible [*].

Illustration of Preferred Domain placement policy

There can be only one preferred domain policy set on the same service at a time. This policy can be configured at runtime.

The policy makes sense only for Stateful Service.

Summary
✗ Can be used with Stateless Service?
✓ Can be used with Stateful Service?
✗ Can be configured multiple times?
✓ Can be updated at runtime?

Packing


Normally replicas are distributed between fault and upgrade domains when the cluster is healthy. However, there are situations when multiple replicas of the same partition can be placed into a single domain. This process is called replica packing.

Imagine a healthy cluster with four fault domains and three replicas distributed across them (Picture A).

Picture A. Illustration of healthy cluster

Now something has happened and two of the four fault domains went down. In Picture B you can see that when replica packing is enabled, CM will create the replica in one of the remaining fault domains.

Picture B. Illustration of cluster state when two of four fault domains are down and replica packing is enabled

However, if replica packing is disabled, then CM won’t build any new replicas until nodes from other fault domains become available (Picture C).

Picture C. Illustration of cluster state when two of four fault domains are down and replica packing is disabled

A need for replica packing is treated as a temporary situation; that is why it is enabled by default. This policy can be configured at runtime.

Summary
✓ Can be used with Stateless Service?
✓ Can be used with Stateful Service?
✗ Can be configured multiple times?
✓ Can be updated at runtime?

Service Affinity


All application services are expected to be independent from each other, but sometimes there are situations where it is necessary to ensure that all replicas of one service are collocated with the replicas of another service [*].

Service Affinity is a directional relationship between two services where replicas of the dependent / child service (A) are collocated with replicas of the principal / parent service (B). The collocation process has several limitations:

  • The collocation isn’t done immediately. There can be notable periods when replicas won’t be collocated because of cluster capacity limitations, replica failures or outages.
  • The parent service can’t define a partitioning scheme.

In addition to the above limitations service affinity doesn’t support affinity chaining:

A \rightarrow B \rightarrow C

Instead of chaining, this kind of relationship should be expressed in the form of a star pointing to the topmost service, which has the same effect:

A \rightarrow B \rightarrow C \sim \begin{array}{lcl} A \rightarrow C \\ B \rightarrow C\end{array}

Besides the default affinity type – None – there are two affinity types: NonAlignedAffinity and AlignedAffinity.

NonAlignedAffinity


NonAlignedAffinity instructs CM to place replicas of the child service alongside replicas of the parent service. When using NonAlignedAffinity with a stateful service, CM doesn’t differentiate between primary and secondary replicas.

Illustration of NonAlignedAffinity

This affinity scheme can be configured at runtime.

Summary
✓ Can be used with Stateless Service?
✓ Can be used with Stateful Service?
✓ Can be updated at runtime?

AlignedAffinity


AlignedAffinity instructs CM to place secondary replicas of the child service alongside secondary replicas of the parent service, and the primary replica of the child service alongside the primary replica of the parent service.

Illustration of AlignedAffinity

The policy makes sense only for a Stateful Service and can be configured at runtime, as sketched below.

Summary
✗ Can be used with Stateless Service?
✓ Can be used with Stateful Service?
✓ Can be updated at runtime?
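As a sketch – assuming an SDK version where ServiceUpdateDescription exposes a settable Correlations list – affinity between a child and a parent service can be configured like this:

using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Description;

var fabricClient = new FabricClient();

var updateDescription = new StatefulServiceUpdateDescription
{
    Correlations = new List<ServiceCorrelationDescription>
    {
        // Collocate replicas of fabric:/app/child with fabric:/app/parent,
        // matching primary with primary and secondaries with secondaries.
        new ServiceCorrelationDescription
        {
            ServiceName = new Uri("fabric:/app/parent"),
            Scheme = ServiceCorrelationScheme.AlignedAffinity
        }
    }
};

await fabricClient.ServiceManager.UpdateServiceAsync(
    new Uri("fabric:/app/child"),
    updateDescription);

Example: Configuring AlignedAffinity at runtime (sketch)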

Application Upgrade


Application Upgrade is the process of upgrading an existing Application to a new version of its ApplicationType. During the process SF checks the ApplicationType version and, if it has changed, compares the versions of all ServiceTypes; for the types that have changed it compares the versions of the Code, Config and Data packages.

NOTE

While the process is called Application Upgrade, it is important to understand that for SF there is no difference between upgrading an application from 1.0 to 2.0 or from 2.0 to 1.0.

The upgrade process sequentially iterates over all upgrade domains and simultaneously upgrades all applications within the current upgrade domain. The application upgrade is a multi-stage process, so it is important to understand that at some point in time some of the services will already be upgraded to the new version while the rest of them will still run on the previous version. This leads to a situation where replicas of the same service run two different versions of code.

Because any of the Application components can be upgraded, an Application Upgrade doesn’t always mean modification of the service code. That is why, when the upgrade process affects only Config or Data packages, the replicas aren’t restarted. Instead, the replica is notified using code package events, as sketched below.
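A sketch of reacting to such an upgrade without a restart by subscribing to the configuration package modified event (the section and parameter names follow the earlier Settings.xml example):

using System.Fabric;

ICodePackageActivationContext activationContext =
    FabricRuntime.GetActivationContext();

activationContext.ConfigurationPackageModifiedEvent += (sender, e) =>
{
    // e.OldPackage holds the previous settings, e.NewPackage the upgraded
    // ones; re-read whatever configuration values the replica caches.
    string keyOne = e.NewPackage.Settings
        .Sections["Section"]
        .Parameters["KeyOne"]
        .Value;
};

Example: Handling a Config package upgrade without a replica restart (sketch)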

NOTE

When a replica restart is necessary, the application upgrade process configuration allows specifying the Force Restart parameter.

History


  • 2023/02/21 – Introduced history section to track post updates.
  • 2023/02/21 – Updated information in NodeType & Node section about mapping of node type to VMSS (Virtual Machine Scale Set) based on the tip in Tweet by Alexander Batishchev and the documentation.
