No, not that kind.
Fighting spam is an arms race.
The attack landscape is constantly evolving:
Fake accounts, account cloning
Social engineering
Hacked accounts posting spam
Malicious browser extensions
…and many more
Massive automation
Fast engineering triage of breaking attacks
Continuous deployment of code to respond
Sigma is a rule execution engine
Every interaction on FB has associated rules
Posting a status update
Liking a post
Sending a message
Sigma evaluates each interaction to identify and block malicious acts
“A user who is less than a week old posts a photo tagging >= 5 non-friends”
The old Sigma rule language, FXL:
If (AgeInHours(Account) < 168 &&
    Length(Difference(Tagged(Post),
                      Friends(Account))) >= 5)
  Then [BlockAction, LogRequest]
  Else []
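For comparison, here is a rough sketch of the same policy as a pure Haskell function. The `Post` record, its field names, and the `Response` type are hypothetical stand-ins invented for illustration; they are not the real Sigma or Haxl API, and the real rule would fetch this data rather than receive it in a record.

```haskell
import Data.List ((\\))

type UserID = Int

-- Hypothetical rule inputs; the field names are illustrative only.
data Post = Post
  { authorAgeHours :: Int       -- age of the posting account, in hours
  , authorFriends  :: [UserID]  -- friends of the posting account
  , tagged         :: [UserID]  -- users tagged in the photo
  }

-- Hypothetical responses, mirroring the names used in the FXL rule.
data Response = BlockAction | LogRequest
  deriving (Eq, Show)

-- "A user who is less than a week old posts a photo tagging >= 5 non-friends"
newUserTagSpam :: Post -> [Response]
newUserTagSpam post
  | authorAgeHours post < 168                       -- 168 hours = one week
  , length (tagged post \\ authorFriends post) >= 5 -- tagged users who are not friends
  = [BlockAction, LogRequest]
  | otherwise = []
```

Because the function is pure, it can be tested in isolation by constructing a `Post` value by hand.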
Pluses:
Pure functions
Batched data fetches
Minuses became an increasing drag:
Home-grown language
Simple, non-extensible type system
Too much C++ written
Policies can’t inadvertently interact with each other
Rule code can’t crash Sigma
Policies become easy to test in isolation
Policies fetch data from many other systems
Concurrency must be implicit
Push code to production in minutes
Support for interactive development
x and y are Facebook users
We want to compute the number of friends that x and y have in common
Our ideal for expressiveness:
length (intersect (friendsOf x) (friendsOf y))
We have many varied data sources
Must minimise network roundtrips
Overlap accesses to different services
One “multi-get” access when a single service is used
Cache repeated requests for same/similar data
Need to abstract all this away
{-# LANGUAGE NoImplicitPrelude #-}
module HackyPrelude where
length :: IO [a] -> IO Int
intersect :: Eq a => IO [a] -> IO [a] -> IO [a]
friendsOf :: UserID -> IO [UserID]
This approach “works”, but it’s awful.
Everything gets lifted into IO:
length :: IO [a] -> IO Int
No more pure, safe code :-(
No concurrency or batching when we execute two friendsOf actions:
length (intersect (friendsOf x) (friendsOf y))
What if we use MVar to express this?
length (intersect (friendsOf x) (friendsOf y))
Can’t see what’s happening for all the MVar machinery!
do
  m1 <- newEmptyMVar
  m2 <- newEmptyMVar
  forkIO (friendsOf x >>= putMVar m1)
  forkIO (friendsOf y >>= putMVar m2)
  fx <- takeMVar m1
  fy <- takeMVar m2
  return (length (intersect fx fy))
We need something where:
We can reorder an expression to optimize data fetching
No side effects (other than fetching)
No clutter from concurrency machinery
Inspect a rule to find all data fetches, without executing them
Re-organize fetches to use batching, concurrency, and caching
Execute fetches, wait for results
Resume execution once results come in,
inspecting and re-organizing the next round of fetches
We want to be able to inspect the structure of code without executing it
Let’s think briefly of a monad m as the “structure” of a computation.
Look at the flipped version of bind:
(=<<) :: Monad m => (a -> m b) -> m a -> m b
It transforms an m a into an m b.
To do so, it takes its instructions from a -> m b.
That function can use a to decide what m b it returns.
Thus we can’t analyse this statically (at least not easily): we can’t know what structure (the m b) will be returned until the code is run.
Compare this:
(=<<) :: Monad m => (a -> m b) -> m a -> m b
With these rather similar type signatures:
(<$>) :: Functor f => (a -> b) -> f a -> f b
(<*>) :: Applicative f => f (a -> b) -> f a -> f b
What’s the crucial difference?
Neither Functor nor Applicative can affect the “structure” f that gets returned.
They can only change b inside f.
Therefore they’re friendly to static analysis!
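One way to see this in practice is to interpret an applicative expression in a functor that only records what would be fetched. The sketch below uses `Const` from `base` for that purpose; `friendsOfPlan` and the other names are hypothetical and are not part of Haxl.

```haskell
import Data.Functor.Const (Const (..))
import Data.List (intersect)

type UserID = Int

-- Hypothetical "fetch" that performs no I/O: it only records the user
-- whose friend list would be requested.
friendsOfPlan :: UserID -> Const [UserID] [UserID]
friendsOfPlan uid = Const [uid]

-- The same shape as the friends-in-common expression, built only
-- from <$> and <*>.
commonFriendsPlan :: UserID -> UserID -> Const [UserID] Int
commonFriendsPlan x y =
  length <$> (intersect <$> friendsOfPlan x <*> friendsOfPlan y)

-- Collect every planned fetch without executing anything.
plannedFetches :: UserID -> UserID -> [UserID]
plannedFetches x y = getConst (commonFriendsPlan x y)
```

Here `plannedFetches 1 2` lists both lookups up front. The same trick is impossible with `>>=`, because `Const [UserID]` has no lawful `Monad` instance: there is no value to pass to the continuation.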
newtype Haxl a = Haxl { unHaxl :: IO (Result a) }
data Result a = Done a
              | Blocked (Haxl a)
instance Monad Haxl where
  return a = Haxl (return (Done a))
  m >>= k = Haxl $ do
    a <- unHaxl m
    case a of
      Done a    -> unHaxl (k a)
      Blocked r -> return (Blocked (r >>= k))
data Result a = Done a
              | Blocked (Haxl a)
A computation has either
Completed successfully
Blocked on a pending data fetch
Haxl monad?
Alternate names for this construction:
Resumption monad, concurrency monad
Modern name: free monad (ours has a few bells and whistles)
What?
We are separating the representation of the computation from the way it will be run
Haxl and Result give us an abstract syntax tree (AST) that we can manipulate before we execute the code
countCommonFriends :: UserID -> UserID -> Haxl Int
countCommonFriends x y = do
  fx <- friendsOf x
  fy <- friendsOf y
  return (length (intersect fx fy))
We run out of AST to explore as soon as we hit the first friendsOf, thanks to use of >>= (via do desugaring).
Can’t see the second one, so no concurrent execution.
Can we rewrite our function to make more of the fetches statically visible?
Yes!
countCommonFriends x y =
  length <$> (intersect <$> friendsOf x <*> friendsOf y)
instance Applicative Haxl where
  pure = return
  Haxl f <*> Haxl a = Haxl $ do
    r <- f
    case r of
      Done f' -> do
        ra <- a
        case ra of
          Done a'    -> pure (Done (f' a'))
          Blocked a' -> pure (Blocked (f' <$> a'))
      Blocked f' -> do
        ra <- a
        case ra of
          Done a'    -> pure (Blocked (f' <*> pure a'))
          Blocked a' -> pure (Blocked (f' <*> a'))
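To make the difference between `>>=` and `<*>` concrete, here is a self-contained toy version that compiles on its own. It is a sketch under several assumptions, not the real Haxl: the queue of `Pending` requests, the in-memory `lookupFriends` backend, and the round counter in `runHaxl` are all invented for illustration.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef, writeIORef)
import Data.List (intersect)

type UserID = Int

-- One outstanding request: whose friends to look up, and a slot for the answer.
data Pending = Pending UserID (IORef [UserID])

-- Toy variant: each computation is handed the queue of the current round.
newtype Haxl a = Haxl { unHaxl :: IORef [Pending] -> IO (Result a) }

data Result a = Done a | Blocked (Haxl a)

instance Functor Haxl where
  fmap f (Haxl m) = Haxl $ \q -> do
    r <- m q
    pure $ case r of
      Done a    -> Done (f a)
      Blocked k -> Blocked (fmap f k)

instance Applicative Haxl where
  pure a = Haxl (\_ -> pure (Done a))
  Haxl mf <*> Haxl ma = Haxl $ \q -> do
    rf <- mf q
    ra <- ma q -- always explore both sides, so both can enqueue fetches
    pure $ case (rf, ra) of
      (Done f, Done a)       -> Done (f a)
      (Done f, Blocked a)    -> Blocked (f <$> a)
      (Blocked f, Done a)    -> Blocked (f <*> pure a)
      (Blocked f, Blocked a) -> Blocked (f <*> a)

instance Monad Haxl where
  Haxl m >>= k = Haxl $ \q -> do
    r <- m q
    case r of
      Done a    -> unHaxl (k a) q
      Blocked c -> pure (Blocked (c >>= k)) -- cannot look inside k yet

-- Enqueue a request and block; the continuation reads the filled slot.
friendsOf :: UserID -> Haxl [UserID]
friendsOf uid = Haxl $ \q -> do
  slot <- newIORef []
  modifyIORef' q (Pending uid slot :)
  pure (Blocked (Haxl (\_ -> Done <$> readIORef slot)))

-- Made-up backend data standing in for a real service.
lookupFriends :: UserID -> [UserID]
lookupFriends 1 = [2, 3, 4]
lookupFriends 2 = [3, 4, 5]
lookupFriends _ = []

-- Run to completion, returning the result and the number of fetch rounds.
runHaxl :: Haxl a -> IO (a, Int)
runHaxl h0 = go h0 0
  where
    go (Haxl m) rounds = do
      q <- newIORef []
      r <- m q
      case r of
        Done a    -> pure (a, rounds)
        Blocked k -> do
          pending <- readIORef q -- a whole round, available for batching
          mapM_ (\(Pending uid slot) -> writeIORef slot (lookupFriends uid)) pending
          go k (rounds + 1)

commonMonadic, commonApplicative :: UserID -> UserID -> Haxl Int
commonMonadic x y = do
  fx <- friendsOf x
  fy <- friendsOf y
  return (length (intersect fx fy))
commonApplicative x y =
  length <$> (intersect <$> friendsOf x <*> friendsOf y)
```

In this toy, both functions compute the same count, but `runHaxl (commonMonadic 1 2)` needs two rounds while `runHaxl (commonApplicative 1 2)` needs one, because only the applicative form exposes both requests before anything blocks.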
Two complementary approaches:
Alternate Haxl-friendly prelude
Fancy language support (ApplicativeDo)
{-# LANGUAGE ApplicativeDo #-}
Facebook-developed language extension
When ApplicativeDo is turned on:
GHC will use a different method for desugaring do-notation
Attempts to use the Applicative operator <*> as far as possible, along with fmap and join
Makes it possible to use do-notation for types that are Applicative but not Monad
Example:
do
  x <- a
  y <- b
  return (f x y)
Translates to:
f <$> a <*> b
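A small way to try this out is with a type that is Applicative but not Monad. The sketch below uses `ZipList` from `base`; the name `pairwiseSums` is made up for the example.

```haskell
{-# LANGUAGE ApplicativeDo #-}

import Control.Applicative (ZipList (..))

-- ZipList has an Applicative instance but no Monad instance, so this
-- do-block only typechecks because ApplicativeDo desugars it with <*>.
pairwiseSums :: ZipList Int
pairwiseSums = do
  x <- ZipList [1, 2, 3]
  y <- ZipList [10, 20, 30]
  pure (x + y) -- the final statement must be a plain `pure` or `return`
```

Without the pragma, GHC would ask for a `Monad ZipList` instance and reject the block.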
Consider our example fragment of code:
length (intersect (friendsOf x) (friendsOf y))
What if this was executed as two fetches?
And what if x == y?
If the two queries returned different results, we’d get bogus behaviour
For this application, caching matters for performance and correctness
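A minimal sketch of the idea: remember the first answer to each request, so every later use within the same run sees identical data. The association-list cache and the name `cachedFriendsOf` are assumptions made for illustration; this is not how Haxl's cache is implemented.

```haskell
import Data.IORef

type UserID = Int

-- Hypothetical per-run cache: the first answer seen for each user.
type Cache = IORef [(UserID, [UserID])]

-- Wrap a raw fetch so that repeated requests for the same user are
-- answered from the cache instead of hitting the backend again.
cachedFriendsOf :: Cache -> (UserID -> IO [UserID]) -> UserID -> IO [UserID]
cachedFriendsOf cache fetchRaw uid = do
  seen <- readIORef cache
  case lookup uid seen of
    Just friends -> pure friends -- same answer as the first time
    Nothing -> do
      friends <- fetchRaw uid
      modifyIORef' cache ((uid, friends) :)
      pure friends
```

Even if the backend changes between two calls, both `friendsOf x` uses inside one rule evaluation observe the same list, which is the consistency property the rule relies on.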
Several “core data” systems at Facebook
Memcache
TAO
MySQL
Other data sources also involved in Sigma
Haxl core doesn’t know about data sources
Data sources are hot-pluggable (!)
A data source interacts with Haxl core in 3 ways
-- Core abstraction
class DataSource req where
  {- ... -}
-- An example data source
data ExampleReq a where
  CountAardvarks :: String -> ExampleReq Int
  ListWombats    :: Id -> ExampleReq [Id]
  deriving Typeable
-- Fetch data generically
dataFetch :: DataSource req => req a -> Haxl a
Haxl core has to manage requests submitted with dataFetch
Once an entire round of fetching has stopped making progress, we retrieve the pending requests
class DataSource req where
  fetch :: [BlockedFetch req] -> IO ()
data BlockedFetch req =
  forall a . BlockedFetch (req a) (MVar a)
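As an illustration of what a data source does with a round of blocked fetches, here is a sketch that answers each request by filling its `MVar`. The answers are made up (there is no real aardvark service), `Int` stands in for the undefined `Id` type, and `fetchExample` is a plain function rather than a class method to keep the example small.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE GADTs #-}

import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- The request type from the slides, with Int standing in for Id.
data ExampleReq a where
  CountAardvarks :: String -> ExampleReq Int
  ListWombats    :: Int -> ExampleReq [Int]

-- A request paired with the place where its result must be written.
data BlockedFetch req = forall a. BlockedFetch (req a) (MVar a)

-- Answer a whole batch. Matching on the GADT constructor tells the
-- type checker which result type each MVar expects.
fetchExample :: [BlockedFetch ExampleReq] -> IO ()
fetchExample = mapM_ answer
  where
    answer :: BlockedFetch ExampleReq -> IO ()
    answer (BlockedFetch (CountAardvarks name) var) = putMVar var (length name) -- made-up answer
    answer (BlockedFetch (ListWombats n) var)       = putMVar var [1 .. n]      -- made-up answer
```

Because `fetchExample` receives the full list at once, a real data source is free to turn it into a single multi-get call instead of answering the requests one by one.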
Remember: Haxl core is agnostic to data sources
We use dynamic typing (Typeable) to manage them
-- hack to support parameterised types
class Eq1 req where
  eq1 :: req a -> req a -> Bool
class (Typeable1 req, Hashable1 req, Eq1 req) =>
      DataSource req where
  data DataState req
  fetch :: DataState req
        -> [BlockedFetch req]
        -> IO ()
DataState is an associated type
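The following sketch shows how an associated data type lets each data source carry its own state into `fetch`. It drops the `Typeable1`, `Hashable1` and `Eq1` superclasses for brevity, and the `ExampleState` constructor and its `aardvarksPerLetter` field are hypothetical.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE TypeFamilies #-}

import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

data BlockedFetch req = forall a. BlockedFetch (req a) (MVar a)

-- Simplified class: every instance chooses its own DataState.
class DataSource req where
  data DataState req
  fetch :: DataState req -> [BlockedFetch req] -> IO ()

data ExampleReq a where
  CountAardvarks :: String -> ExampleReq Int

instance DataSource ExampleReq where
  -- Hypothetical state; a real source might hold a connection handle here.
  data DataState ExampleReq = ExampleState { aardvarksPerLetter :: Int }
  fetch st = mapM_ answer
    where
      answer :: BlockedFetch ExampleReq -> IO ()
      answer (BlockedFetch (CountAardvarks name) var) =
        putMVar var (aardvarksPerLetter st * length name) -- made-up answer
```

The core can pass a `DataState req` around without knowing what is inside it, which fits the claim that Haxl core is agnostic to data sources.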
A lot of other demanding and intricate work went into making Haxl run effectively at scale.
As of June 2015, Haxl-powered Sigma was handling over a million requests per second.
For many more juicy details, see the article on code.facebook.com.
Code is open sourced on Hackage as haxl.
Haxl inspired Twitter’s Stitch project.
Facebook is famous for having been built in PHP.
As our code base grew to millions of lines in size:
PHP’s dynamic typing became increasingly troublesome
It became harder to understand and change code
“What type is this variable? I have no idea!”
We observed the same growing pains in JavaScript.
In between static and dynamic type systems lies the interesting middle ground of gradual types.
Some variables are given explicit types, which can be checked statically
Others have types checked at runtime
Languages that start all-dynamic can acquire gradual typing
…as can languages that start all-static!
At Facebook, we developed Hack: hacklang.org
Addresses the productivity and reliability problems of PHP
(lots of PHP!)
A gradually typed language
Add type annotations to existing code incrementally
IDE support: type-sensitive autocomplete, live typechecking and errors
Well over 90% of Facebook PHP code is now statically typed via Hack.
Some code is still too tricky to statically type.
PHP programmers rightly value rapid feedback.
With Hack, this short cycle time had to be preserved.
Implications are huge:
Type checking algorithms must be fast and mostly incremental
A persistent daemon maintains type system state
Filesystem change notifications lead to “instant” type checking
Typical typechecking time is 200ms (outliers up to 1000ms) for millions of lines of code
The Hack type checker is built in OCaml.
OCaml had, until recently, zero support for multiple CPUs.
But multiple CPUs can speed up type checking, and we care greatly about speed …
Following the success of Hack at Facebook, we started work on the same problem space of types in JavaScript.
The result is the Flow type checker: flowtype.org
Similar ergonomics:
Heavily based on automatic inference
Fast incremental checks
TypeScript
Pessimistically assumes most JS code is statically untypeable
Only typechecks code that is explicitly annotated
Designed mainly for IDE tooling; unsound whenever convenient
Flow
Optimistically assumes that most JS is statically typable
Uses type inference to fill in missing annotations
Designed to find errors; takes soundness far more seriously
Somewhat similar ideas, different tradeoffs
Pros for TS:
Really slick Visual Studio integration
Support for typing of third-party JS (but all those .ts files are a pain)
One-stop type checking and transformation
Pros for Flow:
Path-sensitive type narrowing
Better type checking of idiomatic JS
Really fast
Gets non-nullability right (new TS support is unsound)