Jin Shieh and Eamonn Keogh, iSAX: Indexing and Mining Terabyte Sized Time Series, here.
Welcome to the iSAX Page, here.
Welcome to the SAX ( Symbolic Aggregate approXimation) Homepage, here.
TieOut: I would almost like to frame this string matching as finding the pattern strings that do not match the reference text even under approximate matching, In some sense, that is one of the more interesting things to search for e.g., do I have a 100mm swap position that is not a new trade and is not in the reference portfolio. Similarly does the reference portfolio have positions that are unaccounted for in the pattern? Sort of the dual problem to what is described below.
Consider the problem of comparing a new analytic implementation of a pricing/risk algorithm for a security to an existing production implementation. Typically, a representative portfolio of similar securities are selected and “priced” with the new and production implementations and the outputs are compared. If the difference between the old and new computed “prices” are sufficiently small under a variety of different security portfolios, security indicative data, market data, and possibly even econometric data then the new implementation is designated as being “tied out” with the existing production implementation. After “Tieout” the new implementation may be promoted to a production implementation. Pricing and Risk processes monitoring trading and banking over Firm balance sheets of up to several trillion dollars rely on this Tieout process. This note presents a theoretical sequence matching framework, similar to conventional Boyer-Moore, Knuth-Morris-Pratt, and string or regular expression matching, for the automation of the analytical Tieout process for Pricing/Risk algorithms used in Financial Services Firms and Banks.
Recall the classical String Matching setup: we are given a set of symbols in an alphabet, a pattern, and a text where the pattern and text are sequences drawn from the symbols of the alphabet. The String Matching problem is to find the index within the text sequence of a sequence of alphabet symbols ”equal” to the pattern sequence. String Matching can be generalized in several ways: approximate string matching, regular expression matching, or off-line string matching come to mind. Tieout is yet another generalization of standard String Matching applied to security valuation and risk analytics. The key notion is that determining if test runs of a new analytic pricing implementation (pattern) is tied out to a given finite sequence of production valuations (text) is equivalent to a special formulation of String Matching.
Typical applications of String Matching are tools like grep, egrep, fgrep, or diff in text processing or program development. String matching is studied in undergraduate algorithm courses to derive runtime complexity bounds based on the cardinality of the alphabet and lengths of the pattern and text sequences. In the generalization proposed in this note, we are not likely to be very sensitive to the conventional String Matching runtime complexity because the Alphabet and a comparison function between Alphabet elements that we propose to develop are very different from the standard ASCII keyboard alphabet and the comparison function between two ASCII characters. The String Matching framework is useful for comparing a sequence of outputs from production and test analytics runs on a nominal and expected basis. The given output of the production analytics is the String Matching text. The output of the test analytics runs is the pattern. In a Tieout we want to determine two things:
- Nominal Matching: Can the given pattern be matched to the given text efficiently?
- Expected Matching: Assuming we know detailed information about the programs and inputs generating both the given text and the pattern how likely is it that, with different inputs, the expected patterns we may generate will match the given text?
In a simple Tieout String Matching formulation the Alphabet is a finite set of vectors composed of three elements: dates, securities, and prices; the text is a finite sequence of alphabet symbols indexed by the variable j:
(date1, securityA, price1), (date2, securityA, price2), (date3, securityA, price3), …
(datej, securityA, pricej);
and the pattern is a finite sequence of vectors indexed by the variable i:
(newdate1, securityB, newprice1), (newdate2, securityB, newprice2), …
(newdatei, securityB, newpricei).
The pattern Nominally Matches the text at index m < j if:
datem = newdate1,
securityA= securityB, and
abs(pricem – newprice1) < error tolerance constant
datem+1 = newdate2,
securityA= securityB, and
abs(pricem+1 – newprice2) < error tolerance constant
datem+i-1 = newdatei,
securityA= securityB, and
abs(pricem+i-1 – newpricei) < error tolerance constant.
The length of the Nominal Match is the length of the pattern found to have a Nominal Match with the text.
If we assume we are generalizing to an off-line algorithm, where we can preprocess the “text”, locating the matching index does not look very complex, even with the possibility of duplicated date timestamps and securities.
If price sequences price1, price2, …pricej and newprice1, newprice2, … newpricei are model outputs then there is a bunch of implicit information being used to compute the price quantities. The type of security is determining a quant model and the date determines: the as-of-date for the price, the market data for the model to calculate the price, the model parameter calibration, and the model implementation version among other things.
There are obvious extensions of this formulation replacing Securities with Portfolios (the scalar pricej becomes a vector of security prices in the portfolio). Moreover the model outputs could be extended from simply MTM on an as-of-date to include implied spreads, risk, and P&L explanatories. The nominal Tieout matching works the way you would expect.
Additionally, beyond Nominal Matching, we can leverage information about the programs generating the outputs to derive the Expected Matching probabilities. By knowing something about the quantitative modeling, the market data, the market events, the security indicatives, and the code implementing the analytics you are in a position to automatically quantify how likely a Nominal Match can be expected to be found with new program inputs.
For example, if a user of Tieout has a text portfolio of 1 million positions that have been on the books for years, marked to market though an analytic model daily, and nominally matched on a “pattern” of a single position valuation on one date – the Expected Match value is close to zero. Given the limited sample size of the pattern there is possibly little evidence that new inputs will result in Nominal Matches, On the other hand, if the text contains a single position and pattern is a million valuations of the position on different value dates the Expected Match value might be closer to 1. How close to 1, depends on the distribution of the market data and program inputs as well as the implementation of the analytics for the million days of valuations with respect to the entire space of valuation inputs.
If the Tieout pattern and text are identically equal then the Expected Matching probability is 1 (for a given fixed text). The new analytics generates the same output as the production analytics for all known executions on some collection of computers. Perhaps the Expected Matching probability in this case should simply be approaching one since we do not know that the new analytics run on “any” machine will Nominally Match the text. Different machines could give different results running the same code. If we formulate the problem so that the text sequence can be extended with new inputs then we only know that the Expected Matching probability grows monotonically with the length of the Nominal Match.