{"title": "Self-attention with Functional Time Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15915, "page_last": 15925, "abstract": "Sequential modelling with self-attention has achieved cutting edge performances \nin natural language processing. With advantages in model flexibility, computation complexity and interpretability, self-attention is gradually becoming a key component in event sequence models. However, like most other sequence models, self-attention does not account for the time span between events and thus captures sequential signals rather than temporal patterns. \nWithout relying on recurrent network structures, self-attention recognizes event orderings via positional encoding. To bridge the gap between modelling time-independent and time-dependent event sequence, we introduce a functional feature map that embeds time span into high-dimensional spaces. By constructing the associated translation-invariant time kernel function, we reveal the functional forms of the feature map under classic functional function analysis results, namely Bochner's Theorem and Mercer's Theorem. We propose several models to learn the functional time representation and the interactions with event representation. These methods are evaluated on real-world datasets under various continuous-time event sequence prediction tasks. The experiments reveal that the proposed methods compare favorably to baseline models while also capture useful time-event interactions.", "full_text": "Self-attention with Functional Time Representation\n\nLearning\n\nDa Xu\u21e4, Chuanwei Ruan\u21e4, Sushant Kumar , Evren Korpeoglu , Kannan Achan\n\n{Da.Xu,Chuanwei.Ruan,EKorpeoglu,SKumar4,KAchan}@walmartlabs.com\n\nWalmart Labs\n\nCalifornia, CA 94086\n\nAbstract\n\nSequential modelling with self-attention has achieved cutting edge performances\nin natural language processing. 
With advantages in model flexibility, computational complexity and interpretability, self-attention is gradually becoming a key component in event sequence models. However, like most other sequence models, self-attention does not account for the time span between events and thus captures sequential signals rather than temporal patterns. Without relying on recurrent network structures, self-attention recognizes event orderings via positional encoding. To bridge the gap between modelling time-independent and time-dependent event sequences, we introduce a functional feature map that embeds time spans into a high-dimensional space. By constructing the associated translation-invariant time kernel function, we reveal the functional forms of the feature map under classic functional analysis results, namely Bochner's Theorem and Mercer's Theorem. We propose several models to learn the functional time representation and its interactions with event representations. These methods are evaluated on real-world datasets under various continuous-time event sequence prediction tasks. The experiments reveal that the proposed methods compare favorably to baseline models while also capturing useful time-event interactions.

1 Introduction

The attention mechanism, which assumes that the output of an event sequence is relevant to only part of the sequential input, is fast becoming an essential instrument for various machine learning tasks such as neural translation [1], image caption generation [25] and speech recognition [4]. It works by capturing the importance weights of the sequential inputs successively and is often used as an add-on component to base models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) [3]. Recently, a seq-to-seq model that relies only on an attention module called 'self-attention' achieved state-of-the-art performance in neural translation [20].
It detects attention weights from the input event sequence and returns the sequence representation. Without relying on recurrent network structures, self-attention offers an appealing computational advantage, since sequence processing can be fully parallelized. Key to the original self-attention module is positional encoding, which maps the discrete position indices {1, …, l} to vectors in ℝᵈ and can be either fixed or jointly optimized as free parameters. Positional encoding allows self-attention to recognize ordering information. However, it also restricts the model to time-independent or discrete-time event sequence modelling, where the difference in ordered positions measures the distance between event occurrences. In continuous-time event sequences, the time span between events often has significant implications for their relative importance in prediction. Since events take place aperiodically, there are gaps between sequential patterns and temporal patterns. For example, in user online behaviour analysis, the dwell time often indicates the degree of interest in a web page, while sequential information considers only the ordering of past browsing. Also, detecting interactions between temporal and event contexts is an increasingly important topic in user behaviour modelling [12]. In online shopping, transactions usually indicate long-term interests, while views are often short-term. Future recommendations should therefore depend on both the event contexts and the timestamps of event occurrences. To effectively encode the event contexts and feed them to self-attention models, the discrete events are often embedded into a continuous vector space [2]. After training, the inner products of their vector representations often reflect relationships such as similarity.

*The first two authors contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
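For reference, the fixed sinusoidal positional encoding of [20], which the proposed time embedding replaces, can be sketched in a few lines. This is a minimal numpy version under the usual conventions (even dimensions carry the sine, odd dimensions the cosine; the exact interleaving varies across implementations):

```python
import numpy as np

def positional_encoding(max_len: int, d: int) -> np.ndarray:
    """Fixed sinusoidal encoding: one d-dimensional vector per position index."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1) position indices
    i = np.arange(d // 2)[None, :]                 # (1, d/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / d)    # (max_len, d/2) angle table
    enc = np.empty((max_len, d))
    enc[:, 0::2] = np.sin(angles)                  # even dims: sine
    enc[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return enc

P = positional_encoding(50, 16)                    # 50 positions, 16-dim encoding
```

Each row is shared across all sequences and added (or concatenated) to the event embedding at that position, which is exactly the role the functional time map Φ(t) takes over for continuous time.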
In ordinary self-attention, the event embeddings are often added to the positional encodings to form event-position representations [20]. It is therefore natural to replace positional encoding with some functional mapping that embeds time into a vector space.

However, unlike positional encoding, where representations are needed for only a finite number of indices, time span is a continuous variable. The challenges of embedding time are threefold. Firstly, a suitable functional form that takes time span as input needs to be identified. Secondly, the functional form must be properly parameterized so that it can be jointly optimized as part of the model. Finally, the embedded time representation should respect the properties of time itself. To be specific, relative time differences play a far more critical role than absolute timestamps, for both interpolation and extrapolation purposes in sequence modelling. Therefore, the relative positions of two time representations in the embedding space should reflect their temporal difference. The contributions of our paper are summarized below:

• We propose the translation-invariant time kernel, which motivates several functional forms of time feature maps justified by classic functional analysis results, namely Bochner's Theorem [13] and Mercer's Theorem [15]. Compared with other heuristic-driven time-to-vector methods, our proposals come with solid theoretical justifications and guarantees.

• We develop feasible time embeddings according to the time feature maps such that they are properly parameterized and compatible with self-attention.
We further discuss the interpretations of the proposed time embeddings and how to model their interactions with event representations under self-attention.

• We evaluate the proposed methods qualitatively and quantitatively and compare them with several baseline methods in various event prediction tasks on several datasets (two of which are public). We specifically compare with RNNs and with self-attention using positional encoding, to demonstrate the superiority of the proposed approach for continuous-time event sequence modelling. Several case studies are provided to show the time-event interactions captured by our model.

2 Related Work

The original self-attention uses dot-product attention [20], defined via:

Attn(Q, K, V) = softmax(QKᵀ / √d) V,   (1)

where Q denotes the queries, K denotes the keys and V denotes the values (representations) of the events in the sequence. The self-attention mechanism relies on positional encoding to recognize and capture sequential information: the vector representation of each position, which is shared across all sequences, is added or concatenated to the corresponding event embedding. The above Q, K and V matrices are often linear (or identity) projections of the combined event-position representations. Attention patterns are detected through the inner products of query-key pairs and propagate to the output as the weights for combining event values. Several variants of self-attention have been developed for different use cases, including online recommendation [10], where sequence representations are often given by the attention-weighted sum of event embeddings.

To deal with continuous-time input in RNNs, a time-LSTM model with modified gate structures was proposed [27]. Classic temporal point processes also allow the use of inter-event time intervals as continuous random variables when modelling sequential observations [26].
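Concretely, the dot-product attention of Eq. (1) amounts to a few lines of matrix algebra. The following numpy sketch (single head, no masking or projections) shows the computation end to end:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, as in Eq. (1)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (q, k) pairwise query-key scores
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V, weights                    # weighted values, attention weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))                        # 4 queries, dimension 8
K = rng.normal(size=(6, 8))                        # 6 keys
V = rng.normal(size=(6, 8))                        # 6 values
out, w = attention(Q, K, V)                        # out: (4, 8); w rows sum to 1
```

Each row of `w` is the attention distribution of one query over the events, which is what the later case studies visualize as functions of time.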
Several methods have been proposed to couple point processes with RNNs to take temporal information into account [23, 22, 14, 6]. In these works, however, the inter-event time intervals are directly appended to the hidden event representations as inputs to the RNN. A recent work proposes a time-aware RNN with time encoding [12].

The functional time embeddings proposed in our work have sound theoretical justifications and interpretations. Also, by replacing positional encoding with time embedding we inherit the advantages of self-attention, such as computational efficiency and model interpretability. Although in this paper we do not discuss how to adapt the functional time representation to other settings, the proposed approach can be viewed as a general time embedding technique.

3 Preliminaries

Embedding time from an interval (supposed to start from the origin) T = [0, t_max] to ℝᵈ is equivalent to finding a mapping Φ: T → ℝᵈ. Time embeddings can be added or concatenated to the event embeddings Z, where Zᵢ ∈ ℝ^{d_E} gives the vector representation of event eᵢ, i = 1, …, V, for a total of V events. The intuition is that, upon concatenation of the event and time representations, the dot product between two time-dependent events (e₁, t₁) and (e₂, t₂) becomes

[Z₁, Φ(t₁)]ᵀ[Z₂, Φ(t₂)] = ⟨Z₁, Z₂⟩ + ⟨Φ(t₁), Φ(t₂)⟩.

Since ⟨Z₁, Z₂⟩ represents the relationship between events, we expect ⟨Φ(t₁), Φ(t₂)⟩ to capture temporal patterns, especially those related to the temporal difference t₁ − t₂, as discussed before. This suggests formulating temporal patterns with a translation-invariant kernel K, with Φ as the feature map associated with K.

Let the kernel be K: T × T → ℝ, where K(t₁, t₂) := ⟨Φ(t₁), Φ(t₂)⟩ and K(t₁, t₂) = ψ(t₁ − t₂), ∀ t₁, t₂ ∈ T, for some ψ: [−t_max, t_max] → ℝ.
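As a sanity check, the decomposition of the concatenated dot product into an event term and a time-kernel term, which motivates the kernel view above, can be verified numerically (vectors randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Z1, Z2 = rng.normal(size=16), rng.normal(size=16)   # event embeddings Z_1, Z_2
p1, p2 = rng.normal(size=8), rng.normal(size=8)     # time features Phi(t_1), Phi(t_2)

# Dot product of the concatenated event-time representations...
lhs = np.concatenate([Z1, p1]) @ np.concatenate([Z2, p2])
# ...splits exactly into the event similarity plus the time kernel value.
rhs = Z1 @ Z2 + p1 @ p2
assert np.allclose(lhs, rhs)
```

This is why learning Φ amounts to learning the translation-invariant kernel K: the time contribution enters the attention scores only through ⟨Φ(t₁), Φ(t₂)⟩.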
Here the feature map Φ captures how the kernel function embeds the original data into a higher-dimensional space, so introducing the time kernel function is in accordance with our original goal. Notice that the kernel function K is positive semidefinite (PSD), since we have expressed it with a Gram matrix. Without loss of generality we assume that Φ is continuous, which implies that K is translation-invariant, PSD and continuous. The task of learning temporal patterns is thus converted into a kernel learning problem with Φ as the feature map. Also, the interactions between event embeddings and time can now be recognized via some further mapping (Z, Φ(t)) ↦ f(Z, Φ(t)), which we discuss in Section 6. By relating time embedding to kernel function learning, we hope to identify Φ with functional forms that are compatible with current deep learning frameworks, so that computation via back-propagation remains feasible. Classic functional analysis theory provides key insights for identifying candidate functional forms of Φ. We first state Bochner's Theorem and Mercer's Theorem and briefly discuss their implications.

Theorem 1 (Bochner's Theorem). A continuous, translation-invariant kernel K(x, y) = ψ(x − y) on ℝᵈ is positive definite if and only if there exists a non-negative measure on ℝ such that ψ is the Fourier transform of the measure.

The implication of Bochner's Theorem is that, when scaled properly, we can express K as:

K(t₁, t₂) = ψ(t₁ − t₂) = ∫_ℝ e^{iω(t₁−t₂)} p(ω) dω = E_ω[ξ_ω(t₁) ξ_ω(t₂)*],   (2)

where ξ_ω(t) = e^{iωt}. Since the kernel K and the probability measure p(ω) are real, we extract the real part of (2) and obtain:

K(t₁, t₂) = E_ω[cos(ω(t₁ − t₂))] = E_ω[cos(ωt₁) cos(ωt₂) + sin(ωt₁) sin(ωt₂)].   (3)

With this alternative expression of the kernel K, the expectation can be approximated by Monte Carlo integration [17]. Suppose we have d samples ω₁, …
, ω_d drawn from p(ω); an estimate of the kernel K(t₁, t₂) can then be constructed as (1/d) Σᵢ₌₁ᵈ [cos(ωᵢt₁) cos(ωᵢt₂) + sin(ωᵢt₁) sin(ωᵢt₂)]. As a consequence, Bochner's Theorem motivates the finite-dimensional feature map to ℝ²ᵈ via:

t ↦ Φ_d^B(t) := √(1/d) [cos(ω₁t), sin(ω₁t), …, cos(ω_d t), sin(ω_d t)],

such that K(t₁, t₂) ≈ ⟨Φ_d^B(t₁), Φ_d^B(t₂)⟩ as d → ∞.

So far we have obtained a specific functional form for Φ, which is essentially a random projection onto the high-dimensional vector space of i.i.d. random variables with density given by p(ω), where each coordinate is then transformed by trigonometric functions. However, it is not clear how to

| Feature maps specified by [φ_{2i}(t), φ_{2i+1}(t)] | Origin | Parameters | Interpretation of ω |
|---|---|---|---|
| [cos(ωᵢ(μ)t), sin(ωᵢ(μ)t)] | Bochner's | μ: location-scale parameters specified for the reparameterization trick. | ωᵢ(μ): converts the i-th sample (drawn from the auxiliary distribution) to the target distribution under location-scale parameter μ. |
| [cos(g_θ(ωᵢ)t), sin(g_θ(ωᵢ)t)] | Bochner's | θ: parameters of the inverse CDF F⁻¹ = g_θ. | ωᵢ: the i-th sample drawn from the auxiliary distribution. |
| [cos(ω̃ᵢt), sin(ω̃ᵢt)] | Bochner's | {ω̃ᵢ}ᵢ₌₁ᵈ: transformed samples under the non-parametric inverse CDF transformation. | ω̃ᵢ: the i-th sample of the underlying distribution p(ω) in Bochner's Theorem. |
| [√c_{2i,k} cos(ω_j t), √c_{2i+1,k} sin(ω_j t)] | Mercer's | {c_{i,k}}ᵢ₌₁²ᵈ: the Fourier coefficients of the corresponding K_{ω_j}, for j = 1, …, k. | ω_j: the frequency for kernel function K_{ω_j} (can be parameters). |

Table 1: The proposed functional forms of the feature map Φ = […, φ_{2i}(t), φ_{2i+1}(t), …
] motivated by Bochner's and Mercer's Theorems, with explanations of the free parameters and the interpretation of ω.

sample from the unknown distribution of ω. Otherwise we would already have K according to the Fourier transformation in (2). Mercer's Theorem, on the other hand, motivates a deterministic approach.

Theorem 2 (Mercer's Theorem). Consider the function class L²(X, P), where X is compact. Suppose that the kernel function K is continuous and positive semidefinite, and satisfies ∫_{X×X} K²(x, z) dP(x) dP(z) < ∞; then there exists a sequence of eigenfunctions (φᵢ)ᵢ₌₁^∞ that form an orthonormal basis of L²(X, P), and an associated set of non-negative eigenvalues (cᵢ)ᵢ₌₁^∞, such that

K(x, z) = Σᵢ₌₁^∞ cᵢ φᵢ(x) φᵢ(z),   (4)

where the convergence of the infinite series holds absolutely and uniformly.

Mercer's Theorem provides intuition on how to embed instances from our functional domain T into the infinite sequence space ℓ²(ℕ). To be specific, the mapping can be defined via t ↦ Φ_M(t) := [√c₁ φ₁(t), √c₂ φ₂(t), …], and Mercer's Theorem guarantees the convergence ⟨Φ_M(t₁), Φ_M(t₂)⟩ → K(t₁, t₂).

The two theorems provide critical insight into the functional forms of the feature map Φ. However, neither is directly applicable. For the feature map motivated by Bochner's Theorem, let alone the infeasibility of sampling from the unknown p(ω), the use of Monte Carlo estimation brings other uncertainties, i.e. how many samples are needed for a decent approximation. As for the feature map from Mercer's Theorem, first of all, it is infinite-dimensional; secondly, it does not possess a specific functional form without additional assumptions.
The solutions to the above challenges are discussed in the next two sections.

4 Bochner Time Embedding

A practical solution for learning the feature map suggested by Bochner's Theorem is to use the 'reparameterization trick' [11]. The reparameterization trick enables sampling from a distribution by using an auxiliary variable ε with a known independent marginal distribution p(ε). For 'location-scale' family distributions such as the Gaussian, if ω ∼ N(μ, σ²), then with the auxiliary random variable ε ∼ N(0, 1), ω can be reparameterized as μ + σε. Samples of ω are now transformed from samples of ε, and the free distribution parameters μ and σ can be optimized as part of the whole learning model. With the Gaussian distribution, the feature map Φ_d^B suggested by Bochner's Theorem can be effectively parameterized by μ and σ, which are the inputs to the functions ωᵢ(μ, σ) that transform the i-th sample of the auxiliary distribution into a sample of the target distribution (Table 1). A potential concern here is that the 'location-scale' family may not be rich enough to capture the complexity of temporal patterns under the Fourier transformation. Indeed, the Fourier transform of a Gaussian function of the form f(x) ∝ e^{−ax²} is another Gaussian function.

An alternative approach is to use an inverse cumulative distribution function (CDF) transformation. Let F⁻¹ be the inverse CDF of some probability distribution (if it exists); then for ε sampled from the uniform distribution, we can always use F⁻¹(ε) to generate samples of the desired distribution. This suggests parameterizing the inverse CDF as F⁻¹ ≡ g_θ(·) with some functional approximator such as a neural network, or with flow-based CDF estimation methods including normalizing flows [18] and RealNVP [5] (see the Appendix for more discussion).
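As an illustration of the reparameterization route, the sketch below (numpy, with illustrative function names) draws ω = μ + σε and checks the Monte Carlo kernel approximation against the one case with a closed form: for μ = 0 and ω Gaussian, E_ω[cos(ω(t₁ − t₂))] = exp(−σ²(t₁ − t₂)²/2):

```python
import numpy as np

def bochner_features(t, mu, sigma, eps):
    """Bochner feature map Phi_B_d(t) with omega = mu + sigma * eps.

    eps holds d fixed samples of the auxiliary N(0, 1); mu and sigma are the
    free location-scale parameters that would be learned by back-propagation.
    """
    omega = mu + sigma * eps
    return np.concatenate([np.cos(omega * t), np.sin(omega * t)]) / np.sqrt(len(eps))

rng = np.random.default_rng(0)
d, mu, sigma = 5000, 0.0, 1.0
eps = rng.standard_normal(d)          # auxiliary samples, held fixed
t1, t2 = 0.7, 0.2
approx = bochner_features(t1, mu, sigma, eps) @ bochner_features(t2, mu, sigma, eps)
exact = np.exp(-0.5 * sigma**2 * (t1 - t2) ** 2)   # closed-form Gaussian kernel
```

With d = 5000 samples the inner product lands within a few hundredths of the exact kernel value, consistent with the Monte Carlo rate behind Claim 1.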
As a matter of fact, if the samples are first drawn (following either transformation method) and held fixed during training, we can consider using non-parametric transformations. For {ωᵢ}ᵢ₌₁ᵈ sampled from the auxiliary distribution, let ω̃ᵢ = F⁻¹(ωᵢ), i = 1, …, d, for some non-parametric inverse CDF F⁻¹. Since the ωᵢ are fixed, learning F⁻¹ amounts to directly optimizing the transformed samples {ω̃ᵢ}ᵢ₌₁ᵈ as free parameters.

In short, Bochner's time feature maps can be realized with the reparameterization trick or with parametric/non-parametric inverse CDF transformations. We refer to these as Bochner time encoding. In Table 1 we summarize the functional forms of Bochner time encoding, with explanations of the free parameters and the meaning of ω. A sketched visual illustration is provided in the left panel of Table 2. Finally, we provide the theoretical justification that, with samples drawn from the corresponding distribution p(ω), the Monte Carlo approximation converges uniformly to the kernel function K with high probability. The upper bound stated in Claim 1 provides guidelines for the number of samples needed to achieve a good approximation.

Claim 1. Let p(ω) be the corresponding probability measure stated in Bochner's Theorem for kernel function K. Suppose the feature map Φ is constructed as described above using samples {ωᵢ}ᵢ₌₁ᵈ; then we have

Pr( sup_{t₁,t₂∈T} |Φ_d^B(t₁)ᵀ Φ_d^B(t₂) − K(t₁, t₂)| ≥ ε ) ≤ 4 σ_p √(t_max/ε) exp(−dε²/32),   (5)

where σ_p² is the second moment with respect to p(ω).

The proof is provided in the supplementary material.

Therefore, we can use d = Ω((1/ε²) log(σ_p² t_max/ε)) samples (on the order of hundreds when ε ≈ 0.1) from p(ω) to ensure sup_{t₁,t₂∈T} |Φ_d^B(t₁)ᵀ Φ_d^B(t₂) − K(t₁, t₂)| < ε with any desired probability.

5 Mercer Time Embedding
Mercer's Theorem solves the challenge of embedding time spans into a sequence space; however, the functional form of Φ is unknown and the space is infinite-dimensional. To deal with the first problem, we make an assumption on the periodic properties of K so as to meet the condition of Proposition 1, which yields a fairly straightforward formulation of the functional mapping Φ(·).

Proposition 1. For a kernel function K that is continuous, PSD and translation-invariant with K = ψ(t₁ − t₂), suppose ψ is an even periodic function with frequency ω, i.e. ψ(t) = ψ(−t) and ψ(t + 2k/ω) = ψ(t) for all t ∈ [−1/ω, 1/ω] and integers k ∈ ℤ; then the eigenfunctions of K are given by the Fourier basis.

The proof of Proposition 1 is provided in the supplementary material.

Notice that in our setting the kernel K is not necessarily periodic. Nonetheless, we may assume that the temporal patterns can be detected from a finite set of periodic kernels K_ω: T × T → ℝ, ω ∈ {ω₁, …, ω_k}, where each K_ω is a continuous, translation-invariant and PSD kernel further endowed with some frequency ω. In other words, we project the unknown kernel function K onto a set of periodic kernels that have the same properties as K.

According to Proposition 1, we immediately see that for each periodic kernel K_{ωᵢ} the eigenfunctions stated in Mercer's Theorem are given by φ₁(t) = 1, φ_{2j}(t) = cos(jπt/ωᵢ), φ_{2j+1}(t) = sin(jπt/ωᵢ) for j = 1, 2, …, with cⱼ, j = 1, 2, …, giving the corresponding Fourier coefficients. Therefore we have the infinite-dimensional Mercer feature map for each K_ω:

t ↦ Φ_M^ω(t) = [√c₁, …, √c_{2j} cos(jπt/ω), √c_{2j+1} sin(jπt/ω), …],

where we omit the dependence of the cⱼ on ω for notational simplicity.

One significant advantage of expressing K_ω by its Fourier series is that such series often have nice truncation properties, which allows us to use a truncated feature map without losing too much information.
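As a concrete sketch, the truncated map Φ_M^{ω,d} can be written in a few lines. This is an illustrative numpy version: the Fourier coefficients cⱼ are free parameters (here randomly initialized), and the cosine and sine terms are grouped rather than interleaved for brevity:

```python
import numpy as np

def mercer_features(t, omegas, coeffs, d):
    """Truncated Mercer time embedding.

    For each frequency omega the features are
    [sqrt(c_0), sqrt(c_1) cos(pi t / omega), ..., sqrt(c_d) cos(d pi t / omega),
     sqrt(c_{d+1}) sin(pi t / omega), ..., sqrt(c_{2d}) sin(d pi t / omega)],
    where the 2d + 1 coefficients per frequency are free parameters.
    """
    out = []
    for omega, c in zip(omegas, coeffs):
        j = np.arange(1, d + 1)
        basis = np.concatenate([[1.0],
                                np.cos(j * np.pi * t / omega),
                                np.sin(j * np.pi * t / omega)])
        out.append(np.sqrt(c) * basis)
    return np.concatenate(out)             # dimension k * (2d + 1)

rng = np.random.default_rng(0)
k, d = 3, 4                                # 3 frequencies, truncation degree 4
omegas = [0.5, 1.0, 2.0]
coeffs = [rng.uniform(0.1, 1.0, size=2 * d + 1) for _ in range(k)]
emb = mercer_features(0.8, omegas, coeffs, d)
```

Learning the kernels K_ω then amounts to optimizing `coeffs` (and optionally `omegas`) jointly with the rest of the model.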
It has been shown that under mild conditions the Fourier coefficients cⱼ decay exponentially to zero [21], and classic approximation theory guarantees a uniform convergence bound for truncated Fourier series [9] (see Appendix for discussion). As a consequence, we propose to use the truncated feature map Φ_M^{ω,d}(t), and the complete Mercer time embedding is given by:

t ↦ Φ_M^d(t) = [Φ_M^{ω₁,d}(t), …, Φ_M^{ω_k,d}(t)]ᵀ.   (6)

Mercer's feature map therefore embeds the periodic kernel functions into the high-dimensional space spanned by the truncated Fourier basis under each frequency. As for the unknown Fourier coefficients cⱼ, learning the kernel functions K_ω is equivalent in form to learning their corresponding coefficients. To avoid unnecessary complications, we treat the cⱼ as free parameters. Last but not least, we point out that the set of frequencies {ω₁, …, ω_k} specifying the periodic kernels should cover a broad range of bandwidths in order to capture various signals and achieve a good approximation. They can be either fixed or jointly optimized as free parameters. In our experiments both lead to similar performance if properly initialized, such as using a geometric sequence ωᵢ = ω_max − (ω_max − ω_min)^{i/k}, i = 1, …, k, to cover [ω_min, ω_max] with a focus on high-frequency regions. A sketched visual illustration is provided in the right panel of Table 2.

Table 2: Sketched visual illustration of the proposed Bochner (left panel) and Mercer (right panel) time embeddings, Φ_d^B(t) and Φ_M^{ω,d}(t), for a specific t = tᵢ with d = 3. In the right panel the scale of the sine and cosine waves decreases as their frequency grows, a common phenomenon for Fourier series.

6 Time-event Interaction

Learning the time-event interaction is crucial for continuous-time event sequence prediction.
After embedding time spans into finite-dimensional vector spaces, we can directly model interactions using the time and event embeddings. It is necessary to first project the time and event representations onto the same space. For an event sequence {(e₁, t₁), …, (e_q, t_q)} we concatenate the event and time representations into [Z, Z_T], where Z = [Z₁, …, Z_q] and Z_T = [Φ(t₁), …, Φ(t_q)], and project them into the query, key and value spaces. For instance, to consider only linear combinations of event and time representations in the query space, we can simply use Q = [Z, Z_T]W₀ + b₀. To capture non-linear relations hierarchically, we may use multilayer perceptrons (MLPs) with activation functions, such as

Q = ReLU([Z, Z_T]W₀ + b₀)W₁ + b₁,

where ReLU(·) is the rectified linear unit. Residual blocks can also be added to propagate useful lower-level information to the final output. When predicting the next time-dependent event (e_{q+1}, t_{q+1}), to take account of the time lag between each event in the input sequence and the target event, we let t̃ᵢ = t_{q+1} − tᵢ, i = 1, …, q, and use Φ(t̃ᵢ) as the time representations. This does not change the relative time differences between input events, i.e. t̃ᵢ − t̃ⱼ = tⱼ − tᵢ for i, j = 1, …, q, and the attention weights and predictions now become functions of the next occurrence time.

7 Experiments and Results

We evaluate the performance of the proposed time embedding methods with self-attention on several real-world datasets from various domains. The experiments aim at quantitatively evaluating the four time embedding methods and comparing them with baseline models.

7.1 Data Sets

• The Stack Overflow² dataset records users' histories of awarded badges on a question-answering website.
The task is to predict the next badge the user receives, as a classification task.

• MovieLens³ is a public dataset consisting of movie ratings, used for benchmarking recommendation algorithms [7]. The task is to predict the next movie the user rates, for recommendation.

• The Walmart.com dataset is obtained from Walmart's online e-commerce platform in the U.S.⁴ It contains session-based search, view, add-to-cart and transaction information, with timestamps for each action, from selected users. The task is to predict the next-viewed item, for recommendation. Details of all datasets are provided in the supplementary material.

Data preparation - For fair comparison with the baselines, on the MovieLens dataset we follow the same preprocessing steps as in [10]. For users who rated at least three movies, we use their second-to-last rating for validation and their last rated movie for testing. On the Stack Overflow dataset we use the same filtering procedures described in [12] and randomly split the users into training (80%), validation (10%) and test (10%) sets. On the Walmart.com dataset we filter out users with fewer than ten activities and products that interacted with fewer than five users. The training, validation and test data are split chronologically based on session start time.

7.2 Baselines and Model Configurations

We compare the proposed approach with an LSTM, the time-aware RNN model (TimeJoint) [12] and the recurrent marked temporal point process model (RMTPP) [6] on the Stack Overflow dataset. We point out that the two latter approaches also utilize time information.
For the above three models, we reuse the optimal model configurations and the metric (classification accuracy) reported in [12] for the same Stack Overflow dataset.

For the recommendation task on the MovieLens dataset, we choose the seminal session-based RNN recommendation model (GRU4Rec) [8], the convolutional sequence embedding method (Caser) [19] and the translation-based recommendation model (TransRec) [10] as baselines. These position-aware sequential models have been shown to achieve cutting-edge performance on the same MovieLens dataset [10]. We also reuse the metrics, top-K hit rate (Hit@K) and normalized discounted cumulative gain (NDCG@K), as well as the optimal model configurations reported in [10].

On the Walmart.com dataset, in addition to GRU4Rec and TransRec, we compare with an attention-based RNN model (RNN+attn). The hyper-parameters of the baselines are tuned for optimal performance according to the Hit@10 metric on the validation set. The outcomes are provided in Table 3.

As for the proposed time embedding methods, we experimented with Bochner time embedding using the reparameterization trick with a normal distribution (Bochner Normal); the parametric inverse CDF transformation (Bochner Inv CDF) with an MLP, an MLP + residual block, masked autoregressive flow (MAF) [16] and non-volume-preserving transformations (NVP) [5]; the non-parametric inverse CDF transformation (Bochner Non-para); as well as the Mercer time embedding. For the purpose of ablation study, we compare with the original positional-encoding self-attention (PosEnc) on all tasks (Table 3). We use d = 100 for both Bochner and Mercer time embeddings; a sensitivity analysis on the time embedding dimension is provided in the appendix.

²https://archive.org/details/stackexchange
³https://grouplens.org/datasets/movielens/1m/
⁴https://www.walmart.com
We treat the dimension of the Fourier basis k for Mercer time embedding as a hyper-parameter, selected from {1, 5, 10, 15, 20, 25, 30} according to validation Hit@10 as well. When reporting the results in Table 3, we mark the model configuration that leads to the optimal validation performance for each of our time embedding methods. Other configurations and training details are provided in the appendix.

7.3 Experimental Results

Stack Overflow:

| | LSTM | TimeJoint | RMTPP | PosEnc | Bochner Normal | Bochner Inv CDF | Bochner Non-para | Mercer |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 46.03(.21) | 46.30(.23) | 46.23(.24) | 44.03(.33) | 44.89(.46) | 44.67(.38) | 46.27(.29) | 46.83(.20) |
| config | - | - | - | - | - | NVP | - | k = 10 |

MovieLens-1m:

| | GRU4Rec | Caser | TransRec | PosEnc | Bochner Normal | Bochner Inv CDF | Bochner Non-para | Mercer |
|---|---|---|---|---|---|---|---|---|
| Hit@10 | 75.01(.25) | 78.86(.22) | 64.15(.27) | 82.45(.31) | 81.60(.69) | 82.52(.36) | 82.86(.22) | 82.92(.17) |
| NDCG@10 | 55.13(.14) | 55.38(.15) | 39.72(.16) | 59.05(.14) | 59.47(.56) | 60.80(.47) | 60.83(.15) | 61.67(.11) |
| config | - | - | - | - | - | MAF | - | k = 5 |

Walmart.com data:

| | GRU4Rec | RNN+attn | TransRec | PosEnc | Bochner Normal | Bochner Inv CDF | Bochner Non-para | Mercer |
|---|---|---|---|---|---|---|---|---|
| Hit@5 | 4.12(.19) | 5.90(.17) | 7.03(.15) | 8.63(.16) | 4.27(.91) | 9.04(.31) | 9.25(.15) | 10.92(.13) |
| NDCG@5 | 4.03(.20) | 4.66(.17) | 5.62(.17) | 6.92(.14) | 4.06(.94) | 7.27(.26) | 7.34(.12) | 8.90(.11) |
| Hit@10 | 6.71(.50) | 9.03(.44) | 10.38(.41) | 12.49(.38) | 7.66(.92) | 12.77(.65) | 13.16(.41) | 14.94(.31) |
| NDCG@10 | 4.97(.31) | 7.36(.26) | 8.72(.26) | 10.84(.26) | 6.02(.99) | 10.95(.74) | 11.36(.27) | 12.81(.22) |
| config | - | - | - | - | - | MAF | - | k = 25 |

Table 3: Performance metrics for the proposed approach and the baseline models. All results are converted to percentages by multiplying by 100, and standard deviations computed over ten runs are given in parentheses.
The proposed methods and the best outcomes are highlighted in bold font. The config rows give the optimal model configuration for Bochner Inv CDF (among MLP, MLP + residual block, MAF and NVP as the CDF learning method) and for Mercer (among k = 1, 5, ..., 30).

We observe in Table 3 that the proposed time embeddings with self-attention compare favorably to the baseline models on all three datasets. On the Stack Overflow and Walmart.com datasets, the Mercer time embedding achieves the best performance, and on the MovieLens dataset Bochner Non-para outperforms the remaining methods. The results suggest the effectiveness of functional time representation, and the comparison with positional encoding suggests that time embeddings are more suitable for continuous-time event sequence modelling. On the other hand, Bochner Normal and Bochner Inv CDF have higher variances, which might be caused by the sampling steps they require during training. Otherwise, Bochner Inv CDF performs comparably to Bochner Non-para across all three datasets. In general, we observe better performances from the Bochner Non-para and Mercer time embeddings. Specifically, with the tuned Fourier basis degree k, Mercer's method consistently outperforms the others across all tasks. While d, the dimension of the time embedding, controls how well the bandwidth [ωmin, ωmax] is covered, k controls the degrees of freedom of the Fourier basis under each frequency. When d is fixed, a larger k may lead to overfitting of the time kernels under certain frequencies, which is confirmed by the sensitivity analysis on k provided in Figure 1b.

In Figure 2, we visualize the average attention weights across the whole population as functions of time and user action or product department on the Walmart.com dataset, to demonstrate some of the useful temporal patterns captured by the Mercer time embedding.
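To make the trade-off between d (number of frequencies) and k (Fourier basis degree per frequency) concrete, the following sketch shows a Mercer-style time embedding: each of d frequencies covering [ωmin, ωmax] contributes a truncated Fourier basis of degree k weighted by non-negative coefficients. This is an illustrative sketch under our own naming, with the frequency spacing and coefficients fixed; in the actual method the coefficients are learned.

```python
import numpy as np

def mercer_time_embedding(t, freqs, coef, k):
    """Mercer-style time embedding (illustrative sketch, not the paper's code).

    For each frequency omega in `freqs`, the periodic time kernel is expanded
    in a truncated Fourier basis of degree k; `coef` holds the non-negative
    Fourier coefficients, one row of length 2k + 1 per frequency (learnable
    in the actual model, fixed here).
    """
    t = np.atleast_1d(t)
    feats = []
    for omega, c in zip(freqs, coef):
        basis = [np.ones_like(t)]                      # constant (intercept) term
        for j in range(1, k + 1):
            basis.append(np.cos(j * np.pi * t / omega))
            basis.append(np.sin(j * np.pi * t / omega))
        feats.append(np.sqrt(c)[None, :] * np.stack(basis, axis=-1))
    return np.concatenate(feats, axis=-1)              # (len(t), d * (2k + 1))

d, k = 4, 2
freqs = np.geomspace(0.1, 100.0, d)                    # cover [omega_min, omega_max]
coef = np.ones((d, 2 * k + 1))                         # would be learned in practice
z = mercer_time_embedding(3.0, freqs, coef, k)
```

Each frequency contributes 2k + 1 features, so the total dimension is d(2k + 1) and raising k increases the degrees of freedom per frequency; dropping the intercept term and setting k = 1 recovers a per-frequency [cos, sin] map of the Bochner Non-para form.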
For instance, Figure 2a shows that when recommending the next product, the model learns to put higher attention weights on the last searched products over time. Similarly, the patterns in Figure 2b indicate that the model captures the signal that customers often have prolonged or recurrent attention on baby products compared with electronics and accessories. Interestingly, when predicting attention weights using future time points as input (Figure 2c), we see that the model predicts that users almost completely lose attention on their most recently purchased products (which is reasonable), and that after a more extended period none of the previously interacted products matters anymore.

Figure 1: (a) Results of Bochner Inv CDF on the MovieLens and Walmart.com datasets with different distributional learning methods. (b) Sensitivity analysis of Mercer time encoding on the MovieLens dataset, varying the degree of the Fourier basis k under different dimensions d.

(a) The temporal patterns in average attention weight decay on the last interacted product after different user actions, as time elapses. (b) The temporal patterns in average attention weight decay on the last viewed product from different departments, as time elapses. (c) The predicted future attention weight on the last interacted product as a function of time and different user actions.

Figure 2: Temporal patterns and time-event interactions captured by the time and event representations on the Walmart.com dataset.

Discussion. By employing state-of-the-art CDF learning methods, Bochner Inv CDF achieves better performance than positional encoding and the other baselines on the MovieLens and Walmart.com datasets (Figure 1a). This suggests the importance of having higher model complexity for learning p(ω)
in Bochner's Theorem, and also explains why Bochner Normal fails, since the normal distribution has limited capacity for capturing complicated distributional signals. On the other hand, Bochner Non-para is in fact a special case of Mercer's method with k = 1 and no intercept. While Bochner's methods originate from random feature sampling, Mercer's method is grounded in functional basis expansion. In practice, we may expect Mercer's method to give more stable performance since it does not rely on distributional learning and sampling. However, with advancements in Bayesian deep learning and probabilistic computation, we may also expect Bochner Inv CDF to work well with suitable distribution learning models, which we leave to future work.

8 Conclusion

We propose a set of time embedding methods for functional time representation learning, and demonstrate their effectiveness when used with self-attention in continuous-time event sequence prediction. The proposed methods come with sound theoretical justifications, and not only do they reveal temporal patterns, but they also capture time-event interactions. The proposed time embedding methods are thoroughly examined in experiments on real-world datasets, and we find that the Mercer time embedding and the Bochner time embedding with the non-parametric inverse CDF transformation give superior performance. We point out that the proposed methods extend to general time representation learning, and we will explore adapting them to other settings, such as temporal graph representation learning and reinforcement learning, in future work.

References

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[2] Y. Bengio, A. Courville, and P. Vincent.
Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

[3] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5659-5667, 2017.

[4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577-585, 2015.

[5] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[6] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555-1564. ACM, 2016.

[7] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1-19:19, Dec. 2015. ISSN 2160-6455. doi: 10.1145/2827872. URL http://doi.acm.org/10.1145/2827872.

[8] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

[9] D. Jackson. The Theory of Approximation, volume 11. American Mathematical Soc., 1930.

[10] W.-C. Kang and J. McAuley. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pages 197-206. IEEE, 2018.

[11] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[12] Y. Li, N. Du, and S. Bengio. Time-dependent representation for neural event sequence prediction. arXiv preprint arXiv:1708.00065, 2017.

[13] L. H.
Loomis. Introduction to Abstract Harmonic Analysis. Courier Corporation, 2013.

[14] H. Mei and J. M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6754-6764, 2017.

[15] J. Mercer. XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209(441-458):415-446, 1909.

[16] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338-2347, 2017.

[17] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2008.

[18] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[19] J. Tang and K. Wang. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 565-573. ACM, 2018.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.

[21] H. Widom. Asymptotic behavior of the eigenvalues of certain integral equations. II. Archive for Rational Mechanics and Analysis, 17(3):215-229, 1964.

[22] S. Xiao, J. Yan, M. Farajtabar, L. Song, X. Yang, and H. Zha. Joint modeling of event sequence and time series with attentional twin recurrent neural networks. arXiv preprint arXiv:1703.08524, 2017.

[23] S. Xiao, J. Yan, X. Yang, H. Zha, and S. M. Chu.
Modeling the intensity function of point process via recurrent neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[24] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan. Context-aware dual representation learning for complementary products recommendation. arXiv preprint arXiv:1904.12574v2, 2019.

[25] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

[26] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1513-1522. ACM, 2015.

[27] Y. Zhu, H. Li, Y. Liao, B. Wang, Z. Guan, H. Liu, and D. Cai. What to do next: Modeling user behaviors by Time-LSTM. In IJCAI, pages 3602-3608, 2017.