Practical Session #9 - Heap, Lempel-Ziv

Practical Session #9 - Heap, Lempel-Ziv Heap Heap Maximum- Heap Minimum- Heap Heap-Array A binary heap can be considered as a complete binary tree, (the last level is full from the left to a certain point). For each node x, x left(x) and x right(x) The root in a Maximum-Heap is the maximal element in the heap For each node x, x left(x) and x right(x) The root in a Minimum-Heap is the minimal element in the heap A heap can be represented using an array: 1. A[1] - the root of the heap (not A[0]) 2. length[a] - number of elements in the array 3. heap-size[a] - number of elements in the heap represented by the array A 4. For each index i : Parent(i) = Left(i) = 2i i 2 Right(i) = 2i+1 Actions on Heaps Insert ~ O(log(n)) Max ~ O(1) Extract-Max ~ O(log(n)) Build-Heap ~ O(n) Down-Heapify(H, i) same as MaxHeapify seen in class but starting in the node in index i ~ O(log(n)) Up-Heapify(H, i) Fixing the heap going from node in index I to the root as in Insert() ~ O(log(n))

Question 1 Given a maximum heap H, What is the time complexity of finding the 3 largest keys in H? What is the time complexity of finding the C largest keys in H? (C is a constant) What if C is not constant? Solution: It takes O(1) to find the 3 maximal keys in H. The 3 maximal keys in H are: The root The root's maximal son y The largest among y's maximal son and y's sibling Finding the C largest keys in H is done in a similar way: The i-th largest key is one of at most i keys, the sons of the i-1 largest key that have not been taken yet, thus can be found in O(i) time. Finding the C largest keys takes O(C 2 ) = O(1) (Another solution would be ordering all the keys in the first C levels of the heap) If C is not constant we have to be more efficient, We will use a new heap, where we will store nodes of the original heap so that for each node entered we will have it sons as put in this new heap but we also preserve access to its sons as in the original heap. We first enter the root to the new heap. Then C times we do Extract-Max() on the new heap while at each time we Insert the original sons (the sons as in the original heap) to the new heap. The logic is as before only this time since the new heap has at most C elements we get O(C*log(C)). (note that we do not change the original heap at any point) Question 2 Analyze the complexity of finding the i smallest elements in a set of size n using each of the following: Sorting Priority Queue Sorting Solution: Sort the numbers using algorithm that takes O(n log n) Find the i smallest elements in O(i) Total = O(i + n log n) = O(n log n) Priority Queue Solution: Build a minimum heap in O(n) Execute Extract-Min i times in O(i log n) Total = O(n + i log n). Using question 1 this can be reduced to O(n + i log i). Note: n + i log i = Θ(max(n, i log i)) and clearly n + i log n = O(n log n) (and faster if i Θ(n))

Question 3 In a given heap H that is represented using an array; suggest an algorithm for extracting and returning the i'th indexed element. Solution: Delete(H, i) if (heap_size(h) < i) error( heap underflow ) key H[i] H[i] H[heap_size] /*Deleting the i'th element and replacing it with the last element*/ heap_size heap_size 1 /*Effectively shrinking the heap*/ Down-Heapify(H, i) /*Subtree rooted in H[i] is legal heap*/ Up-Heapify(H, i) /*Area above H[i] is legal heap*/ return key Question 4 You are given two min-heaps H 1 and H 2 with n 1 and n 2 keys respectively. Each element of H 1 is smaller than each element of H 2. How can you merge H 1 and H 2 into one heap H (with n 1 + n 2 keys) using only O(n 2 ) time? (Assume both heaps are represented in arrays of size > n 1 + n 2 ) Solution n 1 n 2 : Add the keys of H 2 to the array of H 1 from index n 1 +1 to index n 1 + n 2 1. The resulting array represents a correct heap due to the following reasoning. Number of leaves in H1 is l 1, hence if l 1 is even the number of internal nodes is l 1-1 and there are 2l 1 "openings" for new leaves. l 1 + l 1 1 = n 1 2l 1 1 = n 1 2l 1 > n 1 n 2 If l 1 is odd then there are l 1 internal nodes and 2l 1 + 1 n 2 openings All keys from H 2 are added as leaves due to the fact stated in formula 2l 1 > n 2. As all keys in H 1 are smaller than any key in H 2, the heap property still will hold in the resulting array. n 1 < n 2 : Build a new heap with all the elements in H 1 and H 2 in O(n 1 + n 2 ) = O(2n 2 ) = O(n 2 ).

Question 5 Give an O(nlogk) time algorithm to merge k sorted arrays A 1.. A k into one sorted array. n is the total number of elements (you may assume k n). Solution We will use a min-heap of k triples of the form (d, i, A d [i]) for 1 d k, 1 i n. The heap will use an array B. 1. First we build the heap with the first elements lists A 1...A k in O(k) time. 2. In each step extract the minimal element of the heap and add it at the end of the output Array M. 3. If the array that the element mentioned in step two came from is not empty, than remove its minimum and add it to our heap. for d 1 to k B[d] (d, 1, A d [1]) Build-Heap(B) /*By the order of A d [1] */ for j 1 to n (d, i, x) Extract-Min(B) M[j] x if i < A d. length Heap-Insert(B,(d, i + 1, A d [i + 1])) Worst case time analysis: Build-Heap : O(k) Extract-Min : O(log k) Heap-Insert : O(log k) Total O(nlogk)

Question 6 (For practice at home) Suppose that you had to implement a priority queue that only needed to store items whose priorities were integer values between 0 and k 1. Describe how a priority queue could be implemented so that insert( ) has complexity O(1) and RemoveMin( ) has complexity O(k). Be sure to give the pseudo-code for both operations. Hint: This is very similar to a hash table. Solution Use an array A of size k of linked lists. The linked list in the ith slot stores entries with priority i. To insert an item into the priority queue with priority i, append it to the linked list in the ith slot. insert(v) A[v. priority].add-to-tail(v) To remove an item from the queue, scan the array of lists, starting at 0 until a non-empty list is found. Remove the first element and return it the for-loop induces the O(k) worst case complexity. E.g., removemin( ) for i 0 to k-1 List L A[i] if( L.empty() = false ) v L.remove-from-head( ) /*Removes list's head and returns it = dequeue*/ return v return NULL

אלגוריתם :Lempel-Ziv אלגוריתם לדחיסת נתונים. האלגוריתם מאפשר לשחזר את המידע הדחוס במלואו. משפחה של אלגוריתמים שמקורם בשני אלגוריתמים עיקריים שפותחו על ידי יעקב זיו ואברהם למפל בשנת 1977-8 יעקב זיו אברהם למפל GIF image אלגוריתמים ששייכים למשפחה זו של אלגוריתמים, מיושמים באפליקציות כגון ZIP ו -.compression 1 בתרגול זה נראה את אחת מהגרסאות של האלגוריתם שמקורו באלגוריתם משנת.LZ78-1978 הרעיון שעומד בבסיסו של הקידוד הוא: כל מילה בקידוד היא המילה הארוכה ביותר שנראתה עד כה בתוספת של אות אחת. על מנת לעשות חיפוש יעיל של המילים שכבר קודדו נממש מבנה נתונים עצי שנקרא Trie או Prefix tree )ראו דוגמה בהמשך(. העץ יהיה מעין מילון שנבנה בצורה דינמית ומייצג מילים שכבר קודדו בצורה יעילה. מבנה הנתונים הוא היררכי. לכל צומת יש אבא ורשימה של ילדים. בנוסף כל צומת צריכה לשמור שני שדות: אינדקס המסמן באיזה שלב הצומת הזאת נוצרה ותו שמחבר את הצומת הזאת לאבא שלו )מידע שמופיע על הצלעות בדוגמה המופיעה מטה(. צלעות שיוצאות מצומת מסוימת מייצגות אותיות שונות ששיכות לא"ב של הטקסט המקודד. השרשור של האותיות במסלול מהשורש לצומת מסוימת מייצגות מילים או רישאות של מילים שקודדו. לפניכם הפסאודו קוד של אלגוריתם הקידוד המקבל Text כקלט. מומלץ לעבור עליו במקביל עם הדוגמה המופיעה בהמשך. הפסאודו קוד של האלגוריתם: אינדקס= 0 1. צור רשימת זוגות ריקה. 2. צור שורש של העץ. סמן אותו בערכו של האינדקס. 3. כל עוד אורך ה > Text 0: 4. קדם את האינדקס ב- 1. 4.1. מצא את הרישא )prefix( הארוכה ביותר של Text שמופיעה בעץ, במסלול כלשהו מהשורש 4.2.. לאיזושהי צומת v אם אורך הטקסט > אורך הרישא 4.3. הוסף לצומת v בן חדש, סמן אותו באינדקס, חבר ביניהם בתו שבא אחרי הרישא ב-.Text 4.3.1. קצץ מתחילת הטקסט את הרישא והאות שבאה אחריה. 4.3.2. הוסף לרשימת הזוגות, זוג שמורכב מהאינדקס של צומת v ומהאות שבאה אחרי הרישא )במידה 4.4. 2 ואין אות כזו, כלומר הגענו לסוף המילה הוסיפו תו '*' (. הניחו שהטקסט המקורי לא יכיל את התו '*'.

1 החיפוש יתבצע על ידי טיול במקביל על הטקסט ועל הצמתים של העץ: מתחילים מהאות הראשונה של הטקסט ומהשורש של העץ. נסמן את האות הנוכחית בטקסט ב- c ואת הצומת הנוכחית בעץ ב- v. מחפשים מבין הבנים של v האם מישהו מחובר אליו באות c. אם כן, מתקדמים אליו )עכשיו הוא ) v ומתקדמים בטקסט לאות הבאה )עכשיו היא c(. חוזרים על התהליך עד שלא ניתן להתקדם מצומת v עם האות c. כשמסיימים, מחזירים את הצומת v )זו הצומת שאליה נחבר בן חדש בשלב 4.3.1(. 2 ראו את השלב האחרון של הדוגמה למטה. הדגמה של ריצת האלגוריתם על הטקסט הבא B: בונים עץ התחלתי ששורשו מסומן ב- 0. מאתחלים את רשימת הזוגות אשר מייצגת את הקידוד להיות רשימה ריקה ואת האינדקס הרץ מאתחלים לאפס. מקדמים את האינדקס ל- 1 )אנו כעת במילה הראשונה שמקודדת(. עוברים על הטקסט. מתחילים עם האות הראשונה בטקסט ועם השורש של העץ. לשורש אין בן שמחובר אליו ב- a ולכן עוצרים את החיפוש כאשר v זה השורש והרישא היא מילה ריקה. מוסיפים לצומת v בן חדש המסומן באינדקס 1, אשר מחובר לאבא שלו באות a. מוסיפים לרשימת הזוגות זוג חדש (a,0) אשר מסמן שהמילה הראשונה מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת המסומנת ב- 0 )מילה ריקה במקרה זה( בתוספת של האות a. מורידים מהטקסט את הרישא ואת האות a וחוזרים עם הטקסט המקוצץ acgacgat לתחילת הלולאה.

מקדמים את האינדקס ל- 2. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ a. הרישא מסתיימת בצומת המסומנת ב- 1. מוסיפים לצומת בן חדש עם אינדקס 2 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא c. מוסיפים לרשימת הזוגות את הזוג (c,1) שמסמן שהמילה השנייה מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 1, כלומר a ותוספת של האות c. מקצצים את המילה, ונשארים עם.gacgat מקדמים את האינדקס ל- 3. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ מילה ריקה. הרישא מסתיימת בצומת המסומנת ב- 0 )בשורש(. מוסיפים לצומת בן חדש עם אינדקס 3 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא g. מוסיפים לרשימת הזוגות את הזוג (g,0) שמסמן שהמילה השלישית מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 0, כלומר מילה ריקה ותוספת של האות g. מקצצים את המילה, ונשארים עם.acgat

מקדמים את האינדקס ל- 4. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ.ac הרישא מסתיימת בצומת המסומנת ב- 2. מוסיפים לצומת בן חדש עם אינדקס 4 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא g. מוסיפים לרשימת הזוגות את הזוג (g,2) שמסמן שהמילה הרביעית מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 2, כלומר ac ותוספת של האות g. מקצצים את המילה, ונשארים עם.at מקדמים את האינדקס ל- 5. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ a. הרישא מסתיימת בצומת המסומנת ב- 1. מוסיפים לצומת בן חדש עם אינדקס 5 ומחברים אותה לצומת האב שלה עם האות שבאה אחרי הרישא t. מוסיפים לרשימת הזוגות את הזוג (t,1) שמסמן שהמילה החמישית מורכבת מהרישא שנמצאת על המסלול מהשורש לצומת 1, כלומר a ותוספת של האות t. מקצצים את המילה, ונשארים עם מילה ריקה. הקידוד הסופי עבור המילה B הינו רשימת הזוגות שאגרנו במשתנה.Pairs כעת נניח שהטקסט B הוא הטקסט הקודם בתוספת של האות a: עושים את חמשת השלבים הקודמים. בשלב השישי, מקדמים את האינדקס ל- 6. מחפשים את הרישא הארוכה ביותר של הטקסט שנמצאת בעץ a. הרישא מסתיימת בצומת המסומנת ב- 1. הרישא הארוכה ביותר מגיעה לסוף הטקסט ואין איך להאריך אותה יותר באות אחת. ולכן, לא עושים יותר שינויים על העץ. כדי לא לאבד אינפורמציה על הקידוד של המילה הזו אנחנו מוסיפים את הזוג (*,1) לרשימת הזוגות. נקודה למחשבה : מה קורה עם נבצע זאת עם הטקסט ** lovesmedoesnotlovemelovesmedoesnotlovemelovesmedoesnotlovemelovesmedoesnotloveme שחזור: ניתן לשחזר את הטקסט המקודד מתוך רשימת הזוגות בלבד בדרך הבאה:

נחזיק מערך. ואינדקס שמתחיל מ 1. האיבר ה 0 במערך יהיה ריק. עבור כל זוג,(i,x) נשרשר למילה שבמקום הi המערך את האות x ונכתוב את התוצאה במקום האינדקס במערך. כמו כן נוסיף את התוצאה לטקסט המשחוזר ונקדם את אינדקס ב 1. )במידה ונתקלנו ב* רק נוסיף את המילה לטקסט המשוחזר( דוגמא:

Question 7 A text with an alphabet consisting of 27 possible characters ('a','b','c',,'z',' ') is coded by Lempel-Ziv into n pairs. What is the length of the longest possible text that was encoded? Give an example when this can happen. What is the length of the shortest possible text that was encoded? Give an example when this can happen. Solution The longest text happens when the first pair is an encoding of length 1, the second pair an encoding of length 2,, the nth pair is an encoding of length n. Therefore, the length of the longest text is 1 + 2 + 3 + + n = n(n+1) = θ(n 2 ) This can happen if the text is all the same character, for example "aaaa " Another way this can happen is the text "aababcabcd " The shortest possible text happens for a completely balanced Trie. If we denote by L the number of levels in a completely balanced Trie, then L log 27 (n): The first level has 27 nodes, the second level has 27 2 nodes,, the L th level has 27 L nodes. So n = 27 + 27 2 + + 27 L L log 27 n Therefore, the shortest text has length 27 + 2 27 2 + + L 27 L = 27 + 2 27 2 + + log 27 (n) 27 log 27 (n) = θ(n log(n)) One way that this can happen is the text "abcd z aaabacad " 2