Mini-Project 4 | CSC/ECE 574 - Computer and Network Security

Mini-Project 4 (Not Finalized)
Due: Fri Apr 22
Points: 100

The goal of this project is to gain a deeper understanding of mobile application privacy and to develop skills for empirical investigation. Specifically, students will use the Amandroid (also called Argus-SAF) static program analysis tool to identify privacy leaks in Android applications. For example, it can identify if privacy sensitive information from the deviceID() source API flows to a network send() sink API. Amandroid will be applied to a small corpus of applications, and a report will classify and describe the findings.

What to submit: This assignment should be submitted to GradeScope. You should include a PDF document that provides your written answers to the questions for each part.

Collaboration: Each student must submit their own solution. That said, students are encouraged to discuss the project with each other. Program analysis tools can sometimes be tricky to get working, and class discussion will facilitate understanding. The only rule is that the “findings” should be unique. One way to ensure uniqueness is for collaborating students to work on different sets of applications. Finally, if you discuss the project or any insights with another student, you must list their name on your solution PDF in a clearly denoted “Acknowledgements” section.

Dataset: From a May 2019 snapshot of the top 500 most popular free apps in the Google Play Store, I’ve taken an approximation of the apps used for the PoliCheck study. These applications were detected to have privacy leaks during dynamic analysis. I then took the smallest 400 apps (because Amandroid is quite slow) and distributed them into 16 buckets using round-robin assignment (each bucket has about the same distribution of file sizes).
I then created .zip files for each bucket: mp4-appset-[0-F].zip. There are 25 apps in each dataset. You should choose the app dataset that corresponds to the first hexadecimal digit of the SHA256 hash of your last name in lowercase. For example, I would choose mp4-appset-1.zip:

% echo "enck" | openssl sha256
1617c29f14346b3f61efc05bbc03d301e02716d73dfc61c8450c1176525695e8
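
The same selection can be reproduced in Python. This is a minimal sketch, not part of the official instructions. The function name `appset_for` is hypothetical. It assumes the `echo` pipeline hashes the name followed by a newline (which `echo` appends by default), and that hex letters in the file name are uppercase, as the `[0-F]` pattern suggests.

```python
import hashlib


def appset_for(last_name: str, trailing_newline: bool = True) -> str:
    """Return the app set file name selected by a last name.

    trailing_newline=True mimics `echo "name" | openssl sha256`, where echo
    appends a newline before the bytes are hashed (an assumption about how
    the handout's example was produced).
    """
    data = last_name.lower() + ("\n" if trailing_newline else "")
    digest = hashlib.sha256(data.encode("utf-8")).hexdigest()
    # Only the first hexadecimal digit of the digest selects the bucket.
    # Uppercasing matches the [0-F] naming pattern (assumed).
    return f"mp4-appset-{digest[0].upper()}.zip"


print(appset_for("enck"))
```

Whichever command you use, the report should state whether the hashed input included the trailing newline, since that changes the digest.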
In your report, specify the command you used and the output. If you decide to collaborate with someone who has a collision (same app dataset), email the TA and instructor for an exception.

Sidenote: I calculated the smallest app by the size of the .apk. It would have been better to use the size of the (potentially multiple) .dex files inside the .apk, as there may be apps with smaller code sizes, but more graphic and audio resources.

Downloading App Datasets: Using your NCSU Google account, you can download the app datasets using the link posted on Moodle for Mini-Project 4.

Environment Setup

Amandroid is a static program analysis tool for Android applications. It works directly on .apk files, so there is no need to have the source code! This is because Android applications are written in Java and compiled into a special bytecode called Dalvik. Researchers have figured out how to retarget Dalvik bytecode back to Java bytecode while retaining most of the semantics. Therefore, it is much easier to perform static program analysis on Dalvik bytecode than on machine code (e.g., an ARM binary). Note that Amandroid cannot analyze native code (e.g., ARM binary libraries) in an Android application.

Amandroid builds a Data Dependence Graph (DDG) by first constructing a control flow graph and calculating points-to information for each instruction. There are different ways the DDG can be used to answer interesting program analysis questions. For this assignment, you will be using the DDG to perform a specific type of program analysis called taint analysis, a form of data-flow analysis. Conceptually, a taint analysis defines a set of taint sources and a set of taint sinks. The taint sources are the code locations that produce information you care about. The taint sinks are the code locations where you care whether that information ends up. The taint analysis then tracks the propagation of the information from a taint source to a taint sink.
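
To make the source/sink idea concrete, here is a toy sketch of taint tracking as reachability over a data dependence graph. It is not how Amandroid is implemented. The graph, the node names, and the function `find_taint_flows` are hypothetical, and the graph is hand-written rather than derived from an app.

```python
from collections import deque


def find_taint_flows(ddg, sources, sinks):
    """Return sorted (source, sink) pairs where the sink depends on the source.

    ddg maps each node to the nodes that consume its data.
    """
    flows = set()
    for src in sources:
        seen = {src}
        queue = deque([src])
        # Breadth-first walk along data dependence edges from this source.
        while queue:
            node = queue.popleft()
            if node in sinks and node != src:
                flows.add((src, node))
            for succ in ddg.get(node, ()):
                if succ not in seen:
                    seen.add(succ)
                    queue.append(succ)
    return sorted(flows)


# Hypothetical toy graph: deviceID() reaches a network send(), location does not.
toy_ddg = {
    "deviceID()": ["id"],
    "id": ["payload"],
    "payload": ["send(payload)"],
    "getLocation()": ["loc"],
    "loc": ["label.setText(loc)"],
    "constant": ["send(constant)"],
}
flows = find_taint_flows(
    toy_ddg,
    sources={"deviceID()", "getLocation()"},
    sinks={"send(payload)", "send(constant)"},
)
print(flows)  # [('deviceID()', 'send(payload)')]
```

Only the device identifier is reported: the location value never reaches a sink, and `send(constant)` is a sink that no source flows into.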
Amandroid allows you to specify taint sources and sinks via a text configuration file, or to define your own custom source and sink manager. For this assignment, you only need to use the text file.

Amandroid is an open source project. Its web page describes how to download, set up, and run the tool. Follow these instructions to set up your environment. We will be using Amandroid as a command-line tool.

Note: pay attention to the command line options. Amandroid has several modules. You only need to worry about taint analysis for privacy sensitive information. Options can be found using the command line tool. For example, java -jar argus-saf_***-version-assembly.jar t will show information about the latest options for taint mode. The following command line options will be useful for this assignment.

- java -Xmx8g -jar ...: The -Xmx option tells Java how much RAM to allocate to the analysis. For example, -Xmx8g specifies 8GB of RAM.
- -mo: You will need to specify the type of taint analysis you wish to perform. For this project, you are interested in the DATA_LEAKAGE analysis.
- -i: You may wish to define a custom configuration file (e.g., to define a custom source and sink list file).
- -o: Specify an output path for this analysis. You will want to use different output paths for each app.
- -a COMPONENT_BASED: This option was added to handle Inter Component Communication (ICC) in a more scalable way. The documentation suggests that if you use this approach, your configuration should turn off resolve_icc. I expect that Amandroid will perform the analysis of each component separately, and then attempt to connect ICC using the program graphs created for each component. Hence, using the -a COMPONENT_BASED option should make the analysis take less RAM and finish faster. As project documentation does not always keep up with code, some experimentation may be needed here.

Note: You may want to write a script to process the output of Amandroid.
Alternatively, you could explore how to extend Amandroid with your own custom output.

Note: You may wish to ignore some of the pre-defined taint sources or sinks. If you are not sure what an API is used for, check the developer documentation for Android. A custom source and sink list file can be specified in the Amandroid configuration file (-i option).

Question 1: High Level Statistics (50 points)

Run Amandroid on the 25 applications in your assigned dataset (you may wish to write a script to batch the execution). Note that Amandroid can consume a large amount of RAM for its analysis, so you will likely need to increase the amount of RAM that Amandroid may use (anticipate allocating 8-16 GB if possible). It can also run for a long time on some apps (most apps should finish in less than 10 minutes, but some may take upwards of an hour, depending on the hardware used for the analysis). Therefore, you may need to make reasonable adjustments. If it fails to run for only a handful of the applications, simply note this in your experimental results. If it does not run for a significant percentage of the applications, please contact the TA and instructor for guidance.

Your answer for this question should describe your experiences running Amandroid. If you wrote a short script to batch the execution, include the script text in the report. Also provide a table that consolidates the high level statistics of the output of Amandroid. For example, you can semantically group taint sources (e.g., geographic location, microphone, device identifiers) and taint sinks (e.g., file, network). Your table can report how many applications have flows from groups of taint sources to groups of taint sinks. Be creative in how you present this information. Accompany the table with a short description offering insights and observations into the reported data.

Note: As stated above, it may be helpful to write a small script to process the output of Amandroid.
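
One possible shape for such scripts is sketched below, under several assumptions. The flags are the ones listed in this handout, but placing the .apk path as the final positional argument is an assumption to check against the tool's `t` help output. A zero exit code is treated as success, which may not hold for every failure mode. The keyword-to-group maps and all function names are hypothetical. Extracting (app, source API, sink API) tuples from Amandroid's output files is left out, since it depends on the output format of the version you run.

```python
import subprocess
from collections import Counter
from pathlib import Path


def build_command(jar, apk, out_dir, heap_gb=8):
    """Build a taint-mode command line from the options described above."""
    return [
        "java", f"-Xmx{heap_gb}g", "-jar", str(jar), "t",
        "-mo", "DATA_LEAKAGE",
        "-a", "COMPONENT_BASED",
        "-o", str(out_dir),
        str(apk),  # assumption: the .apk path is the final positional argument
    ]


def run_batch(jar, apk_dir, out_root, timeout_s=3600):
    """Analyze each .apk once; return {apk name: 'ok' | 'failed' | 'timeout'}."""
    status = {}
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        out_dir = Path(out_root) / apk.stem  # a separate output path per app
        out_dir.mkdir(parents=True, exist_ok=True)
        try:
            proc = subprocess.run(build_command(jar, apk, out_dir), timeout=timeout_s)
            # Assumption: exit code 0 means the analysis completed.
            status[apk.name] = "ok" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            status[apk.name] = "timeout"
    return status


# Hypothetical keyword-to-group maps; adapt them to your source and sink list.
SOURCE_GROUPS = {"Location": "location", "DeviceId": "device identifier",
                 "AudioRecord": "microphone"}
SINK_GROUPS = {"URL": "network", "Http": "network", "File": "file"}


def group_of(api, groups):
    """Map an API signature to a semantic group by case-insensitive keyword match."""
    for keyword, group in groups.items():
        if keyword.lower() in api.lower():
            return group
    return "other"


def count_apps_per_flow(flows):
    """Count distinct apps per (source group, sink group).

    flows is an iterable of (app, source_api, sink_api) tuples.
    """
    per_app = {(app, group_of(src, SOURCE_GROUPS), group_of(snk, SINK_GROUPS))
               for app, src, snk in flows}
    return Counter((src_group, sink_group) for _, src_group, sink_group in per_app)
```

Counting distinct apps rather than individual paths keeps one chatty app from dominating the table. The `status` dictionary from `run_batch` gives the failure and timeout counts that the experience write-up asks for.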
Question 2: Determining Privacy Violations (50 points)

The existence of a data flow from a privacy sensitive source to a network sink does not imply a privacy violation. A privacy violation occurs when the user is not reasonably aware that the flow occurs. That flow may be obvious from the user interface (e.g., location is sent when the user clicks “find my location”) or from the description of the app (e.g., a map application is expected to send location). The flow might also be stated in a privacy policy or EULA shown when the app first loads.

If you find >=5 apps with data flows to network sinks

Choose five applications that Amandroid reported as having data flows from privacy sensitive sources to the network. Your goal for this question is to report, for each of the five applications, why you think the detected flows are or are not privacy violations. How you determine this is open-ended. You may want to run the .apk in an Android emulator. You should strongly consider decompiling or disassembling the .apk to look at the code.

If you find <5 apps with data flows to network sinks

Ultimately, the goal of this question is for you to understand the limitations of program analysis tools. One of these limitations is false positives, which you are asked to investigate above. Another limitation is false negatives, e.g., due to timeouts and tool crashes. If your dataset yields fewer than 5 applications with network taint sinks and you cannot complete the false positive analysis described above, you should:

- Describe your process for identifying the remaining applications (up to 5) that are likely to be false negatives (e.g., the app contains a network send API call or a suspicious ad library).
- Perform a deeper analysis of those applications (e.g., via decompilation) to investigate and report whether or not those applications were false negatives.
Resources

Here are a few Android application decompilers / disassemblers to consider:

- JEB (free trial available)
- Androguard
- baksmali (disassembler)
- JD-GUI (requires using a tool to convert DEX to Java class files, e.g., dex2jar, dare, or ded)

whenck@ncsu.edu
This is the course website for the Spring 2022 offering of CSC/ECE 574, Computer and Network Security, at North Carolina State University.